Self-hosted search inference with SIE
Every search and RAG system runs the same two model calls on a loop: embed the text, then rerank the candidates. At low volume you rent both from an API and never think about it. At production volume that loop becomes your largest inference line item, and every query you send out is a query you no longer control.
SIE (the Superlinked Inference Engine) is an open-source, Apache 2.0 inference server that runs the whole retrieval stack on your own infrastructure. Dense and sparse embeddings, multi-vector ColBERT, multimodal search, and cross-encoder reranking all come from one cluster you install in your cloud, behind two primitives: encode and score. This page ties together every search capability, the models behind each one, the migration paths off rented inference, and the source.
In short:
- Cut per-query inference cost. Move high-volume embedding and reranking off metered APIs and onto GPUs you already pay for. Superlinked’s published comparison puts reranking at roughly $8.50 per 1B tokens self-hosted, versus $87 on Cohere Rerank, and embeddings at about $0.50 instead of $20 on a managed API.
- Keep queries and corpora inside your cloud. Search runs over your most sensitive data, and with SIE none of it is sent to a third-party endpoint.
- Replace N single-model containers with one cluster. Dense, sparse, multi-vector, and rerank models share a single deployment instead of one container per model.
Star SIE on GitHub · Read the encode docs · Browse the model catalog
The problem: rented inference for search does not scale cleanly
Search inference is the highest-frequency model traffic most products run. When every embed and every rerank is a metered API call, three problems compound as traffic grows.
Cost. Embedding and reranking are repetitive, predictable, and enormous in volume, which is the worst possible shape for per-token pricing. Reranking is especially punishing because a cross-encoder runs one forward pass per query-candidate pair, so reranking 100 results means paying for 100 scored pairs on every query. The managed-API premium on that work is large and it scales linearly with usage.
Control. Search runs over your private corpus and your users’ raw queries, which is exactly the data customers and regulators care most about. Sending it to a third-party endpoint gives up control of where that data lives and who processes it.
Portability. A retrieval stack pinned to one provider’s API does not move with you into a customer’s cloud or an air-gapped environment, which is where enterprise deployments increasingly need to run.
Self-hosting the small, task-specific models that do this work wins all three back, and the models are good enough that there is little quality reason left to rent. The catalog covers 35 encode and 14 score models, all on open checkpoints.
The cost angle: stop paying per token for embed-and-rerank
The two calls at the heart of search are the two best candidates to bring in-house, because they run constantly and do not need a frontier model.
SIE is built to push those calls through your GPUs at high utilization rather than across a metered boundary:
- One cluster, not N containers. A single SIE instance serves dense, sparse, multi-vector, and rerank models together, loading each on demand and evicting idle ones with LRU. The usual alternative, one Text Embeddings Inference container per model, means N deployments, N health checks, and N autoscalers for the same coverage.
- Full batches per GPU pass. SIE pulls concurrent requests from one shared work queue so mixed sizes pack into full batches. In Superlinked’s benchmarks this reaches about 89% GPU efficiency versus roughly 51% for the route-then-batch pattern, around 1.8x the throughput per GPU at the same latency.
- Storage is a cost lever too. Multi-vector retrieval trades storage for quality, and quantization plus MUVERA give you ways to claw most of that back, so higher-quality retrieval does not force a storage bill to match.
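To make the "load on demand, evict idle models with LRU" behavior concrete, here is a minimal sketch of a least-recently-used model cache. It is an illustration only: `LRUModelCache`, its `loader` callback, and the capacity of two are hypothetical names and values, not SIE's implementation.

```python
from collections import OrderedDict


class LRUModelCache:
    """Toy cache that keeps at most `capacity` models resident (hypothetical, not SIE code)."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader          # called only when a model is not resident
        self._models = OrderedDict()  # insertion order doubles as recency order
        self.evicted = []             # record of evictions, for inspection

    def get(self, model_id):
        if model_id in self._models:
            self._models.move_to_end(model_id)  # mark as most recently used
            return self._models[model_id]
        model = self.loader(model_id)           # load on first use
        self._models[model_id] = model
        if len(self._models) > self.capacity:
            old_id, _ = self._models.popitem(last=False)  # drop least recently used
            self.evicted.append(old_id)
        return model

    def resident(self):
        return list(self._models)


# Stand-in loader: a real one would read weights onto the GPU.
cache = LRUModelCache(capacity=2, loader=lambda model_id: f"<weights for {model_id}>")
cache.get("Qwen/Qwen3-Embedding-0.6B")
cache.get("naver/splade-v3")
cache.get("Qwen/Qwen3-Embedding-0.6B")            # refreshes the dense model
cache.get("mixedbread-ai/mxbai-rerank-large-v2")  # pushes out the idle sparse model
print(cache.resident(), cache.evicted)
```

The frequently used dense model stays resident while the idle sparse model is the one evicted, which is the property that lets one cluster cover a long tail of models without holding all of them in memory.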
The published cost comparison makes the per-token gap concrete: embeddings at about $0.50 per 1B tokens on your own cloud versus $20 on a managed API, and reranking at about $8.50 per 1B tokens versus $87 on Cohere Rerank or $43 on Vertex AI Ranking. The exact figures matter less than the shape: the self-hosted line is flat against hardware, while the rented line tracks your traffic.
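As a rough worked example, the per-1B-token figures quoted above can be turned into a monthly bill. The workload volumes below (20B embedding tokens and 50B rerank tokens per month) are hypothetical, and the self-hosted rate is treated as an effective per-token cost even though the real bill is for hardware.

```python
# Per-1B-token prices in USD, taken from the comparison quoted above.
EMBED_SELF_HOSTED = 0.50
EMBED_MANAGED_API = 20.00
RERANK_SELF_HOSTED = 8.50
RERANK_COHERE = 87.00


def monthly_cost(tokens_per_month, price_per_billion):
    """Cost in USD for a month of traffic at a given per-1B-token price."""
    return tokens_per_month / 1e9 * price_per_billion


# Hypothetical workload, chosen only for illustration.
EMBED_TOKENS = 20e9
RERANK_TOKENS = 50e9

self_hosted = (monthly_cost(EMBED_TOKENS, EMBED_SELF_HOSTED)
               + monthly_cost(RERANK_TOKENS, RERANK_SELF_HOSTED))
rented = (monthly_cost(EMBED_TOKENS, EMBED_MANAGED_API)
          + monthly_cost(RERANK_TOKENS, RERANK_COHERE))

print(f"self-hosted: ${self_hosted:,.2f}  rented: ${rented:,.2f}")
```

Doubling either token volume doubles the rented figure, whereas the self-hosted figure only moves when you add hardware.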
The security and control angle: queries and corpora stay in your cloud
Search is where a product’s private data and its users’ intent meet, so it is the part of the stack where data residency matters most. With SIE, embedding and reranking run on the host that runs the cluster, and neither the corpus nor the live query is sent to an outside provider.
That control extends to operations without weakening the boundary:
- Per-request model choice. Route one language or domain to one model and another to a different model on each request. With single-model servers you build a gateway in front; with SIE the cluster is the gateway.
- Add and change models without sending anything out. New encoders and rerankers are hot-loaded through the Config API or a GitOps workflow, with the gateway waiting for worker acknowledgement before routing traffic.
- Run air-gapped. Model-weight snapshots let the whole retrieval stack run from mirrored registries with no public network access.
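Per-request model choice can be as small as a lookup in your application code. The sketch below is hypothetical: the routing table, the `pick_encoder` helper, and the decision to send English to one model and everything else to a multilingual one are illustrative assumptions, with model identifiers taken from the catalog named on this page.

```python
# Hypothetical routing table: language code -> encoder identifier.
MODEL_BY_LANGUAGE = {
    "en": "Qwen/Qwen3-Embedding-0.6B",
}
# Fallback for every language without an explicit entry.
DEFAULT_MODEL = "intfloat/multilingual-e5-large"


def pick_encoder(language):
    """Return the encoder to pass as the model argument for this request."""
    return MODEL_BY_LANGUAGE.get(language.lower(), DEFAULT_MODEL)


for lang in ("en", "DE", "ja"):
    print(lang, "->", pick_encoder(lang))
```

Vectors from different encoders are not comparable, so a routing scheme like this implies one index (or one named vector field) per model, with queries routed to the same model that embedded the documents they search.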
The same cluster installs in your SaaS cloud or inside a customer’s cloud, so the data story is identical wherever you deploy.
The canonical recipe: encode, then score
Most production search is two-stage retrieval. You retrieve a broad candidate set with embeddings, then rerank the top candidates with a cross-encoder. Both stages are one SIE call.
    from sie_sdk import SIEClient
    from sie_sdk.types import Item

    client = SIEClient("http://localhost:8080")

    # Stage 1: embed the query, search your vector DB for ~100 candidates
    q = client.encode("Qwen/Qwen3-Embedding-0.6B", Item(text="how do refunds work?"), is_query=True)
    # ... retrieve top_100 from your vector database using q["dense"] ...

    # Stage 2: rerank those candidates with a cross-encoder
    result = client.score(
        "mixedbread-ai/mxbai-rerank-large-v2",
        Item(text="how do refunds work?"),
        [Item(id=f"doc-{i}", text=d["text"]) for i, d in enumerate(top_100)],
    )
    top_10 = [entry["item_id"] for entry in result["scores"][:10]]

This lifts precision without reranking the whole corpus. Everything below is a variation on these two primitives.
Capabilities: the retrieval stack, one SDK call each
Every block is the same client.encode(...) or client.score(...) call. Swap the model identifier to swap the architecture; SIE hot-loads the weights on first use.
1. Dense embeddings
Fixed-dimension vectors that capture meaning, for semantic search, RAG, and recommendations. Encode the query side asymmetrically when the model supports it.
Models: Qwen/Qwen3-Embedding-0.6B, intfloat/multilingual-e5-large, google/embeddinggemma-300m, BAAI/bge-m3.
    docs = [Item(id=f"doc-{i}", text=t) for i, t in enumerate(corpus)]
    vectors = client.encode("Qwen/Qwen3-Embedding-0.6B", docs)  # store in your vector DB

    q = client.encode("Qwen/Qwen3-Embedding-0.6B", Item(text="how do refunds work?"), is_query=True)

Docs: Encode overview · Models: catalog
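Once dense vectors are stored, retrieval is a nearest-neighbor lookup. The sketch below shows what that lookup computes, using brute-force cosine similarity over a tiny in-memory index. The three-dimensional vectors and document ids are made up for illustration; a real vector database would use an approximate index rather than a full scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors standing in for stored document embeddings.
index = {
    "doc-0": [1.0, 0.0, 0.0],
    "doc-1": [0.8, 0.6, 0.0],
    "doc-2": [0.0, 1.0, 0.0],
    "doc-3": [0.0, 0.0, 1.0],
}
query = [1.0, 0.1, 0.0]

hits = top_k(query, index, k=2)
print(hits)
```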
2. Sparse and hybrid search
Sparse vectors assign weights to vocabulary tokens, so exact term matching (product codes, proper nouns, acronyms) works alongside semantic search. Run a dense encoder and a sparse model, then combine scores in your vector database. BAAI/bge-m3 can also emit dense and sparse in one call when you want a single-model path.
Models: naver/splade-v3, Qwen/Qwen3-Embedding-0.6B, BAAI/bge-m3, opensearch-project/opensearch-neural-sparse-*.
    # Dense + sparse as two calls, then combine in your vector DB
    dense = client.encode(
        "Qwen/Qwen3-Embedding-0.6B",
        Item(text="ACME-1000 refund policy"),
        is_query=True,
    )
    sparse = client.encode(
        "naver/splade-v3",
        Item(text="ACME-1000 refund policy"),
        is_query=True,
    )
    # Most databases support: final_score = alpha * dense_score + (1 - alpha) * sparse_score

Sparse retrieval is supported natively by Elasticsearch, OpenSearch, Qdrant, Weaviate, Milvus, and Pinecone.
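If your database does not fuse scores for you, the weighted formula is short enough to apply in application code. The sketch below is one possible way to do it, assuming you already have per-document dense and sparse scores. The min-max normalization step is an addition not described on this page: dense cosine scores and sparse dot products live on different scales, so they are rescaled before mixing. The scores and `alpha` value are made up.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(dense, sparse, alpha=0.5):
    """final = alpha * dense + (1 - alpha) * sparse, with missing scores counted as 0."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    doc_ids = set(dense_n) | set(sparse_n)
    return {
        doc_id: alpha * dense_n.get(doc_id, 0.0) + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


# Made-up scores: doc-a is found only by dense search, doc-d only by sparse search.
dense = {"doc-a": 0.82, "doc-b": 0.79, "doc-c": 0.40}
sparse = {"doc-b": 12.0, "doc-c": 3.0, "doc-d": 9.0}

fused = hybrid_scores(dense, sparse, alpha=0.5)
ranked = sorted(fused, key=fused.get, reverse=True)
print(ranked)
```

A document that scores well on both signals rises above one that is strong on only a single signal, which is the behavior hybrid search is meant to deliver for queries that mix exact terms with natural language.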
Docs: Sparse & hybrid search · Models: catalog
3. Multi-vector and ColBERT
Per-token embeddings with late interaction (MaxSim) scoring capture fine-grained term matching that single-vector dense embeddings miss. The SDK ships a maxsim helper, and ColBERT models expand short queries with MASK tokens automatically.
Models: jinaai/jina-colbert-v2, lightonai/GTE-ModernColBERT-v1, answerdotai/answerai-colbert-small-v1, mixedbread-ai/mxbai-colbert-large-v1.
    from sie_sdk.scoring import maxsim

    q = client.encode("jinaai/jina-colbert-v2", Item(text="what is late interaction?"), output_types=["multivector"], is_query=True)
    docs = client.encode("jinaai/jina-colbert-v2", candidate_items, output_types=["multivector"])

    scores = maxsim(q["multivector"], [d["multivector"] for d in docs])

If you would rather use a standard HNSW index, the muvera profile converts multi-vector output to a fixed-dimension dense vector for ColBERT-quality retrieval on databases without multi-vector support, at a documented 5 to 10% quality trade-off.
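For readers new to late interaction, the MaxSim score itself fits in a few lines. The sketch below is an illustrative reimplementation, not the SDK helper: `maxsim_score` is a hypothetical name, and the two-dimensional token vectors are made up. For each query token it takes the best dot product against any document token, then sums those maxima.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, doc_tokens):
    """Sum over query tokens of the maximum similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Made-up per-token embeddings: two query tokens, two tokens per document.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]   # matches the first query token well, the second weakly
doc_b = [[0.9, 0.1], [0.1, 0.9]]   # matches both query tokens fairly well

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because every query token gets its own best match, a document that covers all the query terms reasonably well can outrank one that matches a single term perfectly, which is the fine-grained behavior a single pooled vector tends to lose.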
Docs: Multi-vector & ColBERT · Models: catalog
4. Multimodal embeddings
Encode images and text into a shared space for cross-modal search, so a text query can retrieve images and vice versa. Same encode call, an image-capable model.
Models: google/siglip-so400m-patch14-384, laion/CLIP-ViT-H-14-laion2B-s32B-b79K, openai/clip-vit-large-patch14.
    text_vec = client.encode("google/siglip-so400m-patch14-384", Item(text="a red leather handbag"))
    img_vec = client.encode("google/siglip-so400m-patch14-384", Item(images=[{"data": img_bytes, "format": "jpeg"}]))
    # text_vec and img_vec live in the same space; compare directly

Docs: Multimodal · Models: catalog
5. Quantization
Cut the storage and memory cost of your index by quantizing embeddings, so higher-quality vectors do not force a proportional storage bill. This is the lever that keeps multi-vector and large-dimension models affordable at corpus scale.
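To show where the savings come from, here is a minimal sketch of two common schemes: symmetric int8 scalar quantization and 1-bit binary quantization. This illustrates the general technique under simple assumptions and is not a description of SIE's quantization options. The helper names, the sample vector, and the 1024-dimension figure are hypothetical.

```python
def quantize_int8(vec):
    """Map floats to integer codes in [-127, 127] using one scale per vector."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(x / scale) for x in vec]
    return codes, scale


def dequantize(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


def quantize_binary(vec):
    """Keep only the sign of each dimension: 1 bit per dimension."""
    return [1 if x > 0 else 0 for x in vec]


def bytes_per_vector(dims, bits_per_dim):
    """Raw storage for one vector, ignoring index overhead."""
    return dims * bits_per_dim // 8


vec = [0.12, -0.50, 0.33, 0.0]
codes, scale = quantize_int8(vec)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vec, restored))
print(codes, quantize_binary(vec), max_error)

# Storage for a hypothetical 1024-dimension embedding.
for name, bits in (("float32", 32), ("int8", 8), ("binary", 1)):
    print(name, bytes_per_vector(1024, bits), "bytes")
```

The reconstruction error of the int8 scheme is bounded by half the scale step, while storage drops 4x against float32. Binary codes drop it 32x at a larger accuracy cost, which is why they are often paired with a rescoring pass over full-precision vectors.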
Docs: Quantization
6. Reranking
Cross-encoders see the query and document together in one forward pass, which is more accurate than comparing embeddings independently. This is stage two of the canonical recipe, and the highest-leverage quality win in most pipelines.
Models: mixedbread-ai/mxbai-rerank-large-v2, jinaai/jina-reranker-v2-base-multilingual, BAAI/bge-reranker-v2-m3, plus ColBERT multi-vector reranking.
    query = Item(text="how do refunds work?")
    result = client.score("mixedbread-ai/mxbai-rerank-large-v2", query, candidate_chunks)
    top_ids = [entry["item_id"] for entry in result["scores"][:10]]

Docs: Score overview · Reranker models · Models: catalog
Search models, all in one cluster
Filter the full catalog by the encode and score tasks, or by output type. Every model below is served by the same SIE instance and loaded on demand.
| Model | Stage | Output | Best for |
|---|---|---|---|
| Qwen/Qwen3-Embedding-0.6B | Encode | Dense | High throughput, small footprint |
| intfloat/multilingual-e5-large | Encode | Dense | Multilingual dense retrieval |
| google/embeddinggemma-300m | Encode | Dense | Fast, lightweight general-purpose |
| naver/splade-v3 | Encode | Sparse | Purpose-built sparse retrieval (SPLADE) |
| BAAI/bge-m3 | Encode | Dense / Sparse / Multi-Vec | Dense, sparse, and multi-vector in one call |
| jinaai/jina-colbert-v2 | Encode | Multi-Vec | Long-context ColBERT late interaction (8192) |
| lightonai/GTE-ModernColBERT-v1 | Encode | Multi-Vec | ModernBERT late interaction, long context |
| answerdotai/answerai-colbert-small-v1 | Encode | Multi-Vec | Smallest, fastest ColBERT |
| google/siglip-so400m-patch14-384 | Encode | Dense (multimodal) | Text and image in a shared space |
| mixedbread-ai/mxbai-rerank-large-v2 | Score | Score | English cross-encoder reranking |
| jinaai/jina-reranker-v2-base-multilingual | Score | Score | Multilingual reranking |
| Alibaba-NLP/gte-reranker-modernbert-base | Score | Score | Low-latency ModernBERT reranker |
| BAAI/bge-reranker-v2-m3 | Score | Score | Multilingual cross-encoder reranking |
See Choosing a Model for selection guidance and Evals for how SIE measures retrieval quality. The retrieval benchmark example compares full pipelines head to head.
Drop into your existing stack
SIE is the inference layer, not the database or the framework, so it sits behind the tools you already use.
Vector databases: Chroma, Qdrant, Weaviate, and LanceDB. Their founders describe pairing the database with SIE so indexing, scoring, filtering, and ranking models all run in one self-hosted cluster.
Frameworks: LangChain, LlamaIndex, Haystack, and DSPy, each with a retriever and reranker component backed by SIE.
There is also an always-on OpenAI-compatible /v1/embeddings endpoint, so existing embedding clients point at SIE with a URL change.
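A request to an OpenAI-compatible embeddings endpoint has a small, well-known shape: a JSON body with `model` and `input`, posted to `/v1/embeddings`. The sketch below builds such a request with the standard library and does not send it. The `build_embeddings_request` helper is hypothetical, and the base URL, the placeholder API key, and the assumption that the endpoint accepts this model identifier are illustrative; check the integration docs for what your deployment expects.

```python
import json
import urllib.request


def build_embeddings_request(base_url, model, texts, api_key="unused"):
    """Build (but do not send) an OpenAI-style POST to /v1/embeddings."""
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # placeholder; auth needs are an assumption
        },
        method="POST",
    )


req = build_embeddings_request(
    "http://localhost:8080",
    "Qwen/Qwen3-Embedding-0.6B",
    ["how do refunds work?"],
)
print(req.get_method(), req.full_url)

# To send it against a running server, the OpenAI response format keeps
# one vector per input under data[i]["embedding"]:
# with urllib.request.urlopen(req) as resp:
#     vectors = [row["embedding"] for row in json.load(resp)["data"]]
```

An existing client built on an OpenAI SDK would make the same call by changing only its base URL, which is the migration path the paragraph above describes.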
Docs: Integrations overview · VDB comparison
Migrating off rented search inference
The headline migration is N single-model containers to one cluster. If you run several Text Embeddings Inference containers, SIE serves the same checkpoints from one process, selects the model per request, and exposes typed dense, sparse, and multivector outputs in one call instead of separate endpoints. Staying on the same checkpoint means no re-embedding: the cosine similarity between vectors from TEI’s backend and SIE’s PyTorch backend sits at or above 0.999, so the drift is well below any retrieval-quality threshold.
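Before reusing an existing index, you can verify the same-checkpoint claim on your own data by embedding a sample of texts with both backends and checking the worst-case cosine similarity. The sketch below assumes you have already collected the paired vectors. The `min_cosine` helper and the sample vectors are hypothetical, and the 0.999 threshold is the figure quoted above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def min_cosine(old_vectors, new_vectors):
    """Worst-case similarity between paired embeddings of the same texts."""
    if len(old_vectors) != len(new_vectors):
        raise ValueError("both backends must embed the same sample of texts")
    return min(cosine(o, n) for o, n in zip(old_vectors, new_vectors))


# Made-up vectors standing in for the same two texts embedded by each backend.
old_backend = [[0.10, 0.20, 0.30], [0.50, -0.40, 0.10]]
new_backend = [[0.1001, 0.1999, 0.3002], [0.4998, -0.4001, 0.1003]]

worst = min_cosine(old_backend, new_backend)
can_reuse_index = worst >= 0.999
print(worst, can_reuse_index)
```

Checking the minimum rather than the mean is the conservative choice, since a single badly drifted vector is what would surface as a retrieval regression.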
The same before-and-after pattern is documented for every common source:
- TEI to SIE — the headline N-containers-to-one-cluster path
- Cohere to SIE — self-host reranking off the Rerank API
- OpenAI to SIE — replace metered embeddings
- Infinity to SIE and Fastembed to SIE
- Modal to SIE
When to keep what you have: for two or three pinned models at high QPS, single-model containers behind an ingress are simpler. SIE earns its place once you have several models in active use, a long tail of sometimes-used rerankers or language variants, or mixed modalities in one request path.
Runnable examples
- Self-hosted product search in 5 min — an Amazon-style product search engine on a laptop, using all three primitives through three SDK calls.
- Find the best retrieval strategy for your RAG — a head-to-head ablation across 7 encoder, reranker, and multi-vector pipelines on 1,854 SEC 10-K queries, ranked by NDCG@10.
- Find SOTA embedding models by MTEB task — describe a task in plain language and search ~14K Hugging Face embedding models by task-specific MTEB score.
- Build a multi-modal product classifier with embeddings — NLI, text retrieval, image retrieval, and cross-encoder reranking evaluated on a real product taxonomy.
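For reference, the NDCG@10 metric used to rank pipelines in the retrieval benchmark can be sketched as follows. This is a generic formulation with linear gain, offered as an assumption about the metric's standard definition rather than a copy of the benchmark's evaluation code. The relevance grades are made up.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """DCG of the returned ranking divided by the DCG of the ideal ranking.

    ranked_relevances: relevance grade of each returned result, in rank order.
    judged_relevances: grades of all judged documents for the query, in any order.
    """
    ideal = dcg(sorted(judged_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Made-up grades: the pipeline returned a highly relevant hit first,
# then an irrelevant one ahead of two partially relevant ones.
judged = [3, 2, 1, 0]
returned = [3, 0, 2, 1]

score = ndcg_at_k(returned, judged, k=10)
print(round(score, 4))
```

A perfect ordering scores 1.0, and relevant results pushed down the list are penalized logarithmically, so the metric rewards exactly what a reranking stage is supposed to fix.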
Deployment: the same layer in any cloud
SIE installs as a Kubernetes inference cluster inside the environment your application already runs in. The same Docker image runs on a laptop and in production.
    # Pull and serve on your own box
    docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default

    # Or deploy the cluster to your VPC
    helm install sie oci://ghcr.io/superlinked/charts/sie-cluster

    # Use it from anywhere with the SDK
    pip install sie-sdk

The project provides a Helm chart for EKS and GKE, Terraform modules for AWS, GCP, and Azure, and model-weight snapshots for air-gapped environments.
Docs: Deployment · Air-gapped
Search resource hub
Docs: Encode (dense) · Sparse & hybrid · Multi-vector & ColBERT · Multimodal · Quantization · Score (reranking) · Reranker models
Examples: Product search · Retrieval benchmark · MTEB model search · Taxonomy classification · All examples
Integrations: Chroma · Qdrant · Weaviate · LanceDB · LangChain · LlamaIndex
Migrate: TEI · Cohere · OpenAI · Infinity · Fastembed
Model catalog: Encode models · Score models · Full catalog
GitHub & SDK: superlinked/sie · Quickstart · Python SDK reference
FAQ
How does self-hosting search inference cut cost?
Embedding and reranking are high-frequency, repetitive calls, which is the worst fit for per-token pricing. Moving them to GPUs you control turns a usage-linked bill into a fixed hardware cost. SIE serves many models from one cluster, keeps frequent ones resident, evicts idle ones with LRU, and batches concurrent requests for high GPU efficiency. Superlinked’s published comparison puts self-hosted reranking near $8.50 per 1B tokens against $87 on Cohere Rerank, and embeddings near $0.50 against $20 on a managed API.
Can I run hybrid search (dense plus sparse) self-hosted?
Yes. Encode with a dense model such as Qwen/Qwen3-Embedding-0.6B and a sparse model such as naver/splade-v3, then combine them in your vector database with a weighted score. BAAI/bge-m3 can also return dense and sparse vectors in a single encode call. SPLADE and OpenSearch neural sparse models are also available. Sparse retrieval is supported natively by Elasticsearch, OpenSearch, Qdrant, Weaviate, Milvus, and Pinecone.
Is SIE a self-hosted alternative to TEI or the Cohere and OpenAI APIs?
Yes. SIE replaces N single-model TEI containers with one cluster that selects the model per request and exposes typed dense, sparse, and multivector outputs. Cohere reranking and OpenAI embeddings have documented migration paths, and an OpenAI-compatible /v1/embeddings endpoint means existing clients move with a URL change. Staying on the same checkpoint needs no re-embedding.
Does SIE support ColBERT and late-interaction retrieval?
Yes. Multi-vector models such as jina-colbert-v2 and GTE-ModernColBERT produce per-token embeddings, and the SDK includes a maxsim helper for late-interaction scoring. The muvera profile converts multi-vector output to a fixed-dimension dense vector so you can get ColBERT-quality retrieval on a standard HNSW index, at a documented 5 to 10% quality trade-off.
Do my queries or corpus leave my infrastructure?
No. Embedding and reranking run on the host that runs the cluster, with nothing sent to a third-party endpoint. SIE installs in your SaaS cloud or a customer’s cloud, supports air-gapped deployment from mirrored registries, and lets you add or change models through the Config API without external calls.
The takeaway
Run the search models behind your product on your own terms. Get started on GitHub or read the docs.