
What is Multi-Vector Search?

Multi-vector search is a retrieval technique where each document is represented by multiple vectors (one per token or passage) rather than a single fixed-size vector. At query time, the query’s token vectors are compared against all document token vectors, enabling fine-grained token-level matching that captures nuanced relevance signals that single-vector retrieval misses.


Why does multi-vector search matter?

Single-vector retrieval compresses an entire document into one vector, losing fine-grained detail in the process. A query about a specific clause in a legal contract, or a precise technical term in a research paper, may not match well against a document-level summary vector, even if the exact answer is present in the document.

Multi-vector search solves this by preserving token-level representations. The matching happens at the token level, so a specific query term can find its exact counterpart in a long document, even if the overall document is only partially relevant.


How does multi-vector search work?

Instead of pooling token representations into one vector:

  1. Encode document → retain one vector per token: [v₁, v₂, ..., vₙ]
  2. Encode query → retain one vector per token: [q₁, q₂, ..., qₘ]
  3. Score with MaxSim → for each query token, find its maximum similarity across all document tokens, then sum:
     Score(Q, D) = Σᵢ maxⱼ (qᵢ · vⱼ)

This is the ColBERT scoring mechanism. Every query token gets matched to its best corresponding document token, and these scores are summed into a final relevance score.
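The scoring step above can be sketched in a few lines of NumPy. This is a minimal illustration rather than ColBERT's actual implementation: the function name `maxsim_score` and the toy vectors are made up for the example, and it assumes the token vectors are already computed and held in memory.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Sum, over query tokens, of the best dot product against any document token."""
    # sim[i, j] = q_i · v_j, shape (num_query_tokens, num_doc_tokens)
    sim = query_vecs @ doc_vecs.T
    # Take the best-matching document token per query token, then sum
    return float(sim.max(axis=1).sum())

# Toy example: 2 query tokens, 3 document tokens, 2 dimensions
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
score = maxsim_score(query, doc)  # 0.9 + 0.8, i.e. about 1.7
```

Each query token picks its own best document token, so the first query token matches document token 1 and the second matches document token 2.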


Multi-vector vs single-vector vs sparse retrieval

| | Single-vector | Multi-vector (ColBERT) | Sparse (BM25) |
| Vectors per doc | 1 | N (one per token) | Vocab-size sparse |
| Captures semantics | ✓ | ✓ (token-level) | ✗ |
| Handles exact terms | ✗ | ✓ | ✓ |
| Storage cost | Low | High | Medium |
| Retrieval speed | Fastest | Slower | Fast |
| Accuracy | Good | Highest | Good for keywords |

Multi-vector retrieval achieves the highest accuracy but at significant storage cost: a 512-token document produces 512 vectors instead of 1.


What is BGE-M3’s multi-vector capability?

BGE-M3 supports all three retrieval modes (dense, sparse, and multi-vector) from a single model. This means you can produce ColBERT-style multi-vector representations without a separate model:

from sie_sdk import SIEClient
from sie_sdk.types import Item

# Placeholder corpus; replace with your own documents
documents = [
    "Multi-vector search keeps one vector per token.",
    "Single-vector search pools a document into one vector.",
]

client = SIEClient("http://localhost:8080")

# Request dense, sparse, and multi-vector (ColBERT-style) outputs in one call
results = client.encode(
    "BAAI/bge-m3",
    [Item(text=d) for d in documents],
    output_types=["dense", "sparse", "multivector"],
)

dense_vectors = [r["dense"] for r in results]
sparse_vectors = [r["sparse"] for r in results]
colbert_vectors = [r["multivector"] for r in results]  # one [num_tokens, dim] array per doc

You can then combine all three signals for maximum retrieval accuracy, the approach used in BGE-M3’s MIRACL and BEIR benchmark results.
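One simple way to combine the three signals is a weighted sum of the per-mode scores. The sketch below is an assumption about how you might do this, not a description of any library's API: `hybrid_score`, the weights, and the candidate scores are all illustrative, and it assumes each mode's score has already been scaled to a comparable range.

```python
def hybrid_score(dense: float, sparse: float, multivector: float,
                 weights: tuple[float, float, float] = (0.4, 0.2, 0.4)) -> float:
    """Weighted sum of the three per-mode scores for one (query, document) pair."""
    w_dense, w_sparse, w_multi = weights  # illustrative weights, not tuned values
    return w_dense * dense + w_sparse * sparse + w_multi * multivector

# Hypothetical per-mode scores for two candidate documents
candidates = {
    "doc_a": {"dense": 0.62, "sparse": 0.10, "multivector": 0.71},
    "doc_b": {"dense": 0.58, "sparse": 0.45, "multivector": 0.66},
}

# Rank candidates by combined score, best first
ranked = sorted(candidates, key=lambda d: hybrid_score(**candidates[d]), reverse=True)
```

Here `doc_b` ranks first even though `doc_a` has the higher dense and multi-vector scores, because the sparse signal picks up an exact-term match. In practice the weights would be tuned on a validation set.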


When should you use multi-vector search?

Multi-vector retrieval is worth the extra storage and compute when:

  • High-precision retrieval is critical: legal, medical, or compliance document search where missing a relevant clause has real consequences
  • Long documents: single vectors compress too much information out of long texts; token-level matching preserves it
  • Specific term lookup: when queries contain precise technical terms that need exact matching alongside semantic understanding
  • You’re combining with reranking: use multi-vector for first-stage retrieval to maximise recall, then a reranker for precision

For most general-purpose search, single-vector with a reranker achieves comparable quality at lower infrastructure cost.
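The single-vector-plus-reranker pattern can be sketched as a two-stage function: a cheap cosine top-k pass, then a reordering of only those k candidates. This is a minimal sketch under stated assumptions: `retrieve_then_rerank` is a hypothetical name, and `rerank_fn` stands in for whatever reranker you use (here it is just a lookup of made-up scores).

```python
import numpy as np

def retrieve_then_rerank(query_vec, doc_vecs, rerank_fn, k=2):
    """Stage 1: top-k documents by cosine similarity. Stage 2: reorder them with rerank_fn."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    cosine = d @ q
    top_k = np.argsort(-cosine)[:k]  # candidate document indices, best first
    # rerank_fn(doc_index) -> float is a placeholder for a real reranker
    return sorted(top_k.tolist(), key=rerank_fn, reverse=True)

doc_vecs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
query_vec = np.array([1.0, 0.1])

# Hypothetical reranker scores keyed by document index
reranker_scores = {0: 0.3, 1: 0.9, 2: 0.5}
order = retrieve_then_rerank(query_vec, doc_vecs, reranker_scores.get, k=2)
```

The first stage keeps documents 0 and 1; the reranker then promotes document 1 above document 0. Document 2 is never reranked, which is where the cost saving comes from.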


Storage considerations for multi-vector

A 512-token document produces 512 vectors of 128 dimensions each (ColBERT uses smaller per-token dimensions). For 1 million documents:

  • Single-vector (768 dims, float32): ~3 GB
  • Multi-vector ColBERT (512 tokens × 128 dims, float32): ~262 GB

This is why multi-vector is used selectively, often for a high-value subset of your corpus, with single-vector covering the rest.
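The estimates above follow from simple arithmetic on raw vector sizes. The helper below is a rough sketch: `index_size_gb` is a hypothetical name, and it assumes float32 values (4 bytes each), decimal gigabytes, and no index overhead or compression.

```python
def index_size_gb(num_docs: int, vectors_per_doc: int, dims: int,
                  bytes_per_value: int = 4) -> float:
    """Raw vector storage in decimal gigabytes (1 GB = 10**9 bytes), excluding index overhead."""
    return num_docs * vectors_per_doc * dims * bytes_per_value / 1e9

single_gb = index_size_gb(1_000_000, vectors_per_doc=1, dims=768)
multi_gb = index_size_gb(1_000_000, vectors_per_doc=512, dims=128)

print(f"single-vector: {single_gb:.1f} GB, multi-vector: {multi_gb:.1f} GB")
# single-vector: 3.1 GB, multi-vector: 262.1 GB
```

Changing `bytes_per_value` to 2 or 1 shows how much half-precision or 8-bit quantisation would reduce the multi-vector footprint.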

Qdrant and Weaviate both support multi-vector indexing natively.


Frequently asked questions

Is multi-vector search the same as ColBERT?
ColBERT is the most prominent multi-vector retrieval architecture. Multi-vector search is the broader category; ColBERT is one implementation using late interaction (MaxSim scoring).

Can I use multi-vector retrieval with any vector database?
Not all vector databases support multi-vector natively. Qdrant supports it via multi-vectors. Weaviate has ColBERT support. Check your vector DB’s documentation before committing to a multi-vector approach.

Does SIE support multi-vector encoding?
Yes. BGE-M3 on SIE can return ColBERT-style token vectors alongside dense and sparse representations in a single encode call.

