Find the best retrieval strategy for your RAG

RAG quality lives or dies by the retrieval step. Most teams pick a retrieval pipeline by feel, by averaged leaderboard scores, or by what is already in the stack. None of those tell you how each strategy performs on your data, so the surprises show up in production.

The cure is straightforward to describe and historically painful to run: define a representative benchmark, evaluate every reasonable retrieval strategy on it, and pick by a single metric. The painful part has always been infrastructure. Most teams give up after wiring two or three different model serving stacks.

This example shows what that workflow looks like when one inference cluster can serve every retrieval, reranking, and multi-vector model the experiment needs. The result is the recipe below, the methodology that produced it, and the numbers that ruled out every alternative.

Six bank 10-K filings from SEC EDGAR, 1,854 real questions, 2,942 pages, eight retrieval strategies, ranked by NDCG@10.

Dual multi-vector retrieval, then cross-encoder rerank.

  1. Encode queries and pages with two complementary multi-vector models, BAAI/bge-m3 (1024d) and jinaai/jina-colbert-v2 (128d).
  2. Rerank the union of both candidate pools with mixedbread-ai/mxbai-rerank-large-v2.

On the same 1,854 queries, this pipeline reaches NDCG@10 = 0.621 and Recall@10 = 0.665. Its NDCG@10 is 57% higher than a single dense model (0.396) and more than 3x that of BM25 alone (0.185).

Built by @NirantK for Superlinked.

import asyncio

from sie_sdk import SIEAsyncClient


async def main():
    async with SIEAsyncClient("http://your-sie-endpoint:8080", api_key="SL-...") as sie:
        # Multi-vector encode with two complementary models
        mv_bge = await sie.encode("BAAI/bge-m3", [{"text": "quarterly revenue"}],
                                  output_types=["multivector"])
        mv_jina = await sie.encode("jinaai/jina-colbert-v2", [{"text": "quarterly revenue"}],
                                   output_types=["multivector"])
        # Rerank the unioned candidates (two sample passages here) with a cross-encoder
        result = await sie.score("mixedbread-ai/mxbai-rerank-large-v2",
                                 query={"text": "quarterly revenue"},
                                 items=[{"text": "Revenue was $50B..."},
                                        {"text": "The board met on Tuesday..."}])


asyncio.run(main())

Three model families, one endpoint, no container orchestration.
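The union step that sits between encoding and reranking is not shown above. A minimal sketch, assuming each multi-vector retriever returns a ranked list of page ids; the `union_pools` helper and the sample ids are hypothetical and not part of sie_sdk:

```python
def union_pools(*pools: list[str]) -> list[str]:
    """Merge ranked candidate lists, keeping the first occurrence of each page id."""
    seen: set[str] = set()
    merged: list[str] = []
    for pool in pools:
        for page_id in pool:
            if page_id not in seen:
                seen.add(page_id)
                merged.append(page_id)
    return merged


# Hypothetical top-k page ids from each multi-vector retriever
bge_pool = ["p12", "p07", "p33"]
jina_pool = ["p07", "p41", "p12", "p02"]

# Deduplicated candidates that would be handed to the cross-encoder
candidates = union_pools(bge_pool, jina_pool)
print(candidates)
```

Order within the merged pool matters little here, because the cross-encoder rescores every candidate; what matters is that each relevant page appears at least once.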

uv sync
# Validate config (no GPU needed)
uv run python benchmark_ablation.py --dry-run
# Full benchmark across all 7 models and all 1,854 queries
uv run python benchmark_ablation.py --gpu l4-spot

All expensive operations (encoding, search) cache to cache/ablation/. Re-runs skip completed steps. Cross-encoder reranking checkpoints every 100 queries for crash recovery.
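The checkpointing behaviour described above can be sketched as follows. This is an illustration under assumptions, not the benchmark's actual code: the function name, the JSON file layout, and the `score_fn` callback are all hypothetical.

```python
import json
from pathlib import Path

CHECKPOINT_EVERY = 100  # interval stated in the text above


def rerank_with_checkpoints(query_ids, score_fn, checkpoint_path, every=CHECKPOINT_EVERY):
    """Score each query once, saving results to disk every `every` new queries."""
    path = Path(checkpoint_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Resume from a previous run if a checkpoint file already exists
    results = json.loads(path.read_text()) if path.exists() else {}
    new_since_save = 0
    for qid in query_ids:
        if qid in results:  # completed in an earlier run, skip
            continue
        results[qid] = score_fn(qid)
        new_since_save += 1
        if new_since_save >= every:
            path.write_text(json.dumps(results))
            new_since_save = 0
    path.write_text(json.dumps(results))  # final flush
    return results
```

With this shape, a crash loses at most `every - 1` scored queries, and a re-run only pays for the queries that are missing from the checkpoint file.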

vidore_v3_finance_en: six bank 10-K filings from SEC EDGAR.

| Property | Value |
| --- | --- |
| Pages | 2,942 |
| Queries | 1,854 |
| Relevance judgments | 8,766 (1=relevant, 2=highly relevant) |
| Avg relevant docs per query | 4.7 |
| Median page text | 3,809 chars (~950 tokens) |

Six conditions isolate the contribution of each pipeline stage. Conditions 4 and 5 rerank the same hybrid pool; only the reranker changes. bge-m3 plays three different roles to isolate representation type from retrieval strategy.

| # | Condition | Retriever | Reranker | What it isolates |
| --- | --- | --- | --- | --- |
| 1 | BM25-only | Turbopuffer FTS | none | Keyword baseline |
| 2 | Vector-only | bge-m3 dense ANN | none | Semantic baseline |
| 3 | RRF | Fused (k=60) | none | Does hybrid beat vector? |
| 4 | CE rerank | Hybrid pool | mxbai-rerank, bge-reranker | Cross-encoder value |
| 5 | MV rerank | Hybrid pool | 5 ColBERT models | MV vs cross-encoder |
| 6 | MV direct | Brute-force MaxSim | 5 ColBERT models | MV as standalone retriever |
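Condition 3 fuses the BM25 and vector rankings with reciprocal rank fusion. A minimal sketch of the standard formula, where each list contributes 1 / (k + rank) to a document's score, assuming 1-indexed ranks and the k=60 from the table; the function name and sample ids are hypothetical, and the benchmark's own implementation may differ in detail:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Hypothetical top-3 results from each retriever
bm25_top = ["p3", "p1", "p9"]
dense_top = ["p1", "p4", "p3"]

fused = reciprocal_rank_fusion([bm25_top, dense_top])
print(fused)
```

Because RRF only looks at ranks, a weak list gets the same vote as a strong one, so a much weaker retriever can pull the fused ranking down.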

Metrics: NDCG@10, MRR@10, Recall@10. All computed against the official qrels.
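For reference, the three metrics can be sketched for a single query as below. This assumes linear gain (the relevance grade itself) and a log2 rank discount for NDCG; the official evaluation tooling may use a different gain convention, and the function names and sample data are illustrative only.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked, qrels, k=10):
    gains = [qrels.get(d, 0) for d in ranked[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def mrr_at_k(ranked, qrels, k=10):
    """Reciprocal rank of the first relevant page within the top k."""
    for rank, d in enumerate(ranked[:k], start=1):
        if qrels.get(d, 0) > 0:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked, qrels, k=10):
    relevant = {d for d, grade in qrels.items() if grade > 0}
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranked[:k])) / len(relevant)


# Hypothetical judgments (2 = highly relevant, 1 = relevant) and a ranking
qrels = {"p1": 2, "p4": 1}
ranked = ["p9", "p1", "p4"]
print(ndcg_at_k(ranked, qrels), mrr_at_k(ranked, qrels), recall_at_k(ranked, qrels))
```

The reported scores would then be the mean of each metric over all 1,854 queries.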

Ranked by NDCG@10. Larger candidate pools (TOP_K=50) consistently improved CE reranking.

| Strategy | Models | NDCG@10 | Recall@10 |
| --- | --- | --- | --- |
| Dual-MV pool, then CE rerank | bge-m3 + jina-colbert, then mxbai-large | 0.621 | 0.665 |
| MV-bge200 pool, then CE rerank | bge-m3 MV pool, then mxbai-large | 0.613 | 0.656 |
| CE rerank | mxbai-rerank-large alone | 0.600 | 0.640 |
| MV direct | bge-m3 (1024d) | 0.435 | 0.482 |
| MV rerank | jina-colbert-v2 (128d) | 0.431 | 0.494 |
| Vector | bge-m3 dense | 0.396 | 0.438 |
| RRF (BM25 + Vector) | n/a | 0.358 | 0.434 |
| BM25 | Turbopuffer FTS | 0.185 | 0.239 |

The full 15-row primary ablation, the reranker sweep, and the pool-composition experiments live in RESULTS.md.

  1. Cross-encoder reranking wins: mxbai-large with the dual-MV pool (0.621) beats every retriever-only setup. The CE step is a bigger lever than picking a different retriever.
  2. Pool recall is the real bottleneck: CE scores 0.69 within-pool. The dual-MV pool reaches 0.92 recall and tops the table; a hybrid-50 pool gets 0.77 recall and trails.
  3. Two MV models beat one: combining bge-m3 and jina-colbert-v2 pools beats either alone. Model diversity adds recall.
  4. jina-colbert-v2 is the cost / quality sweet spot: 96% of bge-m3 MV quality at 12.5% storage (128d vs 1024d).
  5. RRF hurts on this dataset: BM25 dilutes the strong vector signal. The hybrid-fusion baseline scores worse than vector alone.
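Findings 2 and 3 both rest on pool recall: a reranker can only promote pages that made it into the candidate pool. A small sketch of how pool recall might be measured, using made-up judgments and pools; the function name and data are hypothetical:

```python
def mean_pool_recall(pools: dict[str, list[str]], qrels: dict[str, dict[str, int]]) -> float:
    """Average, over queries, of the share of relevant pages present in the pool."""
    recalls = []
    for qid, judged in qrels.items():
        relevant = {d for d, grade in judged.items() if grade > 0}
        if not relevant:
            continue
        found = relevant & set(pools.get(qid, []))
        recalls.append(len(found) / len(relevant))
    return sum(recalls) / len(recalls) if recalls else 0.0


# Hypothetical judgments and candidate pools from two retrievers
qrels = {"q1": {"p1": 2, "p2": 1}, "q2": {"p5": 1}}
pool_a = {"q1": ["p1", "p8"], "q2": ["p9"]}
pool_b = {"q1": ["p2", "p7"], "q2": ["p5"]}
# Union of both pools per query, order-preserving and deduplicated
union = {q: list(dict.fromkeys(pool_a[q] + pool_b[q])) for q in qrels}

for name, pools in [("A", pool_a), ("B", pool_b), ("A+B", union)]:
    print(name, mean_pool_recall(pools, qrels))
```

In this toy data each retriever misses pages the other finds, so the union covers more relevant pages than either pool alone, which is the effect the dual-MV pool relies on.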

  • Best quality: dual MV pool, then mxbai-rerank-large CE rerank (NDCG@10 = 0.621).
  • No GPU at inference: MV direct with bge-m3 (NDCG@10 = 0.435). Pre-encode offline, score with MaxSim on CPU.
  • Best cost / quality: jina-colbert-v2 multi-vector (NDCG@10 = 0.431). 8x less storage than bge-m3 MV at near-identical quality.
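MaxSim, the late-interaction score used for multi-vector models, takes each query token vector, finds its best-matching page token vector, and sums those maxima. A pure-Python sketch with toy 2-d vectors follows; real embeddings would be 128d or 1024d and the scoring would normally be vectorized. The function names and data are illustrative only.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Sum over query tokens of the highest dot product with any doc token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings: 2 query tokens, 2 tokens per page
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "doc_a": [[1.0, 0.0], [0.5, 0.5]],
    "doc_b": [[0.0, 1.0], [0.0, 1.0]],
}

# Brute-force retrieval: score every page, sort by MaxSim
ranked = sorted(pages, key=lambda pid: maxsim(query, pages[pid]), reverse=True)
print(ranked)
```

Page vectors can be computed once offline, so query-time work is one query encode plus these dot products, which is why this option needs no GPU at inference.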

| Service | What for | Get one at |
| --- | --- | --- |
| SIE | Encoding, scoring, multi-vector | Self-hosted (deploy guide) or contact the team |
| Turbopuffer | BM25 + vector search index (free tier covers this dataset) | turbopuffer.com |
| HuggingFace | Dataset download (free, cached) | huggingface.co/settings/tokens |

Create .env in this directory:

SIE_BASE_URL=http://your-sie-endpoint:8080
# Optional: only needed for managed/auth-enabled SIE clusters.
SIE_API_KEY=
TURBOPUFFER_API_KEY=tpuf_...

By Nirant Kasliwal.
