Model catalog

Every model we serve, one primitive at a time. Pick a benchmark to rank by quality; the hardware control drives latency, throughput and cost.

| Model | Tags | Size | Quality (nDCG@10) | Latency | Throughput | Cost ($/1M) |
| --- | --- | --- | --- | --- | --- | --- |
| Alibaba-NLP/gte-Qwen2-7B-instruct | Long context, Dense | 7.6B | 0.4040 | 846 ms | 3.5K tok/s | $0.063 |
| GritLM/GritLM-7B | Dense | 7.2B | 0.3972 | 2.1 s | 1.4K tok/s | $0.157 |
| Linq-AI-Research/Linq-Embed-Mistral | Long context, Dense | 7.1B | 0.4066 | 818 ms | 2.9K tok/s | $0.075 |
| Salesforce/SFR-Embedding-2_R | Long context, Dense | 7.1B | 0.4285 | 682 ms | 2.9K tok/s | $0.076 |
| Salesforce/SFR-Embedding-Mistral | Dense | 7.1B | 0.4085 | 888 ms | 3.0K tok/s | $0.075 |
| intfloat/e5-mistral-7b-instruct | Dense | 7.1B | 0.3932 | 915 ms | 3.0K tok/s | $0.074 |
| vidore/colqwen2.5-v0.2 | Multimodal, Multi-vector | 7.0B | — | 1.9 s | 7.6 mpix/s | — |
| nvidia/llama-nemoretriever-colembed-3b-v1 | Multimodal, Multilingual, Long context, Multi-vector | 4.4B | — | 6.1 s | 0.7 img/s | — |
| Qwen/Qwen3-Embedding-4B | Long context, Dense | 4.0B | 0.4103 | 464 ms | 5.7K tok/s | $0.039 |
| vidore/colpali-v1.3-hf | Multimodal, Multi-vector | 3.0B | — | 582 ms | 23.0 mpix/s | — |
| Qwen/Qwen3-VL-Embedding-2B | Multimodal, Long context, Dense | 2.1B | — | 36 ms | 494 tok/s | $0.450 |
| Alibaba-NLP/gte-Qwen2-1.5B-instruct | Long context, Dense | 1.8B | 0.2547 | 261 ms | 12.3K tok/s | $0.018 |
| NovaSearch/stella_en_1.5B_v5 | Dense | 1.5B | 0.4219 | 258 ms | 12.8K tok/s | $0.017 |
| laion/CLIP-ViT-H-14-laion2B-s32B-b79K | Multimodal, Dense | 986M | — | 353 ms | 438 tok/s | $0.508 |
| google/siglip-so400m-patch14-384 | Multimodal, Dense | 878M | — | 347 ms | 451 tok/s | $0.493 |
| google/siglip-so400m-patch14-224 | Multimodal, Dense | 877M | — | 284 ms | 456 tok/s | $0.487 |
| Qwen/Qwen3-Embedding-0.6B | Long context, Dense | 596M | 0.3689 | 157 ms | 20.6K tok/s | $0.011 |
| BAAI/bge-m3 | Long context, Dense, Sparse, Multi-vector | 568M | 0.3144 | 93 ms | 33.2K tok/s | $0.0067 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | Multilingual, Long context, Dense | 568M | 0.3519 | — | — | — |
| intfloat/multilingual-e5-large | Multilingual, Dense | 560M | 0.3063 | 109 ms | 29.8K tok/s | $0.0074 |
| intfloat/multilingual-e5-large-instruct | Multilingual, Dense | 560M | 0.3521 | 107 ms | 29.4K tok/s | $0.0076 |
| jinaai/jina-colbert-v2 | Multilingual, Long context, Multi-vector | 559M | 0.3583 | 106 ms | 28.5K tok/s | $0.0078 |
| nomic-ai/nomic-embed-text-v2-moe | Multilingual, Dense | 475M | — | 150 ms | 13.0K tok/s | $0.017 |
| NovaSearch/stella_en_400M_v5 | Dense | 435M | 0.4125 | 116 ms | 27.1K tok/s | $0.0082 |
| openai/clip-vit-large-patch14 | Multimodal, Dense | 428M | — | 228 ms | 977 tok/s | $0.227 |
| google/siglip2-base-patch16-224 | Multimodal, Dense | 375M | — | 69 ms | 1.6K tok/s | $0.140 |
| mixedbread-ai/mxbai-colbert-large-v1 | Multi-vector | 335M | 0.3467 | 75 ms | 43.3K tok/s | $0.0051 |
| intfloat/e5-large-v2 | Dense | 335M | 0.3715 | 87 ms | 33.2K tok/s | $0.0067 |
| mixedbread-ai/mxbai-embed-large-v1 | Dense | 335M | 0.3865 | — | — | — |
| Alibaba-NLP/gte-multilingual-base | Multilingual, Long context, Dense | 305M | 0.3677 | 57 ms | 55.1K tok/s | $0.0040 |
| Snowflake/snowflake-arctic-embed-m-v2.0 | Multilingual, Long context, Dense | 305M | 0.2489 | — | — | — |
| google/embeddinggemma-300m | Dense | 303M | 0.2619 | 87 ms | 27.2K tok/s | $0.0082 |
| Marqo/marqo-fashionSigLIP | Multimodal, Dense | 203M | — | — | — | — |
| laion/CLIP-ViT-B-32-laion2B-s34B-b79K | Multimodal, Dense | 151M | — | 219 ms | 1.0K tok/s | $0.218 |
| openai/clip-vit-base-patch32 | Multimodal, Dense | 151M | — | 234 ms | 958 tok/s | $0.232 |
| lightonai/GTE-ModernColBERT-v1 | Long context, Multi-vector | 149M | 0.3618 | 104 ms | 28.0K tok/s | $0.0079 |
| lightonai/Reason-ModernColBERT | Long context, Multi-vector | 149M | 0.3580 | 82 ms | 33.0K tok/s | $0.0067 |
| Alibaba-NLP/gte-modernbert-base | Long context, Dense | 149M | 0.3664 | — | — | — |
| ibm-granite/granite-embedding-english-r2 | Long context, Dense | 149M | 0.3450 | — | — | — |
| nomic-ai/modernbert-embed-base | Long context, Dense | 149M | 0.3337 | — | — | — |
| opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte | Sparse | 137M | 0.3524 | 94 ms | 34.2K tok/s | $0.0065 |
| opensearch-project/opensearch-neural-sparse-encoding-v1 | Sparse | 133M | 0.3600 | 69 ms | 48.7K tok/s | $0.0046 |
| naver/splade-cocondenser-selfdistil | Sparse | 110M | 0.3403 | 72 ms | 40.0K tok/s | $0.0056 |
| naver/splade-v3 | Sparse | 110M | 0.3404 | 84 ms | 29.6K tok/s | $0.0075 |
| prithivida/Splade_PP_en_v2 | Sparse | 110M | 0.3161 | 55 ms | 57.5K tok/s | $0.0039 |
| colbert-ir/colbertv2.0 | Multi-vector | 110M | 0.2647 | 66 ms | 43.0K tok/s | $0.0052 |
| intfloat/e5-base-v2 | Dense | 109M | 0.3541 | 58 ms | 53.2K tok/s | $0.0042 |
| opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill | Sparse | 67M | 0.3396 | 63 ms | 49.1K tok/s | $0.0045 |
| opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill | Sparse | 67M | 0.3294 | 61 ms | 50.1K tok/s | $0.0044 |
| opensearch-project/opensearch-neural-sparse-encoding-v2-distill | Sparse | 67M | 0.3373 | 63 ms | 44.2K tok/s | $0.0050 |
| ibm-granite/granite-embedding-small-english-r2 | Long context, Dense | 48M | 0.3016 | — | — | — |
| answerdotai/answerai-colbert-small-v1 | Multi-vector | 33M | 0.3715 | 48 ms | 59.1K tok/s | $0.0038 |
| intfloat/e5-small-v2 | Dense | 33M | 0.3195 | 50 ms | 58.3K tok/s | $0.0038 |
| mixedbread-ai/mxbai-edge-colbert-v0-32m | Long context, Multi-vector | 32M | 0.3376 | 60 ms | 45.9K tok/s | $0.0048 |
| ibm-granite/granite-embedding-30m-sparse | Sparse | 30M | 0.3147 | 105 ms | 31.9K tok/s | $0.0070 |
| opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini | Sparse | 23M | 0.3267 | 55 ms | 51.1K tok/s | $0.0044 |
| sentence-transformers/all-MiniLM-L6-v2 | Dense | 23M | 0.2324 | 53 ms | 55.3K tok/s | $0.0040 |
| rasyosef/splade-mini | Sparse | 11M | 0.3090 | 56 ms | 56.3K tok/s | $0.0039 |

Latency, throughput and cost are shown only where we've benchmarked the model on the selected GPU; "—" means we don't have a measurement there. Cost is approximate: it is computed from list GPU prices, so your actual price depends on the provider you deploy SIE with.
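
To make the cost column easier to adapt to your own provider, here is a minimal sketch of how a per-million cost can be derived from a measured throughput and an hourly GPU price. It assumes one GPU running at the benchmarked throughput for the whole billed time. The function name `cost_per_million` and the $0.80/hour price are illustrative assumptions, not values published on this page, and the exact method behind the catalog figures may differ.

```python
def cost_per_million(throughput_per_s: float, gpu_price_per_hour: float) -> float:
    """Approximate dollars to process 1M units (e.g. tokens) on one GPU.

    Assumes the GPU sustains `throughput_per_s` for the whole billed time.
    """
    if throughput_per_s <= 0:
        raise ValueError("throughput_per_s must be positive")
    # Seconds needed to process one million units at the given throughput.
    seconds_per_million = 1_000_000 / throughput_per_s
    # Convert the hourly price to the price of that many seconds.
    return gpu_price_per_hour * seconds_per_million / 3600


# Hypothetical inputs: 3.5K tok/s and an assumed $0.80/hour GPU list price.
example = cost_per_million(3_500, 0.80)
print(f"${example:.3f} per 1M tokens")  # prints "$0.063 per 1M tokens"
```

Swapping in your provider's hourly rate gives a rough estimate for your own deployment; real costs also depend on utilization and batching, which this sketch ignores.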

Run any model here on your own hardware

Every model in this catalog is available on SIE, the open-source Superlinked Inference Engine. A single Apache 2.0-licensed cluster serves embeddings, rerankers, and extractors through three operations: encode, score, and extract.

Star the repo to follow releases and new models, or run the quickstart and serve your first model locally in minutes.

Open-source inference for agents

Open-source inference for the models behind your agents. Run it yourself, or let us run it for you.

Contact us

Tell us about your use case and we'll get back to you shortly.

Apply for an inference grant

Free capacity on our hosted cluster for selected projects. Tell us what you run and we'll reply by email.