SIE vs hosted embedding APIs: ~97% of the quality at ~1/12th the cost

If you are embedding at scale, you are probably paying a hosted API per token and quietly accepting whatever latency and rate-limit ceiling comes with your usage tier. We wanted to know what you give up by running the embedding models yourself on SIE instead.

So we ran them head to head: SIE’s open embedding models against the hosted frontier (Voyage, OpenAI, and Cohere) across quality, single-request latency, sustained throughput, and cost. The short version is that self-hosting on SIE lands within 0.006 to 0.050 ndcg@10 of the best hosted API on the tasks where it trails, returns embeddings 3 to 7 times faster, never hits a rate-limit wall, and embeds a billion tokens for around $10 instead of $120 to $130.

Here are the numbers, all measured on a closed, reproducible dataset.

Is self-hosted embedding quality actually competitive?

Yes. Across eight standard MTEB retrieval tasks, SIE’s best valid model per task stays within 0.006 to 0.050 ndcg@10 of the hosted frontier, and beats it outright on two of the eight.

The SIE models in this pass are NovaSearch/stella_en_1.5B_v5 and Qwen/Qwen3-Embedding-4B, both open weights. The hosted frontier is voyage-4-large, plus cohere/embed-v4.0 on the Legal and CosQA tasks where it is strongest.

| Task | Best SIE model | SIE ndcg@10 | Best hosted | Hosted ndcg@10 | Δ (SIE − hosted) |
|---|---|---|---|---|---|
| NFCorpus | stella_en_1.5B_v5 | 0.4215 | voyage-4-large | 0.4279 | -0.006 |
| SciFact | stella_en_1.5B_v5 | 0.8039 | voyage-4-large | 0.8180 | -0.014 |
| FiQA2018 | stella_en_1.5B_v5 | 0.5961 | voyage-4-large | 0.6309 | -0.035 |
| LegalBench ConsumerContractsQA | Qwen3-Embedding-4B | 0.8152 | cohere/embed-v4.0 | 0.8654 | -0.050 |
| CosQA | Qwen3-Embedding-4B | 0.4006 | cohere/embed-v4.0 | 0.3728 | +0.028 |
| StackOverflowQA | Qwen3-Embedding-4B | 0.9336 | voyage-4-large | 0.9797 | -0.046 |
| SCIDOCS | Qwen3-Embedding-4B | 0.2992 | voyage-4-large | 0.2539 | +0.045 |
| CQADupstack Physics | Qwen3-Embedding-4B | 0.5264 | voyage-4-large | 0.5700 | -0.044 |

Read it straight: this is competitive, not leading. SIE wins on CosQA and SCIDOCS and trails by small margins everywhere else. Averaged across the eight tasks, the SIE frontier holds a mean ndcg@10 of 0.600 against the hosted frontier’s 0.615, which is 97.5% of the quality.
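The averages are easy to recheck from the table. A minimal sketch in Python, with the scores copied from the table above; the variable names are illustrative and not part of any SIE tooling:

```python
# ndcg@10 per task, copied from the quality table: (SIE, hosted)
scores = {
    "NFCorpus": (0.4215, 0.4279),
    "SciFact": (0.8039, 0.8180),
    "FiQA2018": (0.5961, 0.6309),
    "LegalBench ConsumerContractsQA": (0.8152, 0.8654),
    "CosQA": (0.4006, 0.3728),
    "StackOverflowQA": (0.9336, 0.9797),
    "SCIDOCS": (0.2992, 0.2539),
    "CQADupstack Physics": (0.5264, 0.5700),
}

# Unweighted mean across tasks for each side
sie_mean = sum(sie for sie, _ in scores.values()) / len(scores)
hosted_mean = sum(hosted for _, hosted in scores.values()) / len(scores)

# Share of hosted quality retained, and the tasks where SIE is ahead
quality_ratio = sie_mean / hosted_mean
sie_wins = [task for task, (sie, hosted) in scores.items() if sie > hosted]

print(f"SIE mean {sie_mean:.3f}, hosted mean {hosted_mean:.3f}")
print(f"quality retained: {quality_ratio:.1%}")
print("SIE wins:", sie_wins)
```

The mean is unweighted, so each task counts equally regardless of corpus size.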

The reason that matters is the price tag attached to those last few hundredths.

How much cheaper is it, really?

About twelve times cheaper per unit of quality. On NFCorpus, SIE’s stella_en_1.5B_v5 delivers 41.3 points of ndcg per dollar against voyage-4-large’s 3.6. The quality gap is a rounding error; the cost gap is an order of magnitude.
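Both per-dollar figures reproduce if you divide ndcg@10 by the price per million tokens. That reading is an assumption on our part, since the unit is not spelled out here; the sketch below uses the NFCorpus scores and the $/1M prices reported elsewhere in this post, and `quality_per_dollar` is an illustrative helper name.

```python
# NFCorpus ndcg@10 and price in $ per 1M tokens, taken from the tables in this post
sie_ndcg, sie_price = 0.4215, 0.0102        # stella_en_1.5B_v5 on RTX PRO 6000
voyage_ndcg, voyage_price = 0.4279, 0.12    # voyage-4-large


def quality_per_dollar(ndcg, price_per_million):
    """ndcg@10 divided by the dollar price of one million tokens (assumed definition)."""
    return ndcg / price_per_million


sie_qpd = quality_per_dollar(sie_ndcg, sie_price)
voyage_qpd = quality_per_dollar(voyage_ndcg, voyage_price)
advantage = sie_qpd / voyage_qpd

print(f"SIE {sie_qpd:.1f}, Voyage {voyage_qpd:.1f}, advantage {advantage:.1f}x")
```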

The clearest way to feel that is a real ingestion job. Embedding a one-billion-token corpus (roughly 1.95M chunks of 512 tokens) costs this much:

| Option | Cost for 1B tokens | Time on one worker |
|---|---|---|
| SIE stella (RTX) | $10.22 | 3.4 h (1.7 h on two workers) |
| SIE stella (L4) | $15.53 | 19.4 h |
| SIE Qwen3-Embedding-4B (RTX) | $17.29 | 5.7 h |
| voyage-4-lite / openai-3-small | $20.00 | tier-capped |
| voyage-4 | $60.00 | tier-capped |
| cohere/embed-v4.0 | $120.00 | ~16 h at flat cap |
| voyage-4-large | $120.00 | tier-capped |
| openai/text-embedding-3-large | $130.00 | tier-capped |

A frontier-quality embed of a billion tokens runs roughly $10 to $17 on SIE versus $120 to $130 on the hosted frontier, which is 7 to 13 times cheaper. SIE’s stella at $10.22 even undercuts the cheapest hosted lite tier ($20) by about 2x, and you can shrink the wall-clock as far as you like by adding workers.
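The SIE rows follow from two measured inputs: sustained tokens per second per worker and the GPU's hourly price. A rough sketch, assuming full utilization and linear scaling across workers; the throughput and price figures are the ones reported in this post, and `ingestion_job` is an illustrative helper name:

```python
TOKENS = 1_000_000_000  # one-billion-token corpus

# (sustained tokens/s per worker, GPU $/hr), as reported in this post
workers = {
    "stella (RTX)": (82_382, 3.03),
    "stella (L4)": (14_307, 0.80),
    "Qwen3-Embedding-4B (RTX)": (48_675, 3.03),
}


def ingestion_job(tokens, tok_per_s, gpu_per_hr, n_workers=1):
    """Return (wall-clock hours, total dollars) for embedding `tokens` tokens."""
    gpu_hours = tokens / tok_per_s / 3600   # total GPU time, the same for any worker count
    wall_clock = gpu_hours / n_workers      # assumes perfectly linear scaling
    return wall_clock, gpu_hours * gpu_per_hr


for name, (tps, price) in workers.items():
    hours, dollars = ingestion_job(TOKENS, tps, price)
    print(f"{name}: {hours:.1f} h on one worker, ${dollars:.2f}")
```

Adding workers divides the wall-clock time but leaves the bill unchanged, because the total GPU-hours stay the same.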

One honest caveat on the framing: these are sustained, high-utilization ingestion economics, the workload SIE is built for. If your pattern is occasional low-volume query bursts, a hosted API’s pay-per-call model can be the simpler fit. The story below is about embedding at scale.

How fast are the embeddings?

3 to 7 times faster on a single request for the frontier-quality models, and up to 12 times faster with bge-m3. SIE returns an embedding in 15 to 27 milliseconds at p50; the hosted APIs sit in the high tens to low hundreds.

| Model | p50 latency | vs SIE |
|---|---|---|
| SIE bge-m3 (RTX) | 15 ms | baseline |
| SIE stella / Qwen3-Embedding-4B (RTX) | 27 ms | baseline |
| cohere/embed-v4.0 | 84 ms | 3.1 to 5.6× slower |
| openai/text-embedding-3-large | 157 ms | 5.8 to 10× slower |
| voyage-4-large | 180 ms | 6.7 to 12× slower |
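The "vs SIE" ranges are ratios of medians: each hosted p50 divided by the slower and the faster SIE baseline. A small sketch using the p50 values from the table; `slowdown_range` is an illustrative helper name:

```python
# p50 latency in milliseconds, copied from the latency table
sie_p50 = {"bge-m3": 15, "stella / Qwen3-Embedding-4B": 27}
hosted_p50 = {
    "cohere/embed-v4.0": 84,
    "openai/text-embedding-3-large": 157,
    "voyage-4-large": 180,
}


def slowdown_range(hosted_ms, sie_latencies):
    """Return (smallest, largest) slowdown of a hosted API across the SIE baselines."""
    ratios = [hosted_ms / ms for ms in sie_latencies]
    return min(ratios), max(ratios)


for name, ms in hosted_p50.items():
    low, high = slowdown_range(ms, sie_p50.values())
    print(f"{name}: {low:.1f}x to {high:.1f}x slower")
```

The low end of each range compares against the 27 ms models and the high end against bge-m3 at 15 ms.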

The tail is where it gets lopsided. SIE’s p99 stays within a few milliseconds of its p50 (still tens of ms total), while provider p99 runs into the hundreds and sometimes thousands of milliseconds. When you are embedding inside a request path, that predictability is worth as much as the median.

What about throughput and rate limits?

This is the part the per-token pricing hides. Hosted throughput is capped by your usage tier, full stop. SIE scales linearly with hardware at a constant unit cost, and the math is boring in the best way.

A single SIE worker’s sustained throughput (the corpus “knee,” in tokens per second) and its price look like this:

| Model | Sustained tok/s (per worker, or tier cap) | Query p50 | $/1M tokens |
|---|---|---|---|
| SIE stella_1.5B (RTX) | 82,382 | 27 ms | $0.0102 |
| SIE Qwen3-Embedding-4B (RTX) | 48,675 | 27 ms | $0.0173 |
| voyage-4-large | 150,000 (tier cap) | 180 ms | $0.12 |
| openai/text-embedding-3-large | 166,667 (tier cap) | 157 ms | $0.13 |
| cohere/embed-v4.0 | ~17,067 (flat) | 84 ms | $0.12 |
| voyage/voyage-4-lite | 800,000 (tier cap) | 153 ms | $0.02 |

Where one SIE worker trails a provider’s tier ceiling, you close the gap by adding workers, and the unit cost does not move. Nodes are ceil(load / knee), and cost per token is GPU$/hr / (3600 × knee), which is independent of node count.
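Those two formulas fit in a few lines. A minimal sketch, using the per-worker knees and the RTX PRO 6000 price reported in this post; the function names are illustrative, and the node counts are extrapolations under linear scaling rather than separate measurements:

```python
import math


def nodes_needed(load_tok_s, knee_tok_s):
    """Workers required to sustain a target load: ceil(load / knee)."""
    return math.ceil(load_tok_s / knee_tok_s)


def cost_per_million(gpu_per_hr, knee_tok_s):
    """Dollars per 1M tokens: GPU$/hr / (3600 * knee), scaled to a million tokens."""
    return gpu_per_hr / (3600 * knee_tok_s) * 1_000_000


RTX_PER_HR = 3.03  # RTX PRO 6000 on-demand price quoted in this post
knees = {"stella_1.5B": 82_382, "Qwen3-Embedding-4B": 48_675}
targets = {"voyage-4-large": 150_000, "openai/text-embedding-3-large": 166_667}

for target, load in targets.items():
    for model, knee in knees.items():
        n = nodes_needed(load, knee)
        unit_cost = cost_per_million(RTX_PER_HR, knee)
        print(f"match {target} with {model}: {n} workers at ${unit_cost:.4f}/1M tokens")
```

Note that `cost_per_million` takes no node count at all, which is why the unit cost holds as you scale out.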

| To match | stella nodes | Qwen3-Embedding-4B nodes | SIE $/1M (held) | vs hosted |
|---|---|---|---|---|
| voyage-4-large 150k tok/s | 2 RTX | 4 RTX | ~$0.010 / ~$0.017 | 7 to 12x cheaper than $0.12 |
| openai-3-large 167k tok/s | 3 RTX | 4 RTX | ~$0.010 / ~$0.017 | 8 to 13x cheaper than $0.13 |

So you match a frontier provider’s entire off-the-shelf capacity with two to four GPU workers, stay 7 to 13 times cheaper per token, and you are already faster on latency. There is no tier to negotiate and no wall to hit.

Can you trade cost for speed?

Yes, and that lever does not exist on a hosted API. Because you pick the GPU, you pick your operating point on the cost-versus-throughput curve.

| Model | L4: tok/s @ $/1M | RTX: tok/s @ $/1M | Pick |
|---|---|---|---|
| e5-base-v2 | 181,292 @ $0.0012 | 681,551 @ $0.0012 | RTX (3.8x throughput, same cost) |
| bge-m3 | 55,338 @ $0.0040 | 235,430 @ $0.0036 | RTX (faster and cheaper) |
| stella_1.5B | 14,307 @ $0.0155 | 82,382 @ $0.0102 | RTX (faster and cheaper) |

For most models the RTX PRO 6000 is both faster and cheaper per token, because it buys more throughput than its hourly premium costs. The L4 is there when you want to floor the hourly spend on lighter models. Either way, you set the dial. A hosted API sets it for you and bills accordingly.
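One way to see the break-even: the RTX is cheaper per token whenever its throughput advantage over the L4 exceeds its hourly price premium. A short sketch using the throughput figures from the table and the hourly prices reported in this post; `rtx_speedup` is an illustrative helper name:

```python
L4_PER_HR, RTX_PER_HR = 0.80, 3.03       # on-demand prices quoted in this post
price_ratio = RTX_PER_HR / L4_PER_HR     # hourly premium of the RTX over the L4

# Sustained tokens/s per worker: (L4, RTX), copied from the table
throughput = {
    "e5-base-v2": (181_292, 681_551),
    "bge-m3": (55_338, 235_430),
    "stella_1.5B": (14_307, 82_382),
}


def rtx_speedup(l4_tps, rtx_tps):
    """Throughput multiple the RTX delivers over the L4 for one model."""
    return rtx_tps / l4_tps


for model, (l4_tps, rtx_tps) in throughput.items():
    speedup = rtx_speedup(l4_tps, rtx_tps)
    verdict = "RTX cheaper per token" if speedup > price_ratio else "cost roughly level"
    print(f"{model}: {speedup:.2f}x throughput vs {price_ratio:.2f}x price -> {verdict}")
```

For e5-base-v2 the speedup sits almost exactly at the price premium, which is why both GPUs land at $0.0012 per million tokens; for the larger models the speedup clears it comfortably.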

What this benchmark covers, and what’s next

This is the search and embedding slice of the SIE catalog. Embedding is one of the jobs SIE serves, not the whole picture, and we are running this same quality, performance, and cost methodology across the rest of the catalog. Treat this as the first installment, with more of the catalog reported the same way as it lands.

On reproducibility: every figure here comes from a closed observations dataset that passes full provenance checks (verify.py at 194/194 references OK, build.py at zero cache mismatches), measured on Modal GPUs (L4 at $0.80/hr, RTX PRO 6000 at $3.03/hr). Cost figures are the sustained, high-utilization floor at single on-demand GPU pricing, with no high-availability or idle overhead modeled. The scaling rows are extrapolations from measured per-worker throughput using the formulas above, not separately measured at every node count.

The takeaway

If you embed at scale, self-hosting on SIE gets you essentially frontier-quality retrieval at roughly one-twelfth the cost, several times the speed, and no rate-limit ceiling. The hosted APIs still win on zero-ops convenience and on spiky, low-volume workloads. For sustained ingestion and serving, the economics are not close.
