Embeddings June 25, 2026

SIE vs hosted embedding APIs: ~97% of the quality at ~1/12th the cost

By Superlinked

If you are embedding at scale, you are probably paying a hosted API per token and quietly accepting whatever latency and rate-limit ceiling comes with your usage tier. We wanted to know what you give up by running the embedding models yourself on SIE instead.

So we ran them head to head: SIE’s open embedding models against the hosted frontier (Voyage, OpenAI, and Cohere) across quality, single-request latency, sustained throughput, and cost. The short version is that self-hosting on SIE lands within a few thousandths to a few hundredths of ndcg@10 of the best hosted API, returns embeddings 3 to 7 times faster, never hits a rate-limit wall, and embeds a billion tokens for around $10 instead of $120 to $130.

Here are the numbers, all drawn from a closed, reproducible dataset.

Is self-hosted embedding quality actually competitive?

Yes. Across eight standard MTEB retrieval tasks, SIE’s best valid model per task stays within 0.006 to 0.050 ndcg@10 of the hosted frontier, and beats it outright on two of the eight.

The SIE models in this pass are NovaSearch/stella_en_1.5B_v5 and Qwen/Qwen3-Embedding-4B, both open weights. The hosted frontier is voyage-4-large, plus cohere/embed-v4.0 on the Legal and CosQA tasks where it is strongest.

Task	Best SIE model	SIE ndcg@10	Best hosted	Hosted ndcg@10	Δ (SIE − hosted)
NFCorpus	`stella_en_1.5B_v5`	`0.4215`	`voyage-4-large`	`0.4279`	`-0.006`
SciFact	`stella_en_1.5B_v5`	`0.8039`	`voyage-4-large`	`0.8180`	`-0.014`
FiQA2018	`stella_en_1.5B_v5`	`0.5961`	`voyage-4-large`	`0.6309`	`-0.035`
LegalBench ConsumerContractsQA	`Qwen3-Embedding-4B`	`0.8152`	`cohere/embed-v4.0`	`0.8654`	`-0.050`
CosQA	`Qwen3-Embedding-4B`	`0.4006`	`cohere/embed-v4.0`	`0.3728`	`+0.028`
StackOverflowQA	`Qwen3-Embedding-4B`	`0.9336`	`voyage-4-large`	`0.9797`	`-0.046`
SCIDOCS	`Qwen3-Embedding-4B`	`0.2992`	`voyage-4-large`	`0.2539`	`+0.045`
CQADupstack Physics	`Qwen3-Embedding-4B`	`0.5264`	`voyage-4-large`	`0.5700`	`-0.044`

Read it straight: this is competitive, not leading. SIE wins on CosQA and SCIDOCS and trails by small margins everywhere else. Averaged across the eight tasks, the SIE frontier holds a mean ndcg@10 of 0.600 against the hosted frontier’s 0.615, which is 97.5% of the quality.
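The averages can be rechecked directly from the table. The following is a minimal sketch that uses only the ndcg@10 values printed above; the `summarize` helper is an illustrative name, not part of any benchmark tooling.

```python
# ndcg@10 per task, copied from the table above as (SIE, hosted).
SCORES = {
    "NFCorpus": (0.4215, 0.4279),
    "SciFact": (0.8039, 0.8180),
    "FiQA2018": (0.5961, 0.6309),
    "LegalBench ConsumerContractsQA": (0.8152, 0.8654),
    "CosQA": (0.4006, 0.3728),
    "StackOverflowQA": (0.9336, 0.9797),
    "SCIDOCS": (0.2992, 0.2539),
    "CQADupstack Physics": (0.5264, 0.5700),
}


def summarize(scores):
    """Return (SIE mean, hosted mean, SIE/hosted ratio, tasks where SIE wins)."""
    n = len(scores)
    sie_mean = sum(sie for sie, _ in scores.values()) / n
    hosted_mean = sum(hosted for _, hosted in scores.values()) / n
    wins = [task for task, (sie, hosted) in scores.items() if sie > hosted]
    return sie_mean, hosted_mean, sie_mean / hosted_mean, wins


sie_mean, hosted_mean, ratio, wins = summarize(SCORES)
print(f"SIE {sie_mean:.3f} vs hosted {hosted_mean:.3f} -> {ratio:.1%}; SIE wins on {wins}")
```

Note that the ratio is a ratio of means across tasks, so a large win on a low-scoring task such as SCIDOCS moves it less than the same absolute gap on a high-scoring task would.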

The reason that matters is the price tag attached to those last few hundredths.

How much cheaper is it, really?

About twelve times cheaper per unit of quality. On NFCorpus, dividing ndcg@10 by the price per million tokens gives 41.3 for SIE’s stella (0.4215 at $0.0102) against 3.6 for voyage-4-large (0.4279 at $0.12). The quality gap is a rounding error; the cost gap is an order of magnitude.
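The two NFCorpus figures appear to be ndcg@10 divided by the price per million tokens. That reading is an inference rather than something the benchmark states, but it reproduces both numbers. A short sketch, with `quality_per_dollar` as an illustrative name:

```python
def quality_per_dollar(ndcg_at_10, usd_per_million_tokens):
    """ndcg@10 divided by the price of embedding one million tokens."""
    return ndcg_at_10 / usd_per_million_tokens


# NFCorpus scores and per-million-token prices taken from the tables in this post.
sie_stella = quality_per_dollar(0.4215, 0.0102)    # stella on RTX
voyage_large = quality_per_dollar(0.4279, 0.12)    # voyage-4-large

print(f"SIE {sie_stella:.1f} vs Voyage {voyage_large:.1f} "
      f"-> {sie_stella / voyage_large:.1f}x")
```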

The clearest way to feel that is a real ingestion job. Embedding a one-billion-token corpus (roughly 1.95M chunks of 512 tokens) costs this much, where RTX is an RTX PRO 6000 at $3.03/hr and L4 is $0.80/hr on Modal:

Option	Cost for 1B tokens	Time on one worker
SIE `stella` (RTX)	`$10.22`	`3.4 h` (`1.7 h` on two workers)
SIE `stella` (L4)	`$15.53`	`19.4 h`
SIE `Qwen3-Embedding-4B` (RTX)	`$17.29`	`5.7 h`
`voyage-4-lite` / `openai-3-small`	`$20.00`	tier-capped
`voyage-4`	`$60.00`	tier-capped
`cohere/embed-v4.0`	`$120.00`	`~16 h` at flat cap
`voyage-4-large`	`$120.00`	tier-capped
`openai/text-embedding-3-large`	`$130.00`	tier-capped

A frontier-quality embed of a billion tokens runs roughly $10 to $17 on SIE versus $120 to $130 on the hosted frontier, which is 7 to 13 times cheaper. SIE’s stella at $10.22 even undercuts the cheapest hosted lite tier ($20) by about 2x, and you can shrink the wall-clock as far as you like by adding workers.
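The SIE rows can be reproduced from per-worker throughput and the hourly GPU price. The sketch below assumes cost is simply GPU-hours times the on-demand rate (L4 at $0.80/hr, RTX PRO 6000 at $3.03/hr, as listed in the reproducibility note) with no idle or startup overhead; `ingest_estimate` is an illustrative name.

```python
def ingest_estimate(total_tokens, tok_per_s, gpu_usd_per_hr, workers=1):
    """Return (wall-clock hours, total USD) for a sustained ingestion job.

    Adding workers divides wall-clock time but leaves total GPU-hours,
    and therefore total cost, unchanged.
    """
    gpu_hours = total_tokens / tok_per_s / 3600
    return gpu_hours / workers, gpu_hours * gpu_usd_per_hr


ONE_BILLION = 1_000_000_000

stella_rtx = ingest_estimate(ONE_BILLION, 82_382, 3.03)
stella_rtx_x2 = ingest_estimate(ONE_BILLION, 82_382, 3.03, workers=2)
stella_l4 = ingest_estimate(ONE_BILLION, 14_307, 0.80)
qwen_rtx = ingest_estimate(ONE_BILLION, 48_675, 3.03)

for name, (hours, usd) in [("stella RTX", stella_rtx), ("stella RTX x2", stella_rtx_x2),
                           ("stella L4", stella_l4), ("Qwen3-4B RTX", qwen_rtx)]:
    print(f"{name}: {hours:.1f} h, ${usd:.2f}")
```

Because cost depends only on total GPU-hours, the two-worker run finishes in half the time for the same bill, which is what "shrink the wall-clock by adding workers" amounts to.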

One honest caveat on the framing: these are sustained, high-utilization ingestion economics, the workload SIE is built for. If your pattern is occasional low-volume query bursts, a hosted API’s pay-per-call model can be the simpler fit. The story below is about embedding at scale.

How fast are the embeddings?

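3 to 7 times faster on a single request, taking the 27 ms p50 of stella and Qwen3-Embedding-4B as the baseline; against bge-m3 at 15 ms the gap widens to roughly 6 to 12 times. SIE returns an embedding in single-digit to low-tens of milliseconds; the hosted APIs sit in the high tens to low hundreds.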

Model	p50 latency	vs SIE
SIE `bge-m3` (RTX)	`15 ms`	baseline
SIE `stella` / `Qwen3-Embedding-4B` (RTX)	`27 ms`	baseline
`cohere/embed-v4.0`	`84 ms`	`3.1` to `5.6×` slower
`openai/text-embedding-3-large`	`157 ms`	`5.8` to `10×` slower
`voyage-4-large`	`180 ms`	`6.7` to `12×` slower

The tail is where it gets lopsided. SIE’s p99 stays within a few milliseconds of its p50 (still tens of ms total), while provider p99 runs into the hundreds and sometimes thousands of milliseconds. When you are embedding inside a request path, that predictability is worth as much as the median.
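To compare p50 and p99 on your own request path, a small timing harness is enough. This is a generic sketch, not the methodology behind the figures above: `embed` is a hypothetical callable wrapping whichever client you use, and the percentile is a simple nearest-rank estimate.

```python
import math
import time


def percentile(samples, q):
    """Nearest-rank percentile of samples, for an integer q in 1..100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latency_ms(embed, text, n=100, warmup=5):
    """Time n sequential calls to embed(text); return (p50, p99) in milliseconds."""
    for _ in range(warmup):          # discard cold-start calls
        embed(text)
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        embed(text)
        samples.append((time.perf_counter() - start) * 1000)
    return percentile(samples, 50), percentile(samples, 99)


# Stand-in for a real client call; replace with your own embedding request.
def fake_embed(text):
    time.sleep(0.001)


p50, p99 = measure_latency_ms(fake_embed, "hello world")
print(f"p50={p50:.1f} ms  p99={p99:.1f} ms")
```

With only 100 samples the p99 is effectively the second-slowest call, so a larger `n` gives a steadier tail estimate.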

What about throughput and rate limits?

This is the part the per-token pricing hides. Hosted throughput is capped by your usage tier, full stop. SIE scales linearly with hardware at a constant unit cost, and the math is boring in the best way.

A single SIE worker’s sustained throughput (the corpus “knee,” in tokens per second) and its price, next to each hosted API’s cap and per-token price, look like this:

Option	Sustained tok/s per worker	Query p50	$/1M tokens
SIE `stella_1.5B` (RTX)	`82,382`	`27 ms`	`$0.0102`
SIE `Qwen3-Embedding-4B` (RTX)	`48,675`	`27 ms`	`$0.0173`
`voyage-4-large`	`150,000` (tier cap)	`180 ms`	`$0.12`
`openai/text-embedding-3-large`	`166,667` (tier cap)	`157 ms`	`$0.13`
`cohere/embed-v4.0`	`~17,067` (flat)	`84 ms`	`$0.12`
`voyage/voyage-4-lite`	`800,000` (tier cap)	`153 ms`	`$0.02`

Where one SIE worker trails a provider’s tier ceiling, you close the gap by adding workers, and the unit cost does not move. The node count is ceil(load / knee), where load is the target throughput and knee is one worker’s sustained throughput, both in tokens per second. Cost per token is GPU$/hr / (3600 × knee), which is independent of node count.

To match	`stella` nodes	`Qwen3-Embedding-4B` nodes	SIE $/1M (held)	vs hosted
`voyage-4-large` 150k tok/s	`2` RTX	`4` RTX	`~$0.010 / ~$0.017`	7 to 12x cheaper than `$0.12`
`openai-3-large` 167k tok/s	`3` RTX	`4` RTX	`~$0.010 / ~$0.017`	8 to 13x cheaper than `$0.13`

So you match a provider’s entire off-the-shelf capacity with two to four GPU workers, stay 7 to 13 times cheaper per token, and are already faster on latency. There is no tier to negotiate and no wall to hit.
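The node counts and held unit costs in the table above follow from the two formulas. A minimal sketch, assuming the RTX PRO 6000 rate of $3.03/hr from the reproducibility note and perfectly linear scaling; the function names are illustrative.

```python
import math

RTX_USD_PER_HR = 3.03  # RTX PRO 6000 on-demand rate quoted in this post


def nodes_needed(load_tok_s, knee_tok_s):
    """Workers required to sustain load_tok_s: ceil(load / knee)."""
    return math.ceil(load_tok_s / knee_tok_s)


def usd_per_million_tokens(gpu_usd_per_hr, knee_tok_s):
    """GPU$/hr / (3600 * knee), scaled to one million tokens."""
    return gpu_usd_per_hr / (3600 * knee_tok_s) * 1_000_000


KNEE = {"stella": 82_382, "Qwen3-Embedding-4B": 48_675}
TARGETS = {"voyage-4-large": 150_000, "openai-3-large": 166_667}

for target, load in TARGETS.items():
    for model, knee in KNEE.items():
        n = nodes_needed(load, knee)
        unit = usd_per_million_tokens(RTX_USD_PER_HR, knee)
        print(f"match {target}: {n} x RTX for {model} at ${unit:.4f}/1M")
```

The unit cost never takes the node count as an input, which is why it holds constant as workers are added.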

Can you trade cost for speed?

Yes, and that lever does not exist on a hosted API. Because you pick the GPU, you pick your operating point on the cost-versus-throughput curve.

Model	L4: tok/s @ $/1M	RTX: tok/s @ $/1M	Pick
`e5-base-v2`	`181,292` @ `$0.0012`	`681,551` @ `$0.0012`	RTX (3.8x throughput, same cost)
`bge-m3`	`55,338` @ `$0.0040`	`235,430` @ `$0.0036`	RTX (faster and cheaper)
`stella_1.5B`	`14,307` @ `$0.0155`	`82,382` @ `$0.0102`	RTX (faster and cheaper)

For most models the RTX PRO 6000 is both faster and cheaper per token, because it buys more throughput than its hourly premium costs. The L4 is there when you want to floor the hourly spend on lighter models. Either way, you set the dial. A hosted API sets it for you and bills accordingly.
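One way to read the table: the RTX costs about 3.8 times as much per hour as the L4 ($3.03 against $0.80), so it is cheaper per token whenever it delivers more than about 3.8 times the throughput. The sketch below checks that break-even for the three models listed, using only figures from this post; `rtx_vs_l4` is an illustrative name.

```python
L4_USD_PER_HR = 0.80
RTX_USD_PER_HR = 3.03


def rtx_vs_l4(l4_tok_s, rtx_tok_s):
    """Return (RTX speedup, break-even speedup, RTX unit cost / L4 unit cost)."""
    speedup = rtx_tok_s / l4_tok_s
    breakeven = RTX_USD_PER_HR / L4_USD_PER_HR
    # Unit cost is price / throughput, so the cost ratio is breakeven / speedup.
    return speedup, breakeven, breakeven / speedup


MODELS = {
    "e5-base-v2": (181_292, 681_551),
    "bge-m3": (55_338, 235_430),
    "stella_1.5B": (14_307, 82_382),
}

for model, (l4, rtx) in MODELS.items():
    speedup, breakeven, cost_ratio = rtx_vs_l4(l4, rtx)
    print(f"{model}: {speedup:.2f}x throughput vs {breakeven:.2f}x price "
          f"-> RTX unit cost is {cost_ratio:.2f}x the L4's")
```

A cost ratio below 1 means the RTX is cheaper per token; a ratio near 1, as with e5-base-v2, means the choice comes down to wall-clock time against hourly spend.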

What this benchmark covers, and what’s next

This is the search and embedding slice of the SIE catalog. Embedding is one of the jobs SIE serves, not the whole picture, and we are running this same quality, performance, and cost methodology across the rest of the catalog. Treat this as the first installment, with more of the catalog reported the same way as it lands.

On reproducibility: every figure here comes from a closed observations dataset that passes full provenance checks (verify.py at 194/194 references OK, build.py at zero cache mismatches), measured on Modal GPUs (L4 at $0.80/hr, RTX PRO 6000 at $3.03/hr). Cost figures are the sustained, high-utilization floor at single on-demand GPU pricing, with no high-availability or idle overhead modeled. The scaling rows are extrapolations from measured per-worker throughput using the formulas above, not separately measured at every node count.

The takeaway

If you embed at scale, self-hosting on SIE gets you essentially frontier-quality retrieval at roughly one-twelfth the cost, several times the speed, and no rate-limit ceiling. The hosted APIs still win on zero-ops convenience and on spiky, low-volume workloads. For sustained ingestion and serving, the economics are not close.
