How to Cut Token Usage and Costs in AI Search and Agents (Without Throwing More GPUs at It)
TL;DR: Most teams watch their large-model token bill and miss where cost actually compounds: the embeddings, reranking, and extraction calls running underneath every search and agent.
Those are the workloads best served by small models, and small models do not need a per-token API or a bigger GPU. Run them on infrastructure you control and you stop paying per token, raise GPU utilization, and keep your agent’s context tight enough to avoid context rot.
This is the case Filip makes at AI Engineer Europe 2026, and the reason Superlinked built and open-sourced the Superlinked Inference Engine (SIE).
Watch Filip’s full talk, then read on for how the ideas map to your token bill.
If you would rather see the reasoning first, here is why those costs add up and where small models change the math.
Why do token costs keep climbing in AI search and agentic workflows?
If you are tracking spend, you are probably watching the obvious meter: tokens into and out of a large language model. That meter is real, but it is not the only one, and often not the one growing fastest.
In a typical retrieval-augmented or agentic system, every query triggers a chain of smaller jobs before the big model ever sees a prompt. You embed the query. You embed and re-embed documents. You rerank candidates. You extract entities, classify intent, and route work. As Superlinked’s launch post puts it, a single document processing pipeline might chain four of these specialized models together. When those jobs run through hosted APIs that bill per token, the cost scales with traffic, with corpus size, and with every re-index. The unit price looks tiny on the pricing page. Multiplied across millions of documents and every user request, it is not.
There are two cost centers hiding here, and both are token problems:
- Per-token charges on the small jobs. Embedding, reranking, and extraction APIs meter you the same way a chat model does. High-volume search and ingestion turn a fraction-of-a-cent rate into a recurring line item that grows with usage.
- Token bloat at the large model. When retrieval is weak, teams compensate by stuffing more context into the prompt. More tokens in means a higher bill, slower responses, and, past a point, worse answers.
The fix for both runs through the same place: the small models doing retrieval and preprocessing.
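To make the first cost center concrete, here is a minimal sketch of how a metered bill grows with corpus size, re-indexing, and traffic. Every number in it, including the per-million-token rate, is a placeholder chosen for illustration and not a quoted price from any provider.

```python
from dataclasses import dataclass


@dataclass
class MeteredWorkload:
    """Hypothetical inputs for an embedding pipeline billed per token."""
    corpus_docs: int                  # documents in the index
    tokens_per_doc: int               # average tokens per document
    reindexes_per_month: int          # full re-embeds of the corpus each month
    queries_per_month: int            # user queries embedded each month
    tokens_per_query: int             # average tokens per query
    price_per_million_tokens: float   # placeholder rate, not a real price


def monthly_metered_cost(w: MeteredWorkload) -> float:
    """Return the monthly bill: (ingestion tokens + query tokens) x unit rate."""
    ingest_tokens = w.corpus_docs * w.tokens_per_doc * w.reindexes_per_month
    query_tokens = w.queries_per_month * w.tokens_per_query
    return (ingest_tokens + query_tokens) / 1_000_000 * w.price_per_million_tokens


# Illustrative workload only: 5M documents re-embedded twice, 10M queries.
workload = MeteredWorkload(
    corpus_docs=5_000_000,
    tokens_per_doc=400,
    reindexes_per_month=2,
    queries_per_month=10_000_000,
    tokens_per_query=20,
    price_per_million_tokens=0.10,
)
print(f"${monthly_metered_cost(workload):,.2f} per month")
```

In this made-up workload, ingestion tokens outnumber query tokens, which is why re-indexing frequency is worth including in any estimate of your own.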
What is context rot, and why does it inflate token usage?
Context rot is the quality decay you see when you keep enlarging a model’s context window. Relevant signal gets buried under marginally relevant filler, attention spreads thin, and answers get worse even though you fed the model more. The instinct to “just add more context” backfires, and you pay for the privilege in tokens.
Small models are how you avoid that trap. Good embeddings and a strong reranker let you retrieve fewer, better passages, so the large model receives a tight, high-signal prompt instead of a dump. In the talk, Filip frames small models as essential for context management in agentic workflows: they handle retrieval, classification, and preprocessing so the expensive model only ever sees what matters. Fewer tokens in, better answers out, lower bill. That is the rare optimization where cost and quality move in the same direction.
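One way to picture "fewer, better passages" is a selection step that takes reranker scores and fills a fixed token budget from the top down. The sketch below is an assumption-laden illustration: `estimate_tokens` is a crude word count standing in for a real tokenizer, and the scores, budget, and threshold are invented values.

```python
from typing import List, Tuple


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def select_context(
    scored: List[Tuple[str, float]],
    token_budget: int,
    min_score: float = 0.0,
) -> List[str]:
    """Keep the highest-scoring passages that fit within the token budget."""
    chosen: List[str] = []
    used = 0
    for passage, score in sorted(scored, key=lambda p: p[1], reverse=True):
        if score < min_score:
            break  # everything after this point scores even lower
        cost = estimate_tokens(passage)
        if used + cost > token_budget:
            continue  # too long to fit; a shorter passage may still fit
        chosen.append(passage)
        used += cost
    return chosen


# Hypothetical reranker output: (passage, relevance score).
candidates = [
    ("Self-host small models to remove per-token charges.", 0.92),
    ("Unrelated passage about office seating.", 0.05),
    ("Rerank candidates so fewer passages reach the LLM.", 0.81),
]
context = select_context(candidates, token_budget=20, min_score=0.5)
print(context)
```

The low-scoring passage never reaches the prompt, so the large model pays only for the two passages that carry signal.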
Why throwing more GPUs (or a bigger model) at it does not help
The reflex when inference feels slow or expensive is to add hardware or reach for a larger model. For small-model workloads, both are inefficient.
A small encoder or reranker does not saturate a modern GPU. Pin one model to one container on one GPU and most of that card sits idle between requests. Superlinked’s own framing of the problem is stark: five models, five dedicated pools, each provisioned for peak and idle the rest of the time, landing at roughly 3% total utilization. You are paying for capacity you are not using, and scaling that pattern out means buying more idle GPUs, not more throughput.
The better lever, Filip argues, is utilization. Run many small models on the same GPU and switch between them as demand shifts, so the hardware stays busy instead of dedicated to one underused model. That is an infrastructure problem, not a hardware-budget problem, and it is the gap that production-grade small-model inference has largely ignored.
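The utilization argument is simple arithmetic: busy GPU hours divided by provisioned GPU hours. The sketch below uses invented busy-hour figures, picked so that five dedicated pools land near the roughly 3% cited above, and it assumes for simplicity that the same work could be packed onto one shared GPU with no contention or model-swap overhead.

```python
from typing import Sequence


def fleet_utilization(
    busy_gpu_hours: Sequence[float],
    provisioned_gpu_hours: Sequence[float],
) -> float:
    """Return total busy GPU hours as a fraction of total provisioned GPU hours."""
    provisioned = sum(provisioned_gpu_hours)
    if provisioned <= 0:
        raise ValueError("provisioned GPU hours must be positive")
    return sum(busy_gpu_hours) / provisioned


# Hypothetical busy hours per day for five small models.
busy_per_model = [1.2, 0.5, 0.9, 0.6, 0.4]

# One dedicated GPU per model, each provisioned around the clock.
dedicated = fleet_utilization(busy_per_model, [24.0] * 5)

# The same work packed onto a single shared GPU (idealized).
shared = fleet_utilization(busy_per_model, [24.0])

print(f"dedicated pools: {dedicated:.0%}, one shared GPU: {shared:.0%}")
```

The work did not change between the two cases. Only the denominator did, which is the sense in which utilization is an infrastructure lever and not a hardware purchase.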
The fix: small models on infrastructure you control
The talk’s central image is a “yin and yang” of inference. A serving engine that works in production needs both halves, and most tools ship only one.
Model support (the yin)
Open-source embedding and reranking models do not share one clean architecture. A BERT encoder, a Qwen-based model, and a ColBERT-style multi-vector model differ in how they handle attention and positional embeddings, so the engine needs per-architecture handling rather than one generic path. In SIE, the docs describe this as a compute-engine abstraction: the server wraps PyTorch, SGLang, and Flash Attention behind three uniform primitives and picks the best engine per model automatically. This is the unglamorous work that decides whether a model runs fast or merely runs.
Infrastructure (the yang)
Models alone are not a system. Production needs routing, queuing, autoscaling, monitoring, and GPU provisioning, all of it automated. The alternative is the brittle one-model-per-container deployment that wastes compute and breaks the moment your model mix changes. Treating infrastructure as the equal partner of model support is what lets you move from a research notebook to production scale without hand-managing hardware.
What this looks like in practice
SIE is Superlinked’s open-source answer to that gap: one inference server for embeddings, reranking, and entity extraction, with 85+ pre-configured models, each with quality and latency targets checked in CI. The whole API is three primitives:
- Encode converts text or images to vectors for semantic search and RAG.
- Score reranks query-document pairs for higher-precision retrieval.
- Extract pulls entities and structured data from unstructured text.
The design choices map directly onto the token-cost problem:
- No per-token meter. SIE runs on your own infrastructure, from a laptop to a Kubernetes cluster, without paying per-token API costs. You pay for the compute you provision, not for every token embedded or scored.
- Multiple models per GPU, loaded on demand. Models load lazily on first request and are evicted with a least-recently-used policy, so a single GPU serves a rotating set of models and stays utilized. Per the docs, an L4 (24GB) keeps 2 to 3 standard models hot at once, while all 85+ remain available at query time regardless of VRAM.
- The full production stack, not just a server. A load-balancing gateway, KEDA autoscaling with scale-from-zero, Grafana dashboards, and Terraform modules for GKE and EKS ship in the box, all Apache 2.0.
- Drop-in migration. SIE exposes an OpenAI-compatible embeddings endpoint, and the docs include dedicated migration guides for OpenAI, Cohere, and TEI, so existing code can point at your own cluster instead of a paid API.
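The lazy-load and least-recently-used behavior described in the list above can be illustrated with a toy cache. This is a generic sketch of the LRU technique, not SIE's implementation: the class name, the `loader` callback, and the count-based capacity are all assumptions made for the example, and a real server would more plausibly budget by VRAM than by model count.

```python
from collections import OrderedDict
from typing import Callable, List


class LRUModelCache:
    """Toy lazy-loading model cache with least-recently-used eviction."""

    def __init__(self, capacity: int, loader: Callable[[str], object]) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.loader = loader                      # hypothetical: loads weights by name
        self._hot: "OrderedDict[str, object]" = OrderedDict()
        self.evicted: List[str] = []              # eviction log, for inspection

    def get(self, name: str) -> object:
        if name in self._hot:
            self._hot.move_to_end(name)           # mark as most recently used
            return self._hot[name]
        if len(self._hot) >= self.capacity:
            oldest, _ = self._hot.popitem(last=False)  # drop least recently used
            self.evicted.append(oldest)
        model = self.loader(name)                 # lazy load on first request
        self._hot[name] = model
        return model

    def hot_models(self) -> List[str]:
        """Return loaded model names, least recently used first."""
        return list(self._hot)


cache = LRUModelCache(capacity=2, loader=lambda name: f"<weights for {name}>")
for requested in ["embedder", "reranker", "embedder", "extractor"]:
    cache.get(requested)

print(cache.hot_models(), cache.evicted)
```

With room for two models, the third distinct request evicts whichever model was used least recently, so every model stays reachable even though only two are resident at a time.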
Because retrieval and reranking become local calls, the per-token cost on those jobs goes away and the work simply uses compute you already control:
```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Embed a query locally. No per-token charge.
query = client.encode(
    "sentence-transformers/all-MiniLM-L6-v2",
    Item(text="How do I lower my embedding costs?"),
)

# Rerank candidates locally, so only the best few reach your LLM.
ranked = client.score(
    "cross-encoder/ms-marco-MiniLM-L-6-v2",
    Item(text="How do I lower my embedding costs?"),
    [Item(text="Self-host small models."), Item(text="Unrelated passage.")],
)
```

A tighter, better-ranked candidate set means fewer tokens forwarded to the large model, which is the other half of the savings.
SIE vs hosted APIs: where the token cost goes
| Feature | SIE | TEI (Hugging Face) | OpenAI API |
|---|---|---|---|
| Self-hosted | Yes | Yes | No |
| Multi-model on one GPU | Yes | No (one model per server) | N/A |
| Encode + Score + Extract | Yes | Encode only | Encode only |
| 85+ supported models | Yes | Varies | Limited |
| No per-token cost | Yes | Yes | No |
This comparison is drawn from the SIE documentation. For the full performance picture, Superlinked publishes a SIE vs TEI vs OpenAI benchmark, and its launch materials put the API-cost reduction from self-hosting as high as 50x. Treat the headline figure as workload-dependent and check it against your own traffic.
How small models reduce token usage, concretely
Pulling the threads together, there are three distinct ways this lowers token usage:
- Eliminating per-token billing on small jobs. Embedding, reranking, and extraction stop being metered API calls and become compute you own.
- Shrinking the large-model prompt. Stronger retrieval and reranking mean fewer, higher-quality passages in context, so the expensive model processes fewer input tokens per request.
- Avoiding the context-rot tax. Tight context keeps answer quality up, which removes the temptation to compensate by adding even more tokens.
None of this requires a bigger model or a bigger GPU bill. It requires running the right small models efficiently on hardware you control.
FAQ
Does self-hosting embeddings actually save money on tokens?
For high-volume workloads, usually yes. Per-token embedding and reranking pricing scales with traffic and corpus size, while self-hosted small models convert that variable cost into fixed compute you provision and reuse. The crossover point depends on your volume, but ingestion-heavy and re-indexing-heavy systems tend to reach it quickly.
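The crossover point is the monthly token volume at which the metered bill equals your fixed compute cost. A minimal sketch, assuming a flat monthly compute cost and a flat per-million-token rate (both figures below are placeholders, not real prices):

```python
def breakeven_tokens_per_month(
    fixed_monthly_compute: float,
    price_per_million_tokens: float,
) -> float:
    """Return the monthly token volume above which self-hosting costs less."""
    if price_per_million_tokens <= 0:
        raise ValueError("price_per_million_tokens must be positive")
    return fixed_monthly_compute / price_per_million_tokens * 1_000_000


# Placeholder inputs: a fixed monthly compute cost versus a metered rate.
crossover = breakeven_tokens_per_month(
    fixed_monthly_compute=600.0,
    price_per_million_tokens=0.10,
)
print(f"break-even at {crossover:,.0f} tokens per month")
```

Below that volume the metered API is cheaper under these assumptions, and above it the fixed compute wins. Real comparisons should also account for engineering time and for utilization of the provisioned hardware.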
Can I replace my per-token embedding API?
That is the intended path. SIE provides an OpenAI-compatible embeddings endpoint for drop-in migration, plus dedicated migration guides for OpenAI, Cohere, and TEI, and a native SDK with encode, score, and extract covering embeddings, reranking, and extraction in one server.
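In practice, "drop-in" usually means the request shape stays the same and only the base URL changes. The sketch below builds an OpenAI-style embeddings request with the standard library and does not send it. The `/v1/embeddings` path and the `model` plus `input` body follow the OpenAI convention; that a local SIE server answers at that exact path on `http://localhost:8080` is an assumption here, so check the migration guide before relying on it.

```python
import json
import urllib.request
from typing import List, Optional


def build_embeddings_request(
    base_url: str,
    model: str,
    texts: List[str],
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style POST request for embeddings."""
    headers = {"Content-Type": "application/json"}
    if api_key is not None:
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",  # assumed path, see lead-in
        data=body,
        headers=headers,
        method="POST",
    )


# Point the same request shape at a server you run yourself.
local = build_embeddings_request(
    base_url="http://localhost:8080",
    model="sentence-transformers/all-MiniLM-L6-v2",
    texts=["How do I lower my embedding costs?"],
)
print(local.get_method(), local.full_url)
```

Passing the result to `urllib.request.urlopen` would send it, which requires a server to be running at that address.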
Do small models hurt quality?
For retrieval, reranking, classification, and extraction, well-chosen small models are competitive, which is why SIE checks every supported model against quality and latency targets in CI. Quality in these tasks comes from picking the right model and reranking well, not from model size alone.
What about reranking and extraction, not just embeddings?
Those are token costs too when run through hosted APIs, and they run through the same SIE server. Reranking in particular is the lever that lets you forward fewer passages to the large model.
Is this only worth it for large teams?
No. The same Docker image runs on a single machine for development and scales to a Kubernetes cluster with autoscaling and scale-from-zero for production. There is no separate production mode, so the setup grows with you rather than requiring a platform team up front.
The takeaway
The token bill that hurts is rarely just the chat model. It is the steady drip of embedding, reranking, and extraction calls underneath every search and agent, plus the context bloat that weak retrieval forces on you. Small models address both, and the missing piece was never the models. It was the infrastructure to run them efficiently. That is what SIE is for.
Explore the engine, browse the 85+ supported models, and run it yourself at github.com/superlinked/sie and superlinked.com/docs. For the why behind the build, read the launch post or watch Filip’s full talk above.