---
title: Why Are My Token Costs Going Up? How Open Source Inference Keeps Them Down
description: GitHub Copilot raised model multipliers up to 27x on June 1, 2026. Here is why token bills keep climbing, and how self-hosted open source inference for embeddings, reranking, and extraction cuts the cost that quietly compounds under your agents.
canonical_url: https://superlinked.com/blog/why-are-my-token-costs-going-up
last_updated: 2026-06-10
---

**TL;DR:** Another token price hike just landed, and it will not be the last. Your bill keeps climbing even when you change nothing, because the meters you rent are not yours to control. The durable fix is moving the high-volume work that runs underneath your agents, the embeddings, reranking, and extraction, onto open source models in your own cloud. That work has no per-token meter when you self-host it, open source quality is now genuinely competitive, and you can have a cluster running in your own AWS or GCP account in two commands.

<BlogSieCta />

## What just changed with GitHub Copilot pricing in June 2026?

On June 1, 2026, GitHub Copilot [moved from flat premium requests to usage-based billing](https://github.blog/news-insights/company-news/github-copilot-is-moving-to-usage-based-billing/), and for developers who stayed on an annual Pro or Pro+ plan under the legacy request system, the model multipliers jumped sharply the same day. A multiplier sets how many premium requests each call to a model deducts from your allowance, so when it rises, the same work uses up more of your plan.

You do not need a leaked screenshot for this. The new multipliers are [published in GitHub's own documentation](https://docs.github.com/en/copilot/reference/copilot-billing/model-multipliers-for-annual-plans), and the jumps are steep. Claude Opus 4.6 went from 3x to 27x, a ninefold increase. Claude Opus 4.7 went from 15x to 27x. Claude Sonnet 4.6 went from 1x to 9x. GPT-5.4 went from 1x to 6x, and GPT-5.4 mini went from 0.33x to 6x, an eighteenfold jump on the model many teams reached for precisely because it was cheap. The highest multiplier GitHub documents is 27x. Only the smallest models, like Claude Haiku 4.5 at 0.33x, stayed flat.
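
To make the arithmetic concrete, here is a minimal sketch of what those multipliers do to a fixed allowance. The multiplier values are the ones quoted above; the 300-request monthly allowance is an assumed figure for illustration only, so substitute the number from your own plan.

```python
# Old and new multipliers as quoted above
# (premium requests deducted per model call).
MULTIPLIERS = {
    "Claude Opus 4.6": (3.0, 27.0),
    "Claude Opus 4.7": (15.0, 27.0),
    "Claude Sonnet 4.6": (1.0, 9.0),
    "GPT-5.4": (1.0, 6.0),
    "GPT-5.4 mini": (0.33, 6.0),
    "Claude Haiku 4.5": (0.33, 0.33),
}

# Assumption: an illustrative monthly allowance, not taken from the article.
MONTHLY_ALLOWANCE = 300


def fold_increase(old: float, new: float) -> float:
    """How many times more allowance one call consumes after the change."""
    return new / old


def calls_per_month(allowance: float, multiplier: float) -> int:
    """Whole model calls that fit into the allowance at a given multiplier."""
    return int(allowance // multiplier)


for model, (old, new) in MULTIPLIERS.items():
    print(
        f"{model}: {fold_increase(old, new):.1f}x, "
        f"{calls_per_month(MONTHLY_ALLOWANCE, old)} -> "
        f"{calls_per_month(MONTHLY_ALLOWANCE, new)} calls per month"
    )
```

Under that assumed allowance, a model that moves from 3x to 27x drops from 100 calls a month to 11 without any change on your side.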

This is not a rounding error. [TechTimes reported](https://www.techtimes.com/articles/317536/20260601/github-copilot-pricing-change-drives-backlash-agentic-bills-jump-10x-50x-power-users.htm) that power users are projecting agentic-session cost increases of 10x to 50x and that the free fallback model has been removed entirely. There is a second meter too: Copilot code review now consumes both AI Credits and GitHub Actions minutes, so automated reviews bill on two tracks at once. Base seat prices did not move. The metered economics underneath them did.

Here is the part worth sitting with. None of those developers changed a line of code. Their tools got more expensive because someone else re-priced the meter. That is the real story, and Copilot is only the latest example.

## Why do token costs keep climbing, even when you don't touch your code?

Because the variables that set your bill are not all under your control, and several of them moved in 2026.

Start with the meter you watch. Even at a fixed published rate, the tokens billed for the same prompt can change when a provider ships a new tokenizer. Pricing trackers reported that [Anthropic's Opus 4.7 shipped with a tokenizer](https://www.finout.io/blog/anthropic-api-pricing) that can produce up to 35 percent more tokens for the same input text versus the prior generation, with per-token prices unchanged. Same prompt, same listed price, bigger invoice. Long context is the next lever: [reporting on 2026 pricing](https://www.finout.io/blog/openai-vs-anthropic-api-pricing-comparison) notes that GPT-5.4 standard doubles input pricing above a 272K-token threshold and lifts output pricing for the whole session, so a context-heavy agent crosses into a pricier tier on its own.
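
Both levers are simple arithmetic, and a rough sketch shows how they move an invoice while the listed rate stays the same. The 35 percent inflation and the 272K threshold with a doubled input rate come from the reporting above; the $1.00 per 1M token price and the prompt sizes are hypothetical, and the sketch assumes the higher rate applies to the whole request once it crosses the threshold. Output pricing is left out.

```python
from typing import Optional


def billed_input_cost(
    tokens: int,
    price_per_million: float,
    tokenizer_inflation: float = 0.0,
    long_context_threshold: Optional[int] = None,
    long_context_multiplier: float = 1.0,
) -> float:
    """Input cost in dollars for a single request.

    tokenizer_inflation: fractional growth in token count for the same text.
    long_context_threshold: token count above which the request is re-priced
    (assumption: the higher rate applies to the entire input).
    """
    billed_tokens = round(tokens * (1 + tokenizer_inflation))
    rate = price_per_million
    if long_context_threshold is not None and billed_tokens > long_context_threshold:
        rate *= long_context_multiplier
    return billed_tokens * rate / 1_000_000


# Assumption: hypothetical list price, chosen only to keep the numbers readable.
PRICE_PER_MILLION = 1.00

# Lever 1: same text, new tokenizer, up to 35 percent more tokens.
baseline = billed_input_cost(250_000, PRICE_PER_MILLION)
inflated = billed_input_cost(250_000, PRICE_PER_MILLION, tokenizer_inflation=0.35)

# Lever 2: a long-context tier that doubles the input rate above 272K tokens.
below_tier = billed_input_cost(
    270_000, PRICE_PER_MILLION,
    long_context_threshold=272_000, long_context_multiplier=2.0,
)
above_tier = billed_input_cost(
    300_000, PRICE_PER_MILLION,
    long_context_threshold=272_000, long_context_multiplier=2.0,
)

print(f"tokenizer change: ${baseline:.4f} -> ${inflated:.4f}")
print(f"long-context tier: ${below_tier:.4f} -> ${above_tier:.4f}")
```

The two levers come from different providers in the reporting above, so the sketch keeps them separate instead of stacking them.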

Then the market backdrop. As the big labs march toward public markets, analysts expect margin pressure to show up as higher prices and higher floors. [Coverage of Anthropic's April 2026 enterprise changes](https://www.claudeapi.com/en/blog/news/anthropic-enterprise-pricing-analysis-2026/) described raised annual minimums and finer-grained volume tiers. [The Information's reporting](https://www.investing.com/analysis/the-ai-token-pricing-crisis-behind-openai-and-anthropics-revenue-race-200680777), echoed across the trade press, described a token pricing squeeze where monthly API costs per engineer at heavy adopters ran from 500 to 2,000 US dollars as usage scaled, with at least one company sent back to the drawing board on its budget.

The throughline is structural: per-token pricing means your cost rises with traffic, corpus size, and every re-index, and the rate card is not a contract you control. Copilot just made that visible to four million developers at once.

[Run SIE in your own cloud and own the meter](https://github.com/superlinked/sie).

## Where do token costs actually compound in an AI agent?

Not where most teams look. The chat model is the obvious meter. It is rarely the one growing fastest.

In a typical retrieval-augmented or agentic system, every query fans out into a chain of smaller jobs before the expensive model ever sees a prompt. You embed the query. You embed and re-embed documents on every ingest and re-index. You rerank candidates. You extract entities, classify intent, and route work. A single document processing pipeline can chain four of these specialized models together. Each call looks like a fraction of a cent on the pricing page. Multiply it across millions of documents and every user request and it becomes a line item that grows with adoption.

Two token problems hide here, and they feed each other:

1. **Per-token charges on the small jobs.** Embedding, reranking, and extraction APIs meter you the same way a chat model does. High-volume search and ingestion turn a tiny unit rate into a recurring bill.
2. **Token bloat at the large model.** When retrieval is weak, teams compensate by stuffing more context into the prompt. More input tokens mean a higher bill, slower responses, and, past a point, worse answers. That quality decay from oversized context is what practitioners call context rot.

The fix for both runs through the same place: the small models doing retrieval and preprocessing. Get those right and you stop paying per token on the high-volume layer while keeping the prompt to your expensive model tight.
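
The second problem is easy to put numbers on. Below is a minimal sketch of how prompt size drives the large-model bill; every figure in it (query volume, passage length, overhead, and the $2.00 per 1M input price) is a hypothetical placeholder, not a measurement.

```python
def monthly_prompt_cost(
    queries_per_month: int,
    passages_per_prompt: int,
    tokens_per_passage: int,
    overhead_tokens: int,
    price_per_million_input: float,
) -> float:
    """Monthly input-token spend at the large model, in dollars."""
    tokens_per_prompt = overhead_tokens + passages_per_prompt * tokens_per_passage
    total_tokens = queries_per_month * tokens_per_prompt
    return total_tokens * price_per_million_input / 1_000_000


# Assumption: hypothetical workload and price, for illustration only.
common = dict(
    queries_per_month=1_000_000,
    tokens_per_passage=400,
    overhead_tokens=500,  # system prompt plus the user question
    price_per_million_input=2.00,
)

# Weak retrieval: stuff 20 passages into every prompt.
stuffed = monthly_prompt_cost(passages_per_prompt=20, **common)

# Reranked retrieval: keep only the 5 best passages.
reranked = monthly_prompt_cost(passages_per_prompt=5, **common)

print(f"stuffed prompt:  ${stuffed:,.0f} per month")
print(f"reranked prompt: ${reranked:,.0f} per month")
```

The saving scales with the number of passages you can drop, which is why retrieval quality shows up directly on the large-model invoice.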

## What is open source self-hosted inference, and how does it cut the bill?

Self-hosted inference means running model inference inside your own cloud account or hardware instead of sending each request to a third-party managed API. You control the hardware, the models, the configuration, and where the data goes.

The cost mechanic is simple and it is the opposite of what Copilot just did to its users. A per-token API turns every embed, rerank, and extract into a variable charge that scales forever with usage. Self-hosting turns that variable charge into fixed compute you provision once and reuse. There is no token meter on a model running on a GPU you already pay for. For ingestion-heavy and re-indexing-heavy systems, the point where self-hosting wins arrives fast.
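
One way to find that point for your own workload is a break-even calculation: the monthly token volume at which a per-token bill equals the fixed cost of a GPU. The sketch below assumes a single always-on GPU for 730 hours a month and uses the spot A10G rate and API prices from the comparison tables in the next section. It ignores engineering time and does not check whether one GPU can serve the volume, so treat the output as a rough guide.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def break_even_billion_tokens(
    gpu_price_per_hour: float,
    api_price_per_billion_tokens: float,
    gpus: int = 1,
    hours: float = HOURS_PER_MONTH,
) -> float:
    """Monthly volume (in billions of tokens) where API spend equals GPU spend."""
    fixed_monthly_cost = gpu_price_per_hour * hours * gpus
    return fixed_monthly_cost / api_price_per_billion_tokens


# Prices per 1B tokens taken from the tables in the next section.
API_PRICE_PER_BILLION = {"encode": 20.0, "score": 87.0, "extract": 140.0}
SPOT_A10G_PER_HOUR = 0.38

for task, api_price in API_PRICE_PER_BILLION.items():
    volume = break_even_billion_tokens(SPOT_A10G_PER_HOUR, api_price)
    print(f"{task}: break-even at about {volume:.2f}B tokens per month")
```

Running the GPU only while there is work, for example with scale-from-zero, lowers `hours` and moves the break-even point down with it.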

Open source is what makes this practical in 2026. There is now a capable open model for nearly every retrieval and document task: dense and sparse embeddings, multi-vector ColBERT-style retrieval, cross-encoder rerankers, classification, named entity recognition, relationship extraction, and OCR. Roughly 100,000 models are uploaded to Hugging Face every month, and even the strongest proprietary models tend to get an equally capable open alternative within months. The models were never the missing piece. The infrastructure to run many of them efficiently has been.

[Clone SIE and have it running locally in two minutes](https://github.com/superlinked/sie). The same container scales from a laptop to a Kubernetes cluster.

## How much cheaper is self-hosting, really?

Here is the public cost comparison from the [Superlinked homepage](https://superlinked.com/), measured per one billion tokens, so you can check it against your own volume. Headline figure: up to 50x cheaper than managed model APIs, workload dependent.

**Encode (text to vectors), cost per 1B tokens**

| Provider | Cost | Basis |
| --- | --- | --- |
| OpenAI API | $20 | text-embedding-3-small at $0.02 / 1M tokens |
| Modal + TEI | $1.30 | bge-base on A10G at $1.10 / hr |
| Your Cloud + SIE | $0.50 | bge-base on spot A10G at $0.38 / hr |

**Score (reranking), cost per 1B tokens**

| Provider | Cost | Basis |
| --- | --- | --- |
| Cohere Rerank | $87 | Rerank 3.5 at $2 / 1K queries |
| Vertex AI Ranking | $43 | Ranking API at $1 / 1K queries |
| Your Cloud + SIE | $8.50 | modernbert-base on spot A10G at $0.38 / hr |

**Extract (entities and structured data), cost per 1B tokens**

| Provider | Cost | Basis |
| --- | --- | --- |
| OpenAI API | $140 | GPT-4.1 Nano at $0.10 / 1M input |
| Google Cloud NL | $5,000 | Entity Analysis at $1 / 1M chars |
| AWS Comprehend | $5,000 | Entity Recognition at $1 / 1M chars |
| Your Cloud + SIE | $5 | GLiNER on spot A10G at $0.38 / hr |

The pattern holds across all three primitives. The unit economics are different in kind: you pay for a GPU hour you already control, not for every token that crosses a third-party boundary. As a directional industry signal, Intercom's Chief AI Officer described on the Chain of Thought podcast cutting roughly 250,000 US dollars a month by replacing a general model with a fine-tuned 14B open model for one pipeline task. Treat any single number as workload-specific and verify it against your own traffic.
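
To check the tables against your own benchmarks, it helps to back out what each self-hosted row implies. The sketch below takes the table figures as given and derives the savings ratio, the GPU hours per 1B tokens, and the sustained throughput a spot A10G would need to hit those costs. The derived throughput is implied by the table, not measured here.

```python
SPOT_A10G_PER_HOUR = 0.38

# Cost per 1B tokens from the tables above: cheapest listed API vs. self-hosted.
ROWS = {
    "encode": {"api": 20.0, "self_hosted": 0.50},
    "score": {"api": 43.0, "self_hosted": 8.50},
    "extract": {"api": 140.0, "self_hosted": 5.00},
}


def savings_ratio(api_cost: float, self_hosted_cost: float) -> float:
    """How many times cheaper the self-hosted row is."""
    return api_cost / self_hosted_cost


def implied_gpu_hours(cost_per_billion: float, price_per_hour: float) -> float:
    """GPU hours needed per 1B tokens for the cost figure to hold."""
    return cost_per_billion / price_per_hour


def implied_tokens_per_second(cost_per_billion: float, price_per_hour: float) -> float:
    """Sustained throughput needed per GPU for the cost figure to hold."""
    seconds = implied_gpu_hours(cost_per_billion, price_per_hour) * 3600
    return 1_000_000_000 / seconds


for task, row in ROWS.items():
    ratio = savings_ratio(row["api"], row["self_hosted"])
    hours = implied_gpu_hours(row["self_hosted"], SPOT_A10G_PER_HOUR)
    tps = implied_tokens_per_second(row["self_hosted"], SPOT_A10G_PER_HOUR)
    print(f"{task}: {ratio:.1f}x cheaper, {hours:.1f} GPU-hours, {tps:,.0f} tokens/s")
```

If your own measured tokens per second on the same hardware falls well below the implied figure, scale the self-hosted cost up by the same factor before comparing.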

[Pressure-test the numbers on your workload](https://github.com/superlinked/sie).

## Does cheaper inference mean worse quality?

For retrieval, reranking, classification, and extraction, no. Quality on these tasks comes from picking the right model and reranking well, not from raw model size.

Two things make that real rather than hopeful. First, the catalog. SIE ships 85+ pre-configured open models across encoders, rerankers, and extractors, including multilingual and multi-vector options like BGE-M3 and ColBERT, and every supported model is checked against quality and latency targets in CI using MTEB. Second, the architecture. A strong reranker lets you retrieve fewer, better passages, so the expensive model receives a tight, high-signal prompt instead of a dump. Fewer tokens in, better answers out, lower bill. That is the rare optimization where cost and quality move the same direction, because tighter retrieval is also the cure for context rot.

## What about control and portability, not just cost?

Self-hosting earns its place even for teams that are not cost-constrained yet.

**Control.** When you send prompts and documents to a third-party endpoint, you hand over where that data lives and how it is processed. That is a real compliance exposure, not a hypothetical: in January 2025 the Italian data protection authority blocked a provider from processing local users' personal data after judging its answers to a privacy inquiry insufficient. Self-hosted inference keeps prompts and context inside your own AWS or GCP account. You pick the models, you pick the configuration, and the data does not leave your environment. SIE is SOC2 Type 2 certified and Apache 2.0 licensed.

**Portability.** Most enterprises now run hybrid, averaging more than two public cloud providers, and every customer cloud has its own compliance rules. A self-hosted engine that ships as a Helm chart for EKS and GKE plus Terraform modules for AWS and GCP installs the same way wherever your platform runs, including air-gapped environments from mirrored model weights. One inference layer covers both your own multi-tenant SaaS cloud and a customer's single-tenant cloud, with no bespoke deployment for either.

## Why not just self-host with vLLM or TEI yourself?

You can, and many teams start there. The gap is not the model server, it is everything around it.

Open source embedding and reranking models do not share one clean architecture. A BERT encoder, a Qwen-based model, and a ColBERT multi-vector model differ in how they handle attention and positional embeddings, so the right runtime and flags differ per model. Tools built for one large model spread across many GPUs are the wrong shape for many small models sharing a GPU. Pin one small encoder to one container on one GPU and the card sits mostly idle between requests. The common result of one-model-per-pool deployments is roughly 3 percent total GPU utilization. You buy more idle hardware to scale, not more throughput.

A purpose-built small-model engine closes that gap two ways. It keeps several models resident on the same GPU, loading them on demand and evicting the least recently used, so an L4 can keep two to three models hot while all 85+ stay available at query time. And it batches concurrent requests across the whole pool before each GPU pass. The architecture behind SIE shows the difference plainly: route work to a worker and then batch only its local slice and you land near 51 percent GPU efficiency, but let a free worker pull from one shared queue and batch across it and you reach 89 percent. That is about 1.8x the throughput per GPU at the same latency on small-model workloads. SIE also ships a tuned config per model, so you skip the per-model, per-GPU flag hunt that otherwise eats an afternoon for every model you add.
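
Both ideas fit in a few lines. The sketch below is a toy illustration, not SIE's implementation: `ModelCache` shows least-recently-used eviction with a caller-supplied loader, and the two fill functions compare how full batches get when requests are split across workers first versus pulled from one shared queue. All names and numbers are hypothetical, and the toy does not reproduce the 51 and 89 percent figures.

```python
from collections import OrderedDict
from typing import Any, Callable, List


class ModelCache:
    """Keeps at most `capacity` models resident, evicting the least recently used."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._resident: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, name: str, loader: Callable[[str], Any]) -> Any:
        if name in self._resident:
            self._resident.move_to_end(name)  # mark as most recently used
            return self._resident[name]
        if len(self._resident) >= self.capacity:
            self._resident.popitem(last=False)  # drop the least recently used
        self._resident[name] = loader(name)
        return self._resident[name]

    def resident(self) -> List[str]:
        return list(self._resident)


def _batches(pending: int, batch_size: int) -> List[int]:
    """Split a pending count into batches of at most batch_size."""
    sizes = []
    while pending > 0:
        take = min(pending, batch_size)
        sizes.append(take)
        pending -= take
    return sizes


def fill_routed(requests: int, workers: int, batch_size: int) -> float:
    """Average batch fill when requests are split evenly, then batched per worker."""
    per_worker = [
        requests // workers + (1 if i < requests % workers else 0)
        for i in range(workers)
    ]
    sizes = [size for n in per_worker for size in _batches(n, batch_size)]
    return sum(sizes) / (len(sizes) * batch_size) if sizes else 0.0


def fill_shared(requests: int, batch_size: int) -> float:
    """Average batch fill when free workers pull batches from one shared queue."""
    sizes = _batches(requests, batch_size)
    return sum(sizes) / (len(sizes) * batch_size) if sizes else 0.0


cache = ModelCache(capacity=2)
for model_name in ["encoder-a", "reranker-b", "encoder-a", "extractor-c"]:
    cache.get(model_name, loader=lambda n: f"<weights for {n}>")
print(cache.resident())

print(fill_routed(requests=40, workers=4, batch_size=32))
print(fill_shared(requests=40, batch_size=32))
```

In this toy case the routed split produces four batches that are each less than a third full, while the shared queue produces two batches with twice the average fill. That is the mechanism behind the efficiency gap described above.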

[Read the architecture, Helm chart, and Terraform modules](https://github.com/superlinked/sie).

## How do I move my agent's inference into my own cloud?

Make it a drop-in swap, then migrate at your own pace. The whole point is that it should take two commands, not a quarter.

SIE exposes an OpenAI-compatible embeddings endpoint and ships dedicated migration guides for OpenAI, Cohere, and TEI, so existing code can point at your own cluster instead of a paid API. The native SDK gives you three primitives, encode, score, and extract, in one server, with Python and TypeScript SDKs and integrations for LangChain, LlamaIndex, Haystack, DSPy, CrewAI, and the major vector databases including Chroma, Qdrant, and Weaviate. A typical path:

1. Start the container locally and run one encode and one score call to confirm parity with your current API output.
2. Point one high-volume path, usually ingestion embeddings or reranking, at SIE behind a flag.
3. Stand up the cluster: `terraform apply` to create the GPU infrastructure, then `helm install sie` to deploy. Turn on KEDA autoscaling with scale-from-zero and watch utilization on the bundled Grafana dashboards.
4. Migrate the remaining retrieval and extraction calls once the cost and quality numbers hold.
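
Step 2 can be as small as a flag that selects the base URL. The sketch below builds an OpenAI-style embeddings request (a POST to `/embeddings` with `model` and `input` in the JSON body) using only the standard library, and it does not send anything. The self-hosted address, the model ids, and the environment variable names are assumptions for illustration, so take the real values from your deployment and the SIE migration guides.

```python
import json
import os
import urllib.request
from typing import List

OPENAI_BASE_URL = "https://api.openai.com/v1"
OPENAI_MODEL = "text-embedding-3-small"

# Assumption: hypothetical address and model id for a self-hosted,
# OpenAI-compatible server. Replace with your cluster's values.
SELF_HOSTED_BASE_URL = "http://localhost:8080/v1"
SELF_HOSTED_MODEL = "bge-base"


def build_embeddings_request(
    texts: List[str], use_self_hosted: bool
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style embeddings request."""
    base_url = SELF_HOSTED_BASE_URL if use_self_hosted else OPENAI_BASE_URL
    model = SELF_HOSTED_MODEL if use_self_hosted else OPENAI_MODEL
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    api_key = os.environ.get("EMBEDDINGS_API_KEY")  # hypothetical variable name
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=f"{base_url}/embeddings", data=body, headers=headers, method="POST"
    )


# Hypothetical feature flag: flip it per environment to move one path at a time.
use_self_hosted = os.environ.get("USE_SELF_HOSTED_EMBEDDINGS") == "1"
request = build_embeddings_request(["hello world"], use_self_hosted)
print(request.get_method(), request.full_url)
```

Because only the base URL and model id change, the calling code stays the same on both sides of the flag, which keeps rollback to a one-line change.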

You do not need to move your frontier reasoning model to start saving. You move the layer that quietly compounds, the embeddings, reranking, and extraction running under every search and agent, and you tighten the context those models feed upstream. The meter on that work simply turns off.

## FAQ

**Why did my GitHub Copilot bill jump in June 2026?**
On June 1, 2026, GitHub switched Copilot to usage-based billing and raised model multipliers for annual Pro and Pro+ subscribers who stayed on the legacy request system. Per GitHub's documentation, frontier models now consume far more of your allowance: Claude Opus 4.6 rose from 3x to 27x, Sonnet 4.6 from 1x to 9x, and GPT-5.4 mini from 0.33x to 6x. Reporting describes agentic sessions costing 10x to 50x more for power users. Base seat prices were unchanged.

**Why does my token bill rise when I did not change anything?**
Several 2026 levers move your cost without any code change: new tokenizers can bill more tokens for the same input, long-context tiers re-price requests above certain thresholds, enterprise minimums have risen, and per-token pricing scales with your own growing traffic and corpus.

**Does self-hosting embeddings actually save money?**
For high-volume workloads, usually yes. Per-token embedding and reranking prices scale with traffic and corpus size, while self-hosted small models convert that variable cost into fixed compute you provision and reuse. Ingestion-heavy and re-indexing-heavy systems reach the crossover point fast. Compare the per-1B-token table above against your volume.

**Can open source models match proprietary quality for search?**
For embedding, reranking, classification, and extraction, well-chosen open models are competitive, which is why SIE verifies every model against MTEB quality and latency targets in CI. Quality on these tasks is about model choice and reranking, not size.

**Will I lose control of my data?**
No. Self-hosted inference runs inside your own AWS or GCP account, so prompts and context never leave your environment. SIE is SOC2 Type 2 certified, Apache 2.0 licensed, and can run air-gapped.

**Do I have to replace my main LLM?**
No. The savings come from moving the high-volume small-model layer, embeddings, reranking, and extraction, off per-token APIs, plus shrinking the prompt your large model receives. Your reasoning model can stay where it is.

## The takeaway

The Copilot hike is a preview. As long as the meter belongs to someone else, the price is theirs to raise, and the bill that hurts is rarely just the chat model. It is the steady drip of embedding, reranking, and extraction calls underneath every search and agent, plus the context bloat that weak retrieval forces on you. Open source small models address both halves, and the piece that was missing was never the models, it was the infrastructure to run them efficiently on hardware you control. That is what the Superlinked Inference Engine is for.

[Star SIE on GitHub, browse the 85+ models, and run the quickstart](https://github.com/superlinked/sie). Self-host the inference that powers your agents, and stop renting other people's tokens.

## Sources

- GitHub Copilot model multipliers for annual plans, official documentation: https://docs.github.com/en/copilot/reference/copilot-billing/model-multipliers-for-annual-plans
- GitHub blog, Copilot moving to usage-based billing: https://github.blog/news-insights/company-news/github-copilot-is-moving-to-usage-based-billing/
- TechTimes, Copilot pricing change and projected 10x to 50x increases: https://www.techtimes.com/articles/317536/20260601/github-copilot-pricing-change-drives-backlash-agentic-bills-jump-10x-50x-power-users.htm
- Finout, Anthropic API pricing 2026 (Opus 4.7 tokenizer note): https://www.finout.io/blog/anthropic-api-pricing
- Finout, OpenAI vs Anthropic pricing (GPT-5.4 long-context surcharge): https://www.finout.io/blog/openai-vs-anthropic-api-pricing-comparison
- Anthropic enterprise pricing analysis, April 2026: https://www.claudeapi.com/en/blog/news/anthropic-enterprise-pricing-analysis-2026/
- Investing.com, the AI token pricing crisis (per-engineer API spend): https://www.investing.com/analysis/the-ai-token-pricing-crisis-behind-openai-and-anthropics-revenue-race-200680777
- Superlinked homepage cost comparison table: https://superlinked.com/
- Superlinked Inference Engine on GitHub: https://github.com/superlinked/sie