Retrieval June 14, 2026

Hundreds of models, one deployment: how to kill the server-per-model sprawl

By Superlinked

Picture the usual setup: forty models, forty Helm releases, forty autoscalers, forty dashboards, and a fleet of GPUs sitting mostly idle because each model holds its own hardware whether or not traffic is hitting it.

That pattern does not survive contact with a real model catalog.

The fix is to stop treating the model as a property of the deployment and start treating it as a parameter of the request.

The Superlinked Inference Engine (SIE) does exactly that: one open-source server runs 85+ models on shared GPUs, loading each on demand and evicting the least-recently-used ones when memory fills.

You manage one cluster, and the model count inside it is just data.

Source: github.com/superlinked/sie.

How do I manage hundreds of models without creating separate deployments for each one?

Run them all from one SIE server that loads each model on demand, shares the GPU across them, and evicts the least-recently-used ones under memory pressure. Adding a model becomes a config write rather than a new deployment.

On-demand loading through a three-tier cache

You never pre-declare the full set. You request a model by its Hugging Face identifier, and SIE resolves it through three tiers, checked in order:

1. Local disk cache on the node, with least-recently-used eviction once disk usage passes a configurable threshold.
2. Shared cluster cache (an S3 or GCS bucket), so worker pods do not each re-download the same weights.
3. Hugging Face Hub as the source of truth on a cold miss.
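
The lookup order above can be illustrated with a small, self-contained sketch. It is not SIE's implementation: the `TieredResolver` class is hypothetical, plain dicts stand in for the disk cache, the bucket, and the Hub, and backfilling the shared tier after a Hub download is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TieredResolver:
    """Toy three-tier lookup: local disk, shared bucket, then the Hub.

    Hypothetical illustration only; dicts stand in for the real stores.
    """
    hub: dict                                    # stand-in for Hugging Face Hub
    shared: dict = field(default_factory=dict)   # stand-in for the S3/GCS bucket
    local: dict = field(default_factory=dict)    # stand-in for node-local disk
    log: list = field(default_factory=list)      # which tier served each request

    def resolve(self, model_id: str) -> bytes:
        # Tier 1: node-local cache, no network involved.
        if model_id in self.local:
            self.log.append((model_id, "local"))
            return self.local[model_id]
        # Tier 2: shared cluster cache, so other nodes' downloads are reused.
        if model_id in self.shared:
            self.log.append((model_id, "shared"))
            weights = self.shared[model_id]
        else:
            # Tier 3: cold miss, fetch from the source of truth.
            # Assumption: the shared tier is backfilled for later nodes.
            self.log.append((model_id, "hub"))
            weights = self.hub[model_id]  # raises KeyError for unknown IDs
            self.shared[model_id] = weights
        self.local[model_id] = weights
        return weights


# Two nodes share one bucket; only the first request reaches the Hub.
hub = {"BAAI/bge-m3": b"weights"}
bucket = {}
node_a = TieredResolver(hub=hub, shared=bucket)
node_b = TieredResolver(hub=hub, shared=bucket)
node_a.resolve("BAAI/bge-m3")   # served from the Hub stand-in
node_b.resolve("BAAI/bge-m3")   # served from the shared bucket stand-in
node_a.resolve("BAAI/bge-m3")   # served from node_a's local cache
```

The point of the middle tier shows up in `node_b`: it never touches the Hub, because `node_a` already paid for the download.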

The first call to a model downloads and loads it. Every call after that is served warm, in milliseconds. When GPU memory fills, the least-recently-used model is evicted to make room. All 85+ models stay available at query time regardless of what is resident, because an evicted model is simply loaded again on its next request.

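from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Two different embedding models through the same client: each loads on its first request.
client.encode("NovaSearch/stella_en_400M_v5", Item(text="loaded on first use"))
client.encode("BAAI/bge-m3", Item(text="no deployment step"))

# A reranker scores a query against candidate documents on the same server.
client.score("BAAI/bge-reranker-v2-m3", Item(text="q"), [Item(text="d")])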
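
The eviction behavior described in this section (least-recently-used under a fixed memory budget) can be sketched in a few lines. This is a toy model, not SIE's scheduler: the `LRUModelCache` class, the model names, and the sizes in MB are all hypothetical.

```python
from collections import OrderedDict


class LRUModelCache:
    """Toy LRU cache keyed by model ID, bounded by a memory budget in MB."""

    def __init__(self, budget_mb):
        self.budget_mb = budget_mb
        self.resident = OrderedDict()  # model_id -> size_mb, oldest first
        self.evicted = []              # eviction history, for inspection

    def used_mb(self):
        return sum(self.resident.values())

    def request(self, model_id, size_mb):
        """Return "warm" if the model is resident, else load it and return "cold"."""
        if model_id in self.resident:
            self.resident.move_to_end(model_id)  # mark as most recently used
            return "warm"
        if size_mb > self.budget_mb:
            raise ValueError(f"{model_id} does not fit in {self.budget_mb} MB")
        # Evict from the least-recently-used end until the new model fits.
        while self.resident and self.used_mb() + size_mb > self.budget_mb:
            victim, _ = self.resident.popitem(last=False)
            self.evicted.append(victim)
        self.resident[model_id] = size_mb
        return "cold"


cache = LRUModelCache(budget_mb=1000)
results = [
    cache.request("model-a", 400),  # first use: loaded
    cache.request("model-b", 400),  # first use: loaded
    cache.request("model-a", 400),  # already resident
    cache.request("model-c", 400),  # no room: model-b is the LRU victim
    cache.request("model-b", 400),  # evicted earlier, so it loads again
]
```

Note that the last request still succeeds: eviction removes a model from memory, not from the catalog, which is why the set of servable models can exceed what fits on the GPU at once.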

Adding a model is a config write, not a release

New models do not mean new images or new deployments. You register the config against the running cluster through sie-config, the single authoritative writer for model configuration:

curl -X POST http://your-cluster/v1/configs/models \
  -H "Content-Type: application/json" \
  -d '{ "model_id": "your-org/your-model", "...": "..." }'

sie-config persists the addition and publishes the change to the gateways and worker pods, which converge asynchronously. Details: Adding Models and the Config API.
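
For teams that register models from a script rather than a shell, the same POST can be built with the Python standard library. This is a minimal sketch: the endpoint path and the `model_id` field come from the curl example, while the helper name `build_model_config_request` is hypothetical. The remaining config fields are elided in the example above and are left to the caller here; the Config API documentation is the place to find them.

```python
import json
import urllib.request


def build_model_config_request(base_url, config):
    """Build (but do not send) the POST that registers a model config.

    Hypothetical helper mirroring the curl call; `config` must carry at
    least "model_id", plus whatever other fields the Config API requires.
    """
    if "model_id" not in config:
        raise ValueError("config must include 'model_id'")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/configs/models",
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_model_config_request(
    "http://your-cluster",
    {"model_id": "your-org/your-model"},  # other fields omitted, as in the curl example
)
# To actually send it against a running cluster:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test before it touches a live cluster.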

The same artifact runs locally and in the cluster

There is no separate production mode. The code above runs unchanged against a local Docker server and against Kubernetes. In the cluster, a stateless Rust gateway routes each request to a worker pool, work flows through a NATS JetStream queue, and KEDA scales pools to zero when idle and back up on demand:

helm upgrade --install sie-cluster oci://ghcr.io/superlinked/charts/sie-cluster \
  --namespace sie --create-namespace \
  --set hfToken.create=true \
  --set hfToken.value=<TOKEN> \
  -f deploy/helm/sie-cluster/values-gke.yaml
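
To make the scale-to-zero idea concrete, here is a deliberately simplified sketch of queue-driven replica sizing. It is not KEDA's algorithm or SIE's configuration: the function name, the per-replica target, and the replica cap are hypothetical, and real autoscalers add cooldown periods and stabilization on top of this.

```python
import math


def desired_replicas(queue_depth, target_per_replica, max_replicas):
    """Size a worker pool from pending work: zero when idle, capped when busy."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if queue_depth <= 0:
        return 0  # nothing queued: the pool can scale to zero
    # Enough replicas that each handles at most `target_per_replica` items.
    return min(max_replicas, math.ceil(queue_depth / target_per_replica))


idle = desired_replicas(queue_depth=0, target_per_replica=10, max_replicas=8)
first = desired_replicas(queue_depth=1, target_per_replica=10, max_replicas=8)
busy = desired_replicas(queue_depth=25, target_per_replica=10, max_replicas=8)
capped = desired_replicas(queue_depth=500, target_per_replica=10, max_replicas=8)
```

A single queued request is enough to bring a pool back from zero, which is what lets idle pools release their GPUs without making their models unavailable.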

What you actually operate now

One deployment. One autoscaler policy. One observability stack. A list of model identifiers. Adding the next model is a config write, not an infrastructure change, which is the whole point: the number of models you serve stops being the number of things you run.

Start from the deployment overview or clone the engine: github.com/superlinked/sie. Collapsing a hundred deployments into one tends to be worth a star.