Building production inference: routing, batching, model configs, and LoRA in one cluster

One system handles all four.

The Superlinked Inference Engine (SIE) puts routing in a stateless gateway, batching in the worker pods, model configuration in a single-writer control plane, and LoRA adapters in a per-request option.

Your application keeps calling encode, score, and extract, and the cluster does the production work underneath.

It is open source: github.com/superlinked/sie.

Each of the four gets its own section below.

How do I manage routing, batching, model configs, and LoRA adapters for production inference?

SIE assigns each concern to one component: routing to a stateless gateway, batching to the worker pods, model configuration to a single-writer control plane, and LoRA to a per-request option. You operate one cluster and your code keeps calling three functions.

Routing

A stateless Rust gateway sits between clients and GPU worker pods. For each request, it resolves the model, bundle, machine profile, and resource pool from an in-memory registry, then publishes the work to a NATS JetStream queue. It also tracks worker health from heartbeats and isolates capacity with resource pools.

You never configure routes by hand. You name a model and the gateway places it:

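from sie_sdk import SIEClient
from sie_sdk.types import Item

# Point the client at the gateway. It picks the worker pool for you.
client = SIEClient("http://your-gateway:8080")
# Name the model and pass the item. No route configuration is involved.
client.encode("BAAI/bge-m3", Item(text="routed by the gateway"))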

If the target pool is scaled to zero, the gateway returns 202 Accepted with a Retry-After header, and the SDK waits for capacity when you set wait_for_capacity=True. That is how scale-from-zero stays invisible to your code.
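If you call the gateway without the SDK, you have to honor that 202 yourself. Here is a minimal sketch of such a wait loop. It is not the SDK's implementation: `retry_until_ready`, the `send` callable, and the timeout values are all hypothetical names chosen for illustration, and the loop only understands the delay-in-seconds form of Retry-After.

```python
import time
from typing import Callable, Mapping, Tuple

# Hypothetical transport shape: (status_code, headers, body).
Response = Tuple[int, Mapping[str, str], bytes]


def retry_until_ready(
    send: Callable[[], Response],
    max_wait_s: float = 300.0,
    default_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Re-send a request while the server answers 202 with Retry-After."""
    waited = 0.0
    while True:
        status, headers, body = send()
        if status != 202:
            # Anything other than 202 is a final answer for this sketch.
            return status, headers, body
        # Retry-After can also be an HTTP date. This sketch handles only
        # whole seconds and falls back to a default delay otherwise.
        # Header lookup here is case-sensitive, unlike real HTTP clients.
        raw = headers.get("Retry-After", "")
        delay = float(raw) if raw.isdigit() else default_delay_s
        if waited + delay > max_wait_s:
            raise TimeoutError(f"no capacity after {waited:.0f}s")
        sleep(delay)
        waited += delay
```

Passing `sleep` as a parameter keeps the loop testable without real waiting. In practice the SDK's wait_for_capacity=True is the simpler path.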

Batching

Batching happens next to the GPU. Each worker pod runs a sidecar that pulls work from its queue and groups requests into batches by model, operation, and LoRA key, then hands fully formed batches to the model-execution process over local IPC. Keying on those three fields is what keeps batches correct: only requests for the same model, same operation, and same adapter combine.
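To make the keying concrete, here is a small sketch of grouping by model, operation, and LoRA key. It illustrates the idea only and is not the sidecar's code: the `Request` type, `group_into_batches`, and the `max_batch_size` cap are assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Request:
    model: str
    operation: str         # "encode", "score", or "extract"
    lora: Optional[str]    # None means the base model with no adapter
    payload: str


BatchKey = Tuple[str, str, Optional[str]]


def group_into_batches(
    requests: List[Request], max_batch_size: int
) -> List[List[Request]]:
    """Group requests so each batch shares one (model, operation, lora) key."""
    buckets: Dict[BatchKey, List[Request]] = defaultdict(list)
    for req in requests:
        buckets[(req.model, req.operation, req.lora)].append(req)

    batches: List[List[Request]] = []
    for bucket in buckets.values():
        # Split each bucket into chunks no larger than the cap.
        for start in range(0, len(bucket), max_batch_size):
            batches.append(bucket[start:start + max_batch_size])
    return batches
```

Two requests for the same model land in different batches as soon as their operation or adapter differs, which is the correctness property described above.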

Your lever is request shape. Send items in lists so the server can batch them:

client.encode("NovaSearch/stella_en_400M_v5", [Item(text=c) for c in chunks])

Batch-size and concurrency tuning live in Performance Tuning.

Model configs

Configuration is owned by sie-config, the authoritative control plane. It runs as a single writer, persists model configs, and publishes runtime deltas, while gateways and worker pods converge asynchronously. Keeping writes off the hot path is what lets the inference edge stay stateless. A config write is a single POST:

curl -X POST http://your-cluster/v1/configs/models \
-H "Content-Type: application/json" \
-d '{ "model_id": "your-org/your-encoder", "...": "..." }'

Gateways bootstrap from a full snapshot via GET /v1/configs/export, subscribe to live deltas, and poll GET /v1/configs/epoch to recover anything missed. To version-control this, see the Config GitOps workflow.
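The snapshot, delta, and epoch steps can be sketched as a small replica class. This is an illustration under stated assumptions, not the gateway's code: it assumes the epoch is an integer that increases by one per config change and that each delta carries its epoch. `ConfigReplica`, `fetch_snapshot`, and the method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

# Hypothetical snapshot shape: (epoch, model_id -> config).
Snapshot = Tuple[int, Dict[str, dict]]


@dataclass
class ConfigReplica:
    fetch_snapshot: Callable[[], Snapshot]  # stands in for the export call
    epoch: int = 0
    models: Dict[str, dict] = field(default_factory=dict)

    def bootstrap(self) -> None:
        """Replace local state with a full snapshot."""
        epoch, models = self.fetch_snapshot()
        self.epoch = epoch
        self.models = dict(models)

    def apply_delta(self, epoch: int, model_id: str, config: dict) -> None:
        """Apply one live delta, resyncing if a delta was skipped."""
        if epoch <= self.epoch:
            return  # stale or duplicate delta, already reflected locally
        if epoch != self.epoch + 1:
            self.bootstrap()  # gap detected, fall back to a snapshot
            return
        self.models[model_id] = config
        self.epoch = epoch

    def check_epoch(self, server_epoch: int) -> None:
        """Periodic poll: resync when the server is ahead of us."""
        if server_epoch > self.epoch:
            self.bootstrap()
```

The replica never writes. It only converges toward whatever the single writer has published, which is why a missed delta costs one extra snapshot fetch and nothing more.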

LoRA adapters

Adapters are a per-request option, never a separate deployment. You pass the adapter name on the call; the base model loads once and is shared, the adapter applies on top, and the batching layer keys on the adapter so different adapters batch separately.

client.encode("BAAI/bge-m3", Item(text="indemnification"), options={"lora": "legal"})

The model-execution process owns model loading, LoRA loading, and memory-pressure eviction. Full details: LoRA Adapters.
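One way to picture adapter eviction is a small cache that keeps a fixed number of adapters resident and drops the least recently used one. This sketch assumes an LRU policy and a count-based capacity, neither of which the text above confirms. `AdapterCache` and its `load` callable are hypothetical.

```python
from collections import OrderedDict
from typing import Callable, Generic, List, TypeVar

A = TypeVar("A")


class AdapterCache(Generic[A]):
    """Keep at most `capacity` adapters loaded, evicting the LRU one."""

    def __init__(self, load: Callable[[str], A], capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._load = load
        self._capacity = capacity
        self._adapters: "OrderedDict[str, A]" = OrderedDict()

    def get(self, name: str) -> A:
        if name in self._adapters:
            # Mark as most recently used.
            self._adapters.move_to_end(name)
            return self._adapters[name]
        adapter = self._load(name)
        self._adapters[name] = adapter
        if len(self._adapters) > self._capacity:
            # Drop the least recently used adapter. The base model is
            # not stored here and stays loaded.
            self._adapters.popitem(last=False)
        return adapter

    def resident(self) -> List[str]:
        """Adapter names currently loaded, oldest first."""
        return list(self._adapters)
```

A real implementation would more likely react to measured GPU memory than to a fixed count, but the access pattern is the same: hot adapters stay resident and cold ones reload on demand.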

How the four fit together

Client SDK
  -> sie-gateway            routing: resolves model, bundle, profile, pool
  -> NATS JetStream queue
  -> worker sidecar         batching: groups by model, operation, LoRA key
  -> sie-server             execution: model + LoRA loading, GPU inference
Admin -> sie-config         model configs: single-writer control plane

Four responsibilities, one cluster to operate. Stand it up with the same Helm chart used everywhere else:

helm upgrade --install sie-cluster oci://ghcr.io/superlinked/charts/sie-cluster \
--namespace sie --create-namespace \
--set hfToken.create=true \
--set hfToken.value=<TOKEN> \
-f deploy/helm/sie-cluster/values-gke.yaml

