
Gateway

The SIE gateway is a stateless Rust service that sits between clients and GPU workers. It handles routing, queue submission, resource pools, worker health, read-side config, and scale-from-zero orchestration.

This page keeps the /docs/engine/router/ URL for compatibility, but the deployed component is named sie-gateway.

Not every deployment needs a gateway. The deciding factor is whether you are running an elastic worker fleet:

  • Single server (local dev, single Docker container): connect the SDK directly to sie-server.
  • Kubernetes clusters: use the gateway. It provides a stable client endpoint, worker discovery, queue-based inference, scale-from-zero, resource pools, and config read endpoints.
  • Horizontal gateway replicas: supported. Each replica keeps its own in-memory registry and converges through bootstrap, NATS config deltas, and epoch polling.

| Setup | Use Gateway? | Why |
| --- | --- | --- |
| Single Docker container | No | Connect the SDK directly to the worker |
| Docker Compose (multi-worker) | Optional | Useful for a single client endpoint in local tests |
| Kubernetes | Yes | Required for worker discovery, queue routing, scale-from-zero, and pool isolation |

Gateway architecture: SDK/HTTP Client to gateway, NATS queue, and GPU workers

The gateway is stateless with respect to durable data. It owns in-memory routing state, but it does not persist config and it does not execute inference.

Client request
-> sie-gateway resolves model, bundle, machine profile, and pool
-> gateway publishes msgpack work items to NATS JetStream
-> matching workers consume and execute inference
-> workers publish msgpack results to the gateway's NATS Core inbox
-> gateway assembles and returns the HTTP response

Config writes are outside this hot path. Admin tooling writes to sie-config, and gateways mirror that state through /v1/configs/export, NATS deltas, and /v1/configs/epoch polling.


The gateway resolves every inference request to:

  1. Model and profile: the model path and optional :profile suffix.
  2. Bundle: selected by adapter compatibility, with the lowest numeric bundle priority winning by default.
  3. Machine profile: X-SIE-MACHINE-PROFILE header or SDK gpu parameter.
  4. Pool: default pool or explicit X-SIE-Pool / SDK pool/profile target.
  5. Queue subject: sie.work.{model}.{pool} on the pool’s JetStream stream.
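
The resolution steps above can be sketched in a few lines of Python. This is an illustration only, not the gateway's Rust implementation: the `Bundle` fields, the helper names, and the literal pool name `default` are assumptions made for the example, and the sketch does no escaping of model names inside the subject.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bundle:
    """Hypothetical bundle record; field names are assumptions for this sketch."""
    name: str
    priority: int         # lower number wins by default
    adapters: frozenset   # adapters this bundle is compatible with


def split_model_profile(target: str) -> tuple[str, Optional[str]]:
    """Split 'model:profile' into (model, profile); the suffix is optional."""
    model, sep, profile = target.partition(":")
    return model, (profile if sep else None)


def pick_bundle(bundles: list[Bundle], adapter: str) -> Bundle:
    """Keep adapter-compatible bundles, then take the lowest numeric priority."""
    compatible = [b for b in bundles if adapter in b.adapters]
    if not compatible:
        raise LookupError(f"no bundle supports adapter {adapter!r}")
    return min(compatible, key=lambda b: b.priority)


def queue_subject(model: str, pool: str = "default") -> str:
    """Build the work subject in the documented sie.work.{model}.{pool} shape."""
    return f"sie.work.{model}.{pool}"
```

For example, `queue_subject("BAAI/bge-m3", "tenant-abc")` yields `sie.work.BAAI/bge-m3.tenant-abc` under these assumptions.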

Unlike the previous Python router, the Rust gateway is queue-only for inference. There is no direct-HTTP fallback to workers. If the queue transport is unavailable, the gateway returns 503 instead of bypassing the queue.

Requests can specify a target machine profile:

# HTTP
curl -X POST http://gateway:8080/v1/encode/BAAI/bge-m3 \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -H "Content-Type: application/json" \
  -d '{"items": [{"text": "Hello world"}]}'

# SDK
result = client.encode("BAAI/bge-m3", Item(text="hello"), gpu="l4")

If the caller omits a machine profile, the gateway can use the default configured route.

When no healthy worker is registered for the selected (bundle, machine_profile) tuple and the caller did not pin a specific pool, the gateway returns:

HTTP/1.1 202 Accepted
Retry-After: 120
Content-Type: application/json

{
  "status": "provisioning",
  "gpu": "l4",
  "bundle": "default",
  "estimated_wait_s": 180,
  "message": "No worker available for GPU type 'l4'. Provisioning in progress."
}

The SDK handles this automatically with wait_for_capacity=True. See Scale-from-Zero for details.

202 is only for capacity provisioning. Unknown models fail fast with 404 once the gateway registry has bootstrapped. Incompatible explicit bundle choices fail with 409.
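
A client that does not use the SDK could treat these status codes roughly as follows. This is a minimal sketch, not the SDK's `wait_for_capacity` implementation: the `send` callable, the `(status, headers, body)` tuple shape, the 200 success code, and the `max_wait_s` budget are all assumptions chosen to keep the example self-contained.

```python
import time
from typing import Callable, Mapping, Tuple

# Hypothetical response shape: (status code, headers, decoded JSON body).
Response = Tuple[int, Mapping[str, str], dict]


def call_with_capacity_wait(
    send: Callable[[], Response],
    max_wait_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Retry on 202 using Retry-After; fail fast on 404, 409, and 503."""
    waited = 0.0
    while True:
        status, headers, body = send()
        if status == 200:
            return body
        if status == 202:
            # Capacity is being provisioned: wait as long as the gateway suggests.
            delay = float(headers.get("Retry-After", 5))
            if waited + delay > max_wait_s:
                raise TimeoutError("gave up waiting for worker capacity")
            sleep(delay)
            waited += delay
            continue
        if status == 404:
            raise LookupError("unknown model")
        if status == 409:
            raise ValueError("explicit bundle is incompatible with the model")
        if status == 503:
            raise ConnectionError("queue transport unavailable")
        raise RuntimeError(f"unexpected status {status}")
```

Only 202 is retried here, because it is the only status the text describes as a transient capacity signal.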


List worker URLs explicitly:

sie-gateway serve \
-w http://worker-1:8080 \
-w http://worker-2:8080 \
-w http://worker-3:8080

Auto-discover workers via Kubernetes service endpoints:

sie-gateway serve \
--kubernetes \
--k8s-namespace sie \
--k8s-service sie-worker \
--k8s-port 8080

In Kubernetes mode, the gateway watches endpoint changes and automatically registers or deregisters workers. Worker status is then tracked over WebSocket (/ws/status) so the gateway sees bundle, machine profile, queue depth, loaded models, health, and config hash.
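
Conceptually, each endpoint change reduces to a set difference between the workers already registered and the endpoints Kubernetes currently reports. The sketch below illustrates that idea in Python; the function name and the use of plain URL strings are assumptions, and the real gateway is written in Rust.

```python
def reconcile_workers(
    registered: set[str], endpoints: set[str]
) -> tuple[set[str], set[str]]:
    """Return (to_register, to_deregister) for one endpoint update."""
    to_register = endpoints - registered     # newly reported by Kubernetes
    to_deregister = registered - endpoints   # no longer reported
    return to_register, to_deregister


registered = {"http://10.0.0.1:8080", "http://10.0.0.2:8080"}
endpoints = {"http://10.0.0.2:8080", "http://10.0.0.3:8080"}
added, removed = reconcile_workers(registered, endpoints)
registered = (registered | added) - removed
```

After this update, `registered` contains the `10.0.0.2` and `10.0.0.3` workers only.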


Resource pools reserve dedicated workers for tenant isolation. Pool workers only serve requests for that pool.

client = SIEClient("http://gateway:8080")

# Reserve 2 L4 workers for this tenant
client.create_pool("tenant-abc", {"l4": 2})

# Route requests to the pool
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello"),
    gpu="tenant-abc/l4",  # pool_name/gpu_type
)

# Check pool status
info = client.get_pool("tenant-abc")

# Cleanup
client.delete_pool("tenant-abc")
  • Pools are represented in Kubernetes ConfigMaps and Leases.
  • The SDK renews pool leases automatically in a background thread.
  • Pools expire after their TTL unless renewed.
  • The default pool is protected and cannot be deleted.
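
To make the lease behaviour concrete, here is a rough sketch of a background renewer of the kind the SDK is described as running. It is not the SDK's code: the `LeaseRenewer` class, the `renew` callback, and the choice to renew every third of the TTL are assumptions for illustration.

```python
import threading
from typing import Callable


class LeaseRenewer:
    """Call `renew` periodically on a daemon thread until stopped."""

    def __init__(self, renew: Callable[[], None], ttl_s: float) -> None:
        self._renew = renew
        self._interval = ttl_s / 3  # assumed safety margin, well inside the TTL
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        # Once renewal stops, the pool would expire after its TTL.
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        # Event.wait returns False on timeout (time to renew)
        # and True as soon as stop() has been called.
        while not self._stop.wait(self._interval):
            self._renew()
```

A caller would pass a function that renews the pool lease, call `start()` after creating the pool, and call `stop()` before deleting it.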

The gateway serves read-side config endpoints from its in-memory registry:

| Endpoint | Purpose |
| --- | --- |
| GET /v1/configs/models | List models known to this gateway |
| GET /v1/configs/models/{id} | Return model YAML from the gateway registry |
| GET /v1/configs/models/{id}/status | Report per-replica worker ACK readiness |
| GET /v1/configs/bundles | List known bundles and connected worker counts |
| GET /v1/configs/bundles/{id} | Return bundle YAML |
| POST /v1/configs/resolve | Dry-run how a model, or an explicit bundle override, resolves to a bundle |

The gateway is not a config write authority. POST /v1/configs/models is not registered on the gateway and returns 405 Method Not Allowed; send writes to sie-config.

On startup, the gateway:

  1. Optionally loads filesystem seeds from SIE_BUNDLES_DIR and SIE_MODELS_DIR if an escape-hatch config map is mounted.
  2. Reads GET /v1/configs/epoch to capture the authoritative epoch and bundle-set hash.
  3. Fetches bundles from sie-config with GET /v1/configs/bundles{,/{id}}.
  4. Fetches model state with GET /v1/configs/export.
  5. Subscribes to sie.config.models._all for live deltas.
  6. Polls GET /v1/configs/epoch every 30 seconds to catch missed deltas or bundle-set drift.
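
The interplay between bootstrap, live deltas, and epoch polling can be sketched as a small in-memory mirror. This is a simplified illustration, not the gateway's implementation: the `ConfigMirror` class and its callbacks are hypothetical, epochs are assumed to be increasing integers, and the bundle-set hash and filesystem seeds are ignored.

```python
from typing import Callable, Mapping


class ConfigMirror:
    """In-memory copy of model config, kept current by deltas and epoch polls."""

    def __init__(
        self,
        fetch_epoch: Callable[[], int],
        fetch_export: Callable[[], Mapping[str, str]],
    ) -> None:
        self.epoch = 0
        self.models: dict[str, str] = {}
        self._fetch_epoch = fetch_epoch     # stands in for GET /v1/configs/epoch
        self._fetch_export = fetch_export   # stands in for GET /v1/configs/export

    def bootstrap(self) -> None:
        # Capture the epoch first, then load the full model state.
        epoch = self._fetch_epoch()
        self.models = dict(self._fetch_export())
        self.epoch = epoch

    def apply_delta(self, epoch: int, model_id: str, config: str) -> bool:
        # Ignore stale or duplicate deltas; this sketch does not detect gaps.
        if epoch <= self.epoch:
            return False
        self.models[model_id] = config
        self.epoch = epoch
        return True

    def poll(self) -> bool:
        # Periodic safety net: re-export if the authority is ahead of us.
        if self._fetch_epoch() > self.epoch:
            self.bootstrap()
            return True
        return False
```

In this model a missed delta is harmless as long as the next poll sees a higher authoritative epoch and triggers a fresh export.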

/readyz does not wait for sie-config. A fresh gateway can be ready before the first config bootstrap succeeds; during that window, typed requests may return 404 until the registry is populated.


The gateway aggregates health from all workers:

| Endpoint | Description |
| --- | --- |
| GET /healthz | Gateway liveness |
| GET /readyz | Gateway readiness; intentionally independent of sie-config reachability |
| GET /health | Cluster summary: worker count, GPU count, models loaded |
| GET /v1/models | Model list from the gateway registry |
| WS /ws/cluster-status | Real-time cluster metrics stream |

curl http://gateway:8080/health
{
"status": "healthy",
"worker_count": 3,
"gpu_count": 3,
"models_loaded": 12,
"configured_gpu_types": ["l4", "a100-80gb"],
"live_gpu_types": ["l4"]
}
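
One practical use of this response is spotting GPU types that are configured but currently have no live worker. The helper below is a small sketch built on the sample payload above; the function name is hypothetical, and reading such types as scaled to zero or still provisioning is an interpretation rather than something the response states.

```python
import json


def missing_gpu_types(health: dict) -> list[str]:
    """Return configured GPU types that have no live worker, in configured order."""
    live = set(health.get("live_gpu_types", []))
    return [g for g in health.get("configured_gpu_types", []) if g not in live]


sample = json.loads("""
{
  "status": "healthy",
  "worker_count": 3,
  "gpu_count": 3,
  "models_loaded": 12,
  "configured_gpu_types": ["l4", "a100-80gb"],
  "live_gpu_types": ["l4"]
}
""")

print(missing_gpu_types(sample))  # ['a100-80gb']
```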

Important gateway metrics include:

| Metric | Purpose |
| --- | --- |
| sie_gateway_requests_total | HTTP requests by endpoint, status, and machine profile |
| sie_gateway_request_latency_seconds | Gateway request latency |
| sie_gateway_pending_demand | KEDA scale-from-zero trigger by machine profile and bundle |
| sie_gateway_worker_queue_depth | Per-worker queue depth |
| sie_gateway_config_epoch | Highest config epoch applied on this gateway |
| sie_gateway_config_bootstrap_degraded | Whether bootstrap has been failing long enough to alert |
| sie_gateway_config_deltas_total | NATS config-delta processing outcomes |
| sie_gateway_nats_connected | Gateway NATS connection state |
