Gateway

The SIE gateway is a stateless Rust service that sits between clients and GPU worker pods. It handles routing, queue submission, resource pools, SIE server sidecar health, read-side config, and scale-from-zero orchestration. In Kubernetes, each worker pod runs the SIE server sidecar beside the Python sie-server adapter process; the sidecar pulls queued work and calls the adapter over IPC.

This page keeps the /docs/engine/router/ URL for compatibility, but the deployed component is named sie-gateway.

When to Use the Gateway

Not every deployment needs a gateway. The deciding factor is whether you are running an elastic worker fleet:

Single server (local dev or Docker): point the SDK at a standalone sie-server.
Kubernetes clusters: use the gateway. It provides a stable client endpoint, worker discovery, queue-based inference, scale-from-zero, resource pools, and config read endpoints.
Horizontal gateway replicas: supported. Each replica keeps its own in-memory registry and converges through bootstrap, NATS config deltas, and epoch polling.

Setup	Use Gateway?	Why
Single Docker container	No	One `sie-server` process handles the request path
Kubernetes	Yes	Required for worker discovery, queue routing, scale-from-zero, and pool isolation

Architecture

The gateway is stateless with respect to durable data. It owns in-memory routing state, but it does not persist config and it does not execute inference.

Client request
  -> sie-gateway resolves model, bundle, machine profile, and pool
  -> gateway publishes msgpack work items to NATS JetStream
  -> matching worker pod's SIE server sidecar pulls, batches, and calls the sie-server adapter over UDS IPC
  -> SIE server sidecar publishes msgpack results to the gateway's NATS Core inbox
  -> gateway assembles and returns the HTTP response

Config writes are outside this hot path. Admin tooling writes to sie-config, and gateways mirror that state through /v1/configs/export, NATS deltas, and /v1/configs/epoch polling.

Request Routing

The gateway resolves every inference request to:

Model and profile: the model path and optional :profile suffix.
Bundle: selected by adapter compatibility, with the lowest numeric bundle priority winning by default.
Machine profile: X-SIE-MACHINE-PROFILE header or SDK gpu parameter.
Pool: the default pool, or an explicit pool chosen with the X-SIE-Pool header or an SDK pool/profile target such as gpu="tenant-abc/l4".
Queue subject: sie.work.{model}.{pool} on the pool’s JetStream stream, consumed by the SIE server sidecar inside matching worker pods.

The Rust gateway is queue-only for inference. If the queue transport is unavailable, the gateway returns 503.
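The resolution steps above can be sketched in a few lines of Python. This is an illustrative model, not the gateway's Rust implementation: the Bundle and Route types, the adapters field, the adapter names, and the resolve helper are all hypothetical, and the real gateway may escape the model path before placing it in a NATS subject.

```python
from dataclasses import dataclass
from typing import Iterable, Mapping, Optional, Tuple


@dataclass(frozen=True)
class Bundle:
    name: str
    priority: int          # lower number wins by default
    adapters: frozenset    # hypothetical: adapters this bundle can serve


@dataclass(frozen=True)
class Route:
    model: str
    profile: Optional[str]
    bundle: str
    machine_profile: Optional[str]
    pool: str
    subject: str


def split_model_and_profile(model_ref: str) -> Tuple[str, Optional[str]]:
    """Split 'org/model:profile' into ('org/model', 'profile')."""
    model, sep, profile = model_ref.partition(":")
    return model, (profile if sep else None)


def pick_bundle(bundles: Iterable[Bundle], adapter: str) -> Bundle:
    """Choose the compatible bundle with the lowest numeric priority."""
    compatible = [b for b in bundles if adapter in b.adapters]
    if not compatible:
        raise LookupError(f"no bundle supports adapter {adapter!r}")
    return min(compatible, key=lambda b: b.priority)


def resolve(model_ref: str, adapter: str, bundles: Iterable[Bundle],
            headers: Mapping[str, str]) -> Route:
    model, profile = split_model_and_profile(model_ref)
    bundle = pick_bundle(bundles, adapter)
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    machine_profile = lowered.get("x-sie-machine-profile")
    pool = lowered.get("x-sie-pool", "default")
    # Assumption: the model path is substituted into the subject verbatim.
    subject = f"sie.work.{model}.{pool}"
    return Route(model, profile, bundle.name, machine_profile, pool, subject)
```

With two compatible bundles at priorities 10 and 20, the sketch picks the priority-10 bundle and, when no pool header is present, targets the default pool.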

GPU Routing

Requests can specify a target machine profile:

# HTTP
curl -X POST http://gateway:8080/v1/encode/BAAI/bge-m3 \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -H "Content-Type: application/json" \
  -d '{"items": [{"text": "Hello world"}]}'

Python

# SDK
result = client.encode("BAAI/bge-m3", Item(text="hello"), gpu="l4")

TypeScript

// SDK
const result = await client.encode("BAAI/bge-m3", { text: "hello" }, { gpu: "l4" });

If the caller omits a machine profile, the gateway can use the default configured route. Scale-from-zero returns 202 when the selected (bundle, machine_profile) has no fresh SIE server sidecar health and the caller did not pin an explicit pool.

202 Scale-from-Zero

When no healthy SIE server sidecar has recently published health for the selected (bundle, machine_profile) tuple and the caller did not pin a specific pool, the gateway returns:

HTTP/1.1 202 Accepted
Retry-After: 120
Content-Type: application/json

{
  "status": "provisioning",
  "gpu": "l4",
  "bundle": "default",
  "estimated_wait_s": 180,
  "message": "No worker available for GPU type 'l4'. Provisioning in progress."
}

The SDK handles this automatically with wait_for_capacity=True. See Scale-from-Zero for details.

202 is only for capacity provisioning. Unknown models fail fast with 404 once the gateway registry has bootstrapped. Incompatible explicit bundle choices fail with 409.
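For callers that talk to the gateway over plain HTTP rather than through the SDK, the 202 flow can be handled with a small retry loop. The sketch below is a minimal illustration and not the SDK's wait_for_capacity logic: call_with_capacity_wait, its parameters, and the injected send callable are hypothetical names. It retries only on 202, so a 404 or 409 is returned to the caller immediately.

```python
import time
from typing import Callable, Mapping, Tuple

# (status code, response headers, response body)
Response = Tuple[int, Mapping[str, str], bytes]


def call_with_capacity_wait(
    send: Callable[[], Response],
    max_wait_s: float = 600.0,
    default_delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Re-send a request while the gateway answers 202 Accepted."""
    waited = 0.0
    while True:
        status, headers, body = send()
        if status != 202:
            return status, headers, body
        try:
            # Assumes Retry-After carries a number of seconds.
            delay = float(headers.get("Retry-After", default_delay_s))
        except ValueError:
            # Retry-After may also be an HTTP date; fall back to the default.
            delay = default_delay_s
        if waited + delay > max_wait_s:
            raise TimeoutError(f"no capacity after waiting {waited:.0f}s")
        sleep(delay)
        waited += delay
```

Passing send as a callable keeps the loop independent of the HTTP client in use and makes it easy to exercise with canned responses.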

Sidecar Health and Discovery

The production Helm path runs the SIE server sidecar inside each worker pod and uses NATS health. The sidecar publishes sie.health.<worker_id> heartbeats with the worker pod’s bundle, machine profile, queue depth, loaded models, and bundle_config_hash; the gateway builds its routing registry from those heartbeats.

Mode	Used for	Health source
`nats`	Default chart path with SIE server sidecar	`sie.health.<worker_id>` heartbeats from the SIE server sidecar
`ws`	Local status diagnostics	Python `sie-server` `/ws/status` stream
`static`	Explicit local diagnostics	Operator-provided worker URLs

The gateway still owns pool state through Kubernetes ConfigMaps and Leases. Kubernetes is not on the inference request path; queued work moves through NATS JetStream.
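To make the relationship between heartbeats and the 202 decision concrete, here is a minimal in-memory model of a heartbeat-driven registry. It is a sketch under stated assumptions: the HealthRegistry class, the 15-second freshness window, and the method names are hypothetical, and only the heartbeat fields come from the description above.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Heartbeat:
    worker_id: str
    bundle: str
    machine_profile: str
    queue_depth: int
    loaded_models: List[str]
    bundle_config_hash: str
    received_at: float = 0.0   # set by the registry on arrival


class HealthRegistry:
    def __init__(self, fresh_for_s: float = 15.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fresh_for_s = fresh_for_s   # assumed freshness window
        self._clock = clock
        self._by_worker: Dict[str, Heartbeat] = {}

    def observe(self, hb: Heartbeat) -> None:
        """Record the latest heartbeat for a worker."""
        hb.received_at = self._clock()
        self._by_worker[hb.worker_id] = hb

    def fresh_workers(self, bundle: str, machine_profile: str) -> List[Heartbeat]:
        """Workers for this tuple whose last heartbeat is still fresh."""
        cutoff = self._clock() - self._fresh_for_s
        return [
            hb for hb in self._by_worker.values()
            if hb.bundle == bundle
            and hb.machine_profile == machine_profile
            and hb.received_at >= cutoff
        ]

    def needs_provisioning(self, bundle: str, machine_profile: str,
                           pinned_pool: Optional[str] = None) -> bool:
        """True when the 202 path applies: no fresh worker and no pinned pool."""
        if pinned_pool is not None:
            return False
        return not self.fresh_workers(bundle, machine_profile)
```

Injecting the clock keeps the freshness check deterministic in tests; a real registry would also evict stale entries rather than only filtering them.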

Local Diagnostics

For hand-run gateway processes that inspect a standalone sie-server /ws/status, list worker URLs explicitly:

sie-gateway serve \
  -w http://worker-1:8080 \
  -w http://worker-2:8080 \
  -w http://worker-3:8080

With queue-mode SIE server sidecar routing, the chart leaves gateway.healthMode empty and renders the routing-safe default, nats.

Resource Pools

Resource pools reserve dedicated worker pods for tenant isolation. Pool worker pods only serve requests for that pool.

Create a Pool

# Assumes SIEClient and Item are already imported from the SIE Python SDK
client = SIEClient("http://gateway:8080")

# Reserve 2 L4 workers for this tenant
client.create_pool("tenant-abc", {"l4": 2})

# Route requests to the pool
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello"),
    gpu="tenant-abc/l4"  # pool_name/gpu_type
)

# Check pool status
info = client.get_pool("tenant-abc")

# Cleanup
client.delete_pool("tenant-abc")

Pool Lifecycle

Pools are represented in Kubernetes ConfigMaps and Leases.
The SDK renews pool leases automatically in a background thread.
Pools expire after their TTL unless renewed.
The default pool is protected and cannot be deleted.
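The lifecycle rules above amount to a lease with a TTL plus one protected name. The following sketch models that behaviour in memory only; PoolLease, reap_expired, and delete_pool are hypothetical helpers, it does not touch Kubernetes, and treating the default pool as exempt from expiry is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List

PROTECTED_POOL = "default"


@dataclass
class PoolLease:
    name: str
    ttl_s: float
    renewed_at: float

    def expired(self, now: float) -> bool:
        """A lease expires once a full TTL has passed without renewal."""
        return now - self.renewed_at >= self.ttl_s

    def renew(self, now: float) -> None:
        self.renewed_at = now


def delete_pool(pools: Dict[str, PoolLease], name: str) -> None:
    """Delete a pool, refusing to remove the protected default pool."""
    if name == PROTECTED_POOL:
        raise PermissionError("the default pool is protected")
    pools.pop(name, None)


def reap_expired(pools: Dict[str, PoolLease], now: float) -> List[str]:
    """Remove pools whose lease has lapsed and return their names."""
    expired = [
        name for name, lease in pools.items()
        if name != PROTECTED_POOL and lease.expired(now)
    ]
    for name in expired:
        del pools[name]
    return expired
```

A client that renews well inside the TTL, for example every ttl_s / 3 seconds, tolerates a missed renewal or two before the pool is reclaimed.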

Config Read Surface

The gateway serves read-side config endpoints from its in-memory registry:

Endpoint	Purpose
`GET /v1/configs/models`	List models known to this gateway
`GET /v1/configs/models/{id}`	Return model YAML from the gateway registry
`GET /v1/configs/models/{id}/status`	Report per-replica config-hash readiness
`GET /v1/configs/bundles`	List known bundles and visible SIE server sidecar health counts
`GET /v1/configs/bundles/{id}`	Return bundle YAML
`POST /v1/configs/resolve`	Dry-run how a model, or an explicit bundle override, resolves to a bundle

The gateway is not a config write authority. POST /v1/configs/models is not registered on the gateway and returns 405 Method Not Allowed; send writes to sie-config.

Bootstrap and Recovery

On startup, the gateway:

Optionally loads filesystem seeds from SIE_BUNDLES_DIR and SIE_MODELS_DIR if an escape-hatch config map is mounted.
Reads GET /v1/configs/epoch to capture the authoritative epoch and bundle-set hash.
Fetches bundles from sie-config with GET /v1/configs/bundles and GET /v1/configs/bundles/{id}.
Fetches model state with GET /v1/configs/export.
Subscribes to sie.config.models._all for live deltas.
Polls GET /v1/configs/epoch every 30 seconds to catch missed deltas or bundle-set drift.

/readyz does not wait for sie-config. A fresh gateway can be ready before the first config bootstrap succeeds; during that window, typed requests may return 404 until the registry is populated.
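The interplay of export, live deltas, and epoch polling can be pictured as a small mirror object. This is a conceptual sketch only: the ConfigMirror class and the payload keys epoch, bundle_set_hash, models, model_id, and config are assumptions made for illustration, not the documented wire format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ConfigMirror:
    epoch: int = 0
    bundle_set_hash: str = ""
    models: Dict[str, dict] = field(default_factory=dict)

    def load_export(self, export: dict) -> None:
        """Replace local state with a full export (bootstrap or resync)."""
        self.models = dict(export["models"])
        self.epoch = export["epoch"]
        self.bundle_set_hash = export["bundle_set_hash"]

    def apply_delta(self, delta: dict) -> bool:
        """Apply a live delta; ignore it if it is stale or a duplicate."""
        if delta["epoch"] <= self.epoch:
            return False
        self.models[delta["model_id"]] = delta["config"]
        self.epoch = delta["epoch"]
        return True

    def needs_resync(self, remote_epoch: int, remote_bundle_set_hash: str) -> bool:
        """Polling check: a newer epoch or a different bundle set means drift."""
        return (remote_epoch > self.epoch
                or remote_bundle_set_hash != self.bundle_set_hash)
```

In this model the 30-second poll would call needs_resync with the values read from the epoch endpoint and trigger load_export when it returns True, which is how a missed delta gets repaired.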

Health & Status

The gateway aggregates SIE server sidecar health records:

Endpoint	Description
`GET /healthz`	Gateway liveness
`GET /readyz`	Gateway readiness; intentionally independent of `sie-config` reachability
`GET /health`	Cluster summary: worker count, GPU count, models loaded
`GET /v1/models`	Model list from the gateway registry
`WS /ws/cluster-status`	Real-time cluster metrics stream

Cluster Health Example

curl http://gateway:8080/health

{
  "status": "healthy",
  "worker_count": 3,
  "gpu_count": 3,
  "models_loaded": 12,
  "configured_gpu_types": ["l4", "a100-80gb"],
  "live_gpu_types": ["l4"]
}
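One practical use of this response is spotting machine profiles that are configured but currently have no live workers, which typically means they are scaled to zero. The helper below is a small sketch that parses the body shown above; missing_gpu_types is a hypothetical name, and only the two GPU-type fields from the example are read.

```python
import json
from typing import List


def missing_gpu_types(health_body: str) -> List[str]:
    """Return configured GPU types that have no live worker, in configured order."""
    health = json.loads(health_body)
    live = set(health.get("live_gpu_types", []))
    return [g for g in health.get("configured_gpu_types", []) if g not in live]


# Body copied from the example response above.
example = """
{
  "status": "healthy",
  "worker_count": 3,
  "gpu_count": 3,
  "models_loaded": 12,
  "configured_gpu_types": ["l4", "a100-80gb"],
  "live_gpu_types": ["l4"]
}
"""

print(missing_gpu_types(example))  # prints ['a100-80gb']
```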

Metrics

Important gateway metrics include:

Metric	Purpose
`sie_gateway_requests_total`	HTTP requests by endpoint, status, and machine profile
`sie_gateway_request_latency_seconds`	Gateway request latency
`sie_gateway_pending_demand`	KEDA scale-from-zero trigger by machine profile and bundle
`sie_gateway_worker_queue_depth`	Per-worker queue depth
`sie_gateway_config_epoch`	Highest config epoch applied on this gateway
`sie_gateway_config_bootstrap_degraded`	Whether bootstrap has been failing long enough to alert
`sie_gateway_config_deltas_total`	NATS config-delta processing outcomes
`sie_gateway_nats_connected`	Gateway NATS connection state
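When debugging scale-from-zero it can help to read sie_gateway_pending_demand directly from a metrics scrape. The sketch below parses Prometheus text exposition output with the standard library only. The label names machine_profile and bundle are assumptions inferred from the table above, and the parser is deliberately simplified: it does not handle label values that contain a closing brace.

```python
import re
from typing import Dict, Optional, Tuple

_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

Key = Tuple[Optional[str], Optional[str]]


def pending_demand(exposition_text: str) -> Dict[Key, float]:
    """Map (machine_profile, bundle) to the pending-demand gauge value."""
    out: Dict[Key, float] = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = _SAMPLE.match(line)
        if not m or m.group("name") != "sie_gateway_pending_demand":
            continue
        labels = dict(_LABEL.findall(m.group("labels") or ""))
        key = (labels.get("machine_profile"), labels.get("bundle"))
        out[key] = out.get(key, 0.0) + float(m.group("value"))
    return out
```

A non-zero value for a tuple with no live workers is the signal the KEDA trigger acts on, so this view pairs well with the cluster health summary.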

What’s Next

Scale-from-Zero - the 202 flow and cold start handling
Config API - runtime config writes and gateway readiness polling
Kubernetes in GCP - full deployment with the gateway
Monitoring - metrics and dashboards