Scale-from-Zero & Autoscaling

SIE clusters scale GPU workers to zero when idle and provision them on demand. This page explains the full lifecycle, what to expect from a cold start, and how to handle 202 responses.

When all workers are scaled to zero and a request arrives:

[Figure: Scale-from-zero request flow through Gateway, KEDA, GKE, NATS, and Worker]

Key point: The X-SIE-MACHINE-PROFILE header (or SDK gpu parameter) selects the worker pool when you need a specific machine profile. If it is omitted, the gateway can still resolve the model’s default route; scale-from-zero still returns 202 Accepted when capacity is not yet available.


Cold start from zero has three phases:

| Phase | Duration | What Happens |
| --- | --- | --- |
| Node provisioning | 2-5 min | GKE finds a GPU node (spot takes longer if scarce) |
| Container startup | 20-40s | Pull image, start process, health checks pass |
| Model loading | 10-120s | Download weights (if not cached) and load to GPU |

Total cold start: 3-7 minutes depending on model size and spot availability.
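As a sanity check, the phase ranges can be summed directly. The sketch below uses only the durations listed in the phase table; the names `PHASES` and `cold_start_bounds` are illustrative and not part of SIE.

```python
# Phase durations from the cold start table, as (min_seconds, max_seconds).
PHASES = {
    "node_provisioning": (2 * 60, 5 * 60),
    "container_startup": (20, 40),
    "model_loading": (10, 120),
}


def cold_start_bounds(phases):
    """Return (best_case_s, worst_case_s) by summing every phase."""
    best = sum(lo for lo, _ in phases.values())
    worst = sum(hi for _, hi in phases.values())
    return best, worst


best_s, worst_s = cold_start_bounds(PHASES)
print(f"best case:  {best_s}s ({best_s / 60:.1f} min)")    # 150s (2.5 min)
print(f"worst case: {worst_s}s ({worst_s / 60:.1f} min)")  # 460s (7.7 min)
```

The quoted 3-7 minutes is therefore a rounded range. If every phase runs at its slowest, the sum slightly exceeds 7 minutes, so a little headroom on client timeouts may be worth considering.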

Once a worker is warm, subsequent requests skip node provisioning and container startup. A model that is already in GPU memory is served immediately; any other model on that worker loads on demand from the local cache in 10-120s.


When the cluster is scaled to zero, HTTP requests receive a 202 Accepted response:

curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
-H "Content-Type: application/json" \
-H "X-SIE-MACHINE-PROFILE: l4" \
-d '{"items": [{"text": "Hello world"}]}'
# Response: 202 Accepted
# Headers: Retry-After: 120

Your HTTP client should retry after the Retry-After interval. Keep retrying for at least 7 minutes on a cold start.
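For clients that do not use the SDK, the retry rule above might be sketched as follows. This is a minimal illustration, not the SDK's implementation: `retry_while_provisioning` and its parameters are hypothetical names, and `send` stands for whatever function issues your HTTP request and returns a `(status_code, headers, body)` tuple.

```python
import time


def retry_while_provisioning(send, timeout_s=420.0, default_wait_s=10.0,
                             sleep=time.sleep, clock=time.monotonic):
    """Call send() until it returns something other than 202, or time out.

    send: zero-argument callable returning (status_code, headers, body).
    headers is expected to behave like a dict keyed by "Retry-After".
    """
    deadline = clock() + timeout_s
    while True:
        status, headers, body = send()
        if status != 202:
            return status, body
        # Honor Retry-After (seconds); fall back if missing or not numeric.
        try:
            wait_s = float(headers.get("Retry-After", default_wait_s))
        except (TypeError, ValueError):
            wait_s = default_wait_s
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("no capacity before the deadline")
        # Never sleep past the deadline, so one last attempt is made at the end.
        sleep(max(0.0, min(wait_s, remaining)))
```

`send` can wrap any HTTP client call, for example a function that POSTs to the encode endpoint and returns the response's status code, headers, and text. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.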

If the requested machine profile is not configured, you get 503:

# Unknown X-SIE-MACHINE-PROFILE → 503
curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
-H "X-SIE-MACHINE-PROFILE: h100" \
-H "Content-Type: application/json" \
-d '{"items": [{"text": "Hello world"}]}'
# Response: 503 Service Unavailable
# {"status": "gpu_not_configured", "gpu": "h100", "configured_gpu_types": ["l4", "a100-80gb"], "message": "GPU type 'h100' is not configured in this cluster."}
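Because the profile is missing from the cluster configuration, retrying this 503 is unlikely to help, so a client may want to tell it apart from a 202. The following sketch parses the response body shown above; the function name is hypothetical, and the field names are taken from that example payload only.

```python
import json


def configured_profiles_from_503(status_code, body_text):
    """Return the configured GPU types for a gpu_not_configured 503, else None."""
    if status_code != 503:
        return None
    try:
        payload = json.loads(body_text)
    except (TypeError, ValueError):
        # Body is missing or not JSON, so this is some other kind of 503.
        return None
    if not isinstance(payload, dict) or payload.get("status") != "gpu_not_configured":
        return None
    return list(payload.get("configured_gpu_types", []))
```

A caller could use the returned list to pick a profile that exists (here `l4` or `a100-80gb`) instead of retrying the same request.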

The SDK handles 202 retries automatically with wait_for_capacity=True:

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://sie.example.com", api_key="YOUR_KEY")

# Automatically retries 202s with exponential backoff
result = client.encode(
    "BAAI/bge-m3",
    Item(text="Hello world"),
    gpu="l4",
    wait_for_capacity=True,
    provision_timeout_s=420,  # 7 minutes for cold start
)

Each (machine_profile, bundle) combination has its own KEDA ScaledObject and scales independently.

| Bundle | Models Served | Example ScaledObject |
| --- | --- | --- |
| default | BGE-M3, E5, Stella, ColBERT, rerankers, GLiNER, GLiREL, GLiClass, Florence-2, Donut, and the rest of the standard catalog | l4-spot-default |
| sglang | Large 4B+ parameter LLM embedding models | a100-80gb-sglang |

What this means in practice: If you have encode, score, and extract working on the default bundle worker, but then call encode with a large SGLang-served model (e.g. gte-Qwen2-7B-instruct), a separate sglang bundle worker needs to scale up. This is a new cold start, so expect to wait up to another 7 minutes.

# This uses the default bundle worker (already warm)
client.encode("BAAI/bge-m3", Item(text="hello"), gpu="l4")

# This needs the sglang bundle worker (may trigger cold start)
client.encode(
    "Alibaba-NLP/gte-Qwen2-7B-instruct",
    Item(text="Tim Cook leads Apple."),
    gpu="a100-80gb",
    wait_for_capacity=True,
    provision_timeout_s=420,
)

KEDA uses Prometheus metrics to make scaling decisions:

| Metric | Purpose | Used For |
| --- | --- | --- |
| sie_gateway_pending_demand | Requests waiting for a worker type | Scale-from-zero activation |
| sie_gateway_worker_queue_depth | Items queued per worker | Scale-up (add more replicas) |
| sie_gateway_active_lease_gpus | GPUs reserved by active resource-pool leases | Keep leased pools provisioned |
| sie_gateway_rejected_requests_total | Gateway rejected-request rate | Scale when rejected traffic indicates pressure |
| sie_gateway_requests_total | Gateway request rate | Gateway Deployment autoscaling |

autoscaling:
  enabled: true
  pollingInterval: 15          # Check metrics every 15 seconds
  cooldownPeriod: 900          # 15 minutes before scaling to zero
  scaleDownStabilization: 300  # 5 minute stabilization window
  queueDepthThreshold: 10      # Scale up at 10 pending requests/pod
  queueDepthActivation: 2      # Activate from zero at 2 requests
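To make the two queue thresholds concrete, here is a rough model of the decisions they describe. It assumes the usual Kubernetes autoscaling rule of dividing the total metric by the per-pod target and rounding up, and it reads queueDepthActivation the way the config comment does (activate at 2 requests). The exact comparison KEDA applies may differ, and the function names and the `max_replicas` cap are hypothetical, not values from SIE.

```python
import math


def should_activate(pending_demand, activation=2):
    """Scale-from-zero: wake the pool once pending demand reaches the activation value.

    Assumption: ">=" follows the config comment; the real comparison may be strict.
    """
    return pending_demand >= activation


def replicas_for_queue(total_queue_depth, threshold=10, max_replicas=4):
    """Scale-up: aim for roughly `threshold` queued items per running pod.

    max_replicas is a hypothetical cap used only for this illustration.
    Scaling back to zero is governed by the cooldown, so the floor here is 1.
    """
    wanted = math.ceil(total_queue_depth / threshold)
    return min(max_replicas, max(1, wanted))


print(should_activate(1), should_activate(2))          # False True
print(replicas_for_queue(10), replicas_for_queue(11))  # 1 2
```

Under this model, a queue of 11 items across the pool is enough to request a second replica, while a single pending request does not wake a pool that is scaled to zero.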

After no requests arrive for the cooldownPeriod (default: 15 minutes), KEDA scales workers back to zero. The next request triggers a full cold start again.

  • Consistent traffic: A lower cooldown (300s) is enough, because steady requests keep workers warm on their own
  • Bursty traffic: Higher cooldown (900s) to avoid repeated cold starts
  • Cost-sensitive: Default 900s balances cost and responsiveness
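The effect of the cooldown on a given traffic pattern can be estimated with a small simulation. This is a simplified sketch: it treats a worker as warm from each request until `cooldown_s` seconds later, and ignores the cold start duration, the polling interval, and the stabilization window. The function name is illustrative.

```python
def count_cold_starts(arrival_times_s, cooldown_s):
    """Count requests that arrive after workers have scaled back to zero."""
    cold = 0
    last = None
    for t in sorted(arrival_times_s):
        # The first request is always cold; later ones are cold only if the
        # gap since the previous request exceeded the cooldown.
        if last is None or t - last > cooldown_s:
            cold += 1
        last = t
    return cold


bursts = [0, 600, 1200, 1800]  # one burst every 10 minutes
print(count_cold_starts(bursts, cooldown_s=300))  # 4
print(count_cold_starts(bursts, cooldown_s=900))  # 1
```

With bursts 10 minutes apart, a 300s cooldown pays for a cold start on every burst, while the default 900s keeps the worker warm after the first one.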

The X-SIE-MACHINE-PROFILE header (HTTP) or gpu parameter (SDK) determines which worker pool receives the request.

| Profile | GPU | Typical Use |
| --- | --- | --- |
| l4 | NVIDIA L4 (24GB) | Standard inference, best price/performance |
| l4-spot | NVIDIA L4 (spot) | 60-70% cheaper, may be preempted |
| a100-40gb | NVIDIA A100 (40GB) | Large models, high throughput |
| a100-80gb | NVIDIA A100 (80GB) | Very large models (7B+ params) |

Spot instances offer significant cost savings but may take longer to provision if capacity is scarce.


Cause (503 with status gpu_not_configured): The request pins a machine profile that is not configured in the cluster.

Fix: Use one of the configured machine profiles:

curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
-H "X-SIE-MACHINE-PROFILE: l4" \
-H "Content-Type: application/json" \
-d '{"items": [{"text": "Hello world"}]}'

Or omit gpu and let the gateway resolve the model’s default route:

client.encode("BAAI/bge-m3", Item(text="hello"))

Possible causes when requests keep returning 202 or time out waiting for capacity:

  • Timeout too short - Cold starts can take up to 7 minutes. Use provision_timeout_s=420 in the SDK
  • Spot GPU unavailable - Try a different machine profile (e.g., l4 instead of l4-spot)
  • KEDA not configured - Check that KEDA is installed and ScaledObjects exist: kubectl get scaledobjects -n sie
  • Prometheus down - KEDA needs Prometheus for metrics. Check: kubectl get pods -n monitoring
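The suggestion to try a different machine profile can be automated with a small fallback helper. This is a sketch under stated assumptions: `first_profile_with_capacity` is a hypothetical name, and because this page does not say which exception the SDK raises when provision_timeout_s expires, the exception types are passed in as a parameter (defaulting to the built-in `TimeoutError`).

```python
def first_profile_with_capacity(profiles, attempt, capacity_errors=(TimeoutError,)):
    """Try each machine profile in order; return (profile, result) for the first success.

    attempt: callable taking a profile name and returning a result, or raising
    one of capacity_errors when no capacity arrived in time (an assumption).
    """
    last_error = None
    for profile in profiles:
        try:
            return profile, attempt(profile)
        except capacity_errors as exc:
            # No capacity in time on this profile; fall through to the next one.
            last_error = exc
    raise RuntimeError(f"no capacity on any of: {', '.join(profiles)}") from last_error
```

In practice, `attempt` could wrap a `client.encode(...)` call that passes the profile as `gpu` together with wait_for_capacity=True, with the profile list ordered from cheapest to most available, such as `["l4-spot", "l4"]`.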

Workers scale up then immediately scale down

Cause: Requests stopped before the worker became ready. KEDA sees demand drop to 0 and begins cooldown.

Fix: Keep sending requests (or use the SDK with wait_for_capacity=True) for the full cold start duration. The SDK handles this automatically with retry logic.

Models from different bundles not available

Cause: Each bundle runs in a separate worker. Your standard models (default bundle) may be warm, but a large LLM embedding model (sglang bundle) needs its own worker to scale up.

Fix: Send requests with wait_for_capacity=True and a sufficient timeout. The target bundle’s worker will scale up independently.

