---
title: Scale-from-Zero & Autoscaling
description: How KEDA autoscaling works, cold start expectations, and troubleshooting the 202 flow.
canonical_url: https://superlinked.com/docs/deployment/autoscaling
last_updated: 2026-05-19
---

SIE clusters scale GPU workers to zero when idle and provision them on demand. This page explains the full lifecycle, cold start expectations, and how to handle `202` responses.

## How Scale-from-Zero Works

Source: [deploy/helm/sie-cluster/values.yaml](https://github.com/superlinked/sie/blob/main/deploy/helm/sie-cluster/values.yaml)

When all workers are scaled to zero and a request arrives:

![Scale-from-zero request flow through Gateway, KEDA, GKE, NATS, and Worker](/diagrams/autoscaling-flow.svg)

**Key point:** The `X-SIE-MACHINE-PROFILE` header (or SDK `gpu` parameter) selects the worker pool when you need a specific machine profile. If it is omitted, the gateway can still resolve the model's default route; scale-from-zero still returns `202 Accepted` when capacity is not yet available.

---

## Cold Start Timeline

Source: [deploy/helm/sie-cluster/README.md](https://github.com/superlinked/sie/blob/main/deploy/helm/sie-cluster/README.md)

Cold start from zero has three phases:

| Phase | Duration | What Happens |
|-------|----------|--------------|
| Node provisioning | 2-5 min | GKE provisions a GPU node (spot capacity takes longer when scarce) |
| Container startup | 20-40s | Pull image, start process, health checks pass |
| Model loading | 10-120s | Download weights (if not cached) and load to GPU |

**Total cold start: 3-7 minutes** depending on model size and spot availability.

Once a worker is warm, requests for any model on that worker skip node provisioning and container startup. A model already in GPU memory responds immediately; a model that is not yet loaded is loaded on demand from the local cache, which takes 10-120s.
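
As a quick check on the total, the per-phase ranges from the table can simply be added up. The sketch below does only that arithmetic; the `PHASES` dictionary and `cold_start_bounds` helper are illustrative names, not part of SIE.

```python
# Phase bounds in seconds, copied from the cold start table above.
PHASES = {
    "node_provisioning": (120, 300),  # 2-5 min
    "container_startup": (20, 40),    # 20-40s
    "model_loading": (10, 120),       # 10-120s
}


def cold_start_bounds(phases=PHASES):
    """Return (best_case_s, worst_case_s) by summing the per-phase bounds."""
    best = sum(low for low, _ in phases.values())
    worst = sum(high for _, high in phases.values())
    return best, worst


best, worst = cold_start_bounds()
print(f"best case:  {best}s ({best / 60:.1f} min)")
print(f"worst case: {worst}s ({worst / 60:.1f} min)")
```

The sums come to 150s and 460s, which the page rounds to 3-7 minutes. If you expect both scarce spot capacity and uncached weights, a client timeout slightly above 420s leaves some headroom.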

---

## The 202 Flow

### HTTP Clients

When the cluster is scaled to zero, HTTP requests receive a `202 Accepted` response:

```bash
curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -d '{"items": [{"text": "Hello world"}]}'

# Response: 202 Accepted
# Headers: Retry-After: 120
```

Your HTTP client should retry after the `Retry-After` interval. Keep retrying for at least 7 minutes on a cold start.
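
For clients that cannot use the SDK, the retry loop might look like the sketch below. It uses only the Python standard library. The helper names (`post_with_202_retry`, `make_sender`), the 30s fallback wait, and the assumption that `Retry-After` carries a number of seconds are illustrative choices, not documented SIE behavior.

```python
import time
import urllib.error
import urllib.request


def post_with_202_retry(send, deadline_s=420.0, default_wait_s=30.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Call send() until it returns a non-202 status or the deadline passes.

    send() must return (status_code, headers, body) where headers is a dict
    with lower-cased keys.
    """
    start = clock()
    while True:
        status, headers, body = send()
        if status != 202:
            return status, body
        try:
            # Assumes Retry-After is given in seconds; an HTTP-date value
            # falls back to default_wait_s.
            wait_s = float(headers.get("retry-after", default_wait_s))
        except ValueError:
            wait_s = default_wait_s
        remaining = deadline_s - (clock() - start)
        if remaining <= 0:
            raise TimeoutError(f"still 202 after {deadline_s:.0f}s")
        sleep(min(wait_s, remaining))


def make_sender(url, payload, profile="l4"):
    """Build a send() callable that POSTs payload (bytes) with urllib."""
    def send():
        req = urllib.request.Request(
            url,
            data=payload,
            method="POST",
            headers={
                "Content-Type": "application/json",
                "X-SIE-MACHINE-PROFILE": profile,
            },
        )
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                headers = {k.lower(): v for k, v in resp.headers.items()}
                return resp.status, headers, resp.read()
        except urllib.error.HTTPError as err:
            # 4xx/5xx responses are surfaced to the caller, not retried here.
            headers = {k.lower(): v for k, v in err.headers.items()}
            return err.code, headers, err.read()
    return send
```

To use it, pass `make_sender("http://sie.example.com/v1/encode/BAAI/bge-m3", b'{"items": [{"text": "Hello world"}]}')` to `post_with_202_retry`. Injecting `sleep` and `clock` keeps the loop testable without real waiting.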

**If the requested machine profile is not configured, you get 503:**

```bash
# Unknown X-SIE-MACHINE-PROFILE → 503
curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "X-SIE-MACHINE-PROFILE: h100" \
  -H "Content-Type: application/json" \
  -d '{"items": [{"text": "Hello world"}]}'

# Response: 503 Service Unavailable
# {"status": "gpu_not_configured", "gpu": "h100", "configured_gpu_types": ["l4", "a100-80gb"], "message": "GPU type 'h100' is not configured in this cluster."}
```
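
Because the 503 body lists the configured profiles, a client can choose a fallback instead of failing outright. The following is a minimal sketch that assumes the response body has exactly the shape shown above; `pick_fallback_profile` is a hypothetical helper, not an SDK function.

```python
import json


def pick_fallback_profile(body, preferences):
    """Return the first preferred profile the cluster reports as configured.

    body is the raw 503 response body (bytes or str). Returns None when the
    body is not a gpu_not_configured error or no preference is available.
    """
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        return None
    if not isinstance(payload, dict):
        return None
    if payload.get("status") != "gpu_not_configured":
        return None
    configured = payload.get("configured_gpu_types") or []
    for profile in preferences:
        if profile in configured:
            return profile
    return None
```

A caller would retry the request with the returned profile, or omit the header entirely when the result is `None`.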

### SDK Clients (Recommended)

The SDK handles 202 retries automatically with `wait_for_capacity=True`:

#### Python

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://sie.example.com", api_key="YOUR_KEY")

# Automatically retries 202s with exponential backoff
result = client.encode(
    "BAAI/bge-m3",
    Item(text="Hello world"),
    gpu="l4",
    wait_for_capacity=True,
    provision_timeout_s=420,  # 7 minutes for cold start
)
```

#### TypeScript

```typescript
import { SIEClient } from "@superlinked/sie-sdk";

const client = new SIEClient("http://sie.example.com", {
  apiKey: "YOUR_KEY",
});

// Automatically retries 202s with exponential backoff
const result = await client.encode(
  "BAAI/bge-m3",
  { text: "Hello world" },
  {
    gpu: "l4",
    waitForCapacity: true,
    provisionTimeout: 420000, // 7 minutes for cold start (milliseconds)
  }
);
```

---

## Per-Bundle Scaling

Source: [deploy/helm/sie-cluster/values.yaml](https://github.com/superlinked/sie/blob/main/deploy/helm/sie-cluster/values.yaml)

Each `(machine_profile, bundle)` combination has its own KEDA ScaledObject and scales independently.

| Bundle | Models Served | Example ScaledObject |
|--------|--------------|---------------------|
| `default` | BGE-M3, E5, Stella, ColBERT, rerankers, GLiNER, GLiREL, GLiClass, Florence-2, Donut, and the rest of the standard catalog | `l4-spot-default` |
| `sglang` | Large 4B+ parameter LLM embedding models | `a100-80gb-sglang` |

**What this means in practice:** If you have `encode`, `score`, and `extract` working on the `default` bundle worker, but then call `encode` with a large SGLang-served model (e.g. `gte-Qwen2-7B-instruct`), a *separate* `sglang` bundle worker needs to scale up. This is a new cold start, so expect another 3-7 minutes.

```python
# This uses the default bundle worker (already warm)
client.encode("BAAI/bge-m3", Item(text="hello"), gpu="l4")

# This needs the sglang bundle worker (may trigger cold start)
client.encode(
    "Alibaba-NLP/gte-Qwen2-7B-instruct",
    Item(text="Tim Cook leads Apple."),
    gpu="a100-80gb",
    wait_for_capacity=True,
    provision_timeout_s=420,
)
```
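
One way to reason about this on the client side is to treat each `(machine_profile, bundle)` pair as its own pool and ask whether that pool is known to be warm. The sketch below is client-side bookkeeping only: `MODEL_BUNDLE`, `warm_pools`, and `may_cold_start` are hypothetical names, and the mapping covers just the two models used in the examples above.

```python
# Model-to-bundle mapping for the two example models on this page.
# Extend it for the models you actually serve.
MODEL_BUNDLE = {
    "BAAI/bge-m3": "default",
    "Alibaba-NLP/gte-Qwen2-7B-instruct": "sglang",
}


def may_cold_start(model, profile, warm_pools, model_bundle=MODEL_BUNDLE):
    """Return True if the (profile, bundle) pool for model is not known warm.

    warm_pools is a set of (machine_profile, bundle) tuples the caller has
    recently received successful responses from.
    """
    bundle = model_bundle[model]  # KeyError for models missing from the map
    return (profile, bundle) not in warm_pools


warm_pools = {("l4", "default")}
print(may_cold_start("BAAI/bge-m3", "l4", warm_pools))
print(may_cold_start("Alibaba-NLP/gte-Qwen2-7B-instruct", "a100-80gb", warm_pools))
```

A caller could use the result to decide when to pass `wait_for_capacity=True` with a long `provision_timeout_s`.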

---

## KEDA Scaling Metrics

Source: [deploy/helm/sie-cluster/values.yaml](https://github.com/superlinked/sie/blob/main/deploy/helm/sie-cluster/values.yaml)

KEDA uses Prometheus metrics to make scaling decisions:

| Metric | Purpose | Used For |
|--------|---------|----------|
| `sie_gateway_pending_demand` | Requests waiting for a worker type | Scale-from-zero activation |
| `sie_gateway_worker_queue_depth` | Items queued per worker | Scale-up (add more replicas) |
| `sie_gateway_active_lease_gpus` | GPUs reserved by active resource-pool leases | Keep leased pools provisioned |
| `sie_gateway_rejected_requests_total` | Gateway rejected-request rate | Scale when rejected traffic indicates pressure |
| `sie_gateway_requests_total` | Gateway request rate | Gateway Deployment autoscaling |
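
To see what KEDA sees, you can read these metrics through the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The sketch below only builds the query URL and sums the samples in a response; the Prometheus base URL and the label names in the sample payload are assumptions, so check your own deployment for the real values.

```python
import json
import urllib.parse


def instant_query_url(prometheus_base, metric):
    """Build a Prometheus instant-query URL for a bare metric name."""
    query = urllib.parse.urlencode({"query": metric})
    return f"{prometheus_base.rstrip('/')}/api/v1/query?{query}"


def total_from_response(body):
    """Sum the sample values of an instant-vector query response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each sample is {"metric": {...labels...}, "value": [timestamp, "value"]}.
    return sum(float(sample["value"][1]) for sample in payload["data"]["result"])


# Sample payload in the standard Prometheus response shape.
# The label names here are made up for illustration.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pool": "l4-default"}, "value": [1700000000, "3"]},
            {"metric": {"pool": "a100-80gb-sglang"}, "value": [1700000000, "0"]},
        ],
    },
})

print(instant_query_url("http://prometheus.example:9090", "sie_gateway_pending_demand"))
print(total_from_response(SAMPLE))
```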

### Configuration

```yaml
autoscaling:
  enabled: true
  pollingInterval: 15          # Check metrics every 15 seconds
  cooldownPeriod: 900          # 15 minutes before scaling to zero
  scaleDownStabilization: 300  # 5 minute stabilization window
  queueDepthThreshold: 10      # Scale up at 10 pending requests/pod
  queueDepthActivation: 2      # Activate from zero at 2 requests
```
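
To build intuition for `queueDepthThreshold` and `scaleDownStabilization`, the sketch below mimics the general Kubernetes HPA arithmetic that KEDA relies on once at least one replica exists: an average-value target yields `ceil(total / target)` replicas, and scale-down uses the highest recommendation inside the stabilization window. How the chart wires these values into its ScaledObjects is an assumption here, and `max_replicas=4` is a made-up limit.

```python
import math


def raw_recommendation(total_queue_depth, threshold_per_pod=10, max_replicas=4):
    """Replicas needed so that queue depth per pod is at most the threshold.

    Never returns 0: scaling to zero is handled separately by KEDA after
    cooldownPeriod, not by this calculation.
    """
    needed = math.ceil(total_queue_depth / threshold_per_pod)
    return min(max_replicas, max(1, needed))


def scale_down_target(history, window_s=300):
    """Highest recommendation seen within the stabilization window.

    history is a list of (age_seconds, recommended_replicas) pairs.
    Returns None when no recommendation falls inside the window.
    """
    recent = [replicas for age, replicas in history if age <= window_s]
    return max(recent) if recent else None


print(raw_recommendation(25))
print(scale_down_target([(400, 4), (200, 3), (15, 1)]))
```

With 25 queued items and a threshold of 10, three replicas are recommended. In the second call, the recommendation of 4 is older than 300s and is ignored, so the workload holds at 3 replicas even though the latest recommendation is 1.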

### Cooldown Behavior

After no requests arrive for the `cooldownPeriod` (default: 15 minutes), KEDA scales workers back to zero. The next request triggers a full cold start again.

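- **Consistent traffic**: A lower cooldown (300s) is usually safe, because steady requests keep workers warm on their own
- **Bursty traffic**: A higher cooldown (900s) keeps workers alive between bursts and avoids repeated cold starts
- **Cost-sensitive**: The default 900s balances idle GPU cost against responsiveness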
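
To compare cooldown settings against your own traffic, you can replay request timestamps and count how many of them would land on a cluster that has already scaled to zero. This is a simplified model that ignores `pollingInterval` and the duration of the cold start itself; `count_cold_starts` is an illustrative helper.

```python
def count_cold_starts(request_times_s, cooldown_s=900):
    """Count requests that arrive after the workers have scaled to zero.

    request_times_s is an iterable of request timestamps in seconds.
    The first request always counts as a cold start.
    """
    cold_starts = 0
    last = None
    for t in sorted(request_times_s):
        if last is None or t - last > cooldown_s:
            cold_starts += 1
        last = t
    return cold_starts


# One request every 10 minutes for half an hour.
every_ten_minutes = [0, 600, 1200, 1800]
print(count_cold_starts(every_ten_minutes, cooldown_s=900))
print(count_cold_starts(every_ten_minutes, cooldown_s=300))
```

With one request every 10 minutes, a 900s cooldown yields a single cold start, while a 300s cooldown makes every request a cold start.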

---

## Machine Profiles

The `X-SIE-MACHINE-PROFILE` header (HTTP) or `gpu` parameter (SDK) determines which worker pool receives the request.

| Profile | GPU | Typical Use |
|---------|-----|-------------|
| `l4` | NVIDIA L4 (24GB) | Standard inference, best price/performance |
| `l4-spot` | NVIDIA L4 (spot) | 60-70% cheaper, may be preempted |
| `a100-40gb` | NVIDIA A100 (40GB) | Large models, high throughput |
| `a100-80gb` | NVIDIA A100 (80GB) | Very large models (7B+ params) |

Spot instances offer significant cost savings but may take longer to provision if capacity is scarce.

---

## Troubleshooting

### 503 for unconfigured machine profile

**Cause:** The request pins a machine profile that is not configured in the cluster.

**Fix:** Use one of the configured machine profiles:

```bash
curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -H "Content-Type: application/json" \
  -d '{"items": [{"text": "Hello world"}]}'
```

Or omit `gpu` and let the gateway resolve the model's default route:

```python
client.encode("BAAI/bge-m3", Item(text="hello"))
```

### 202 responses that never resolve

**Possible causes:**
- **Timeout too short** - Cold starts take 3-7 minutes. Use `provision_timeout_s=420` in the SDK
- **Spot GPU unavailable** - Try a different machine profile (e.g., `l4` instead of `l4-spot`)
- **KEDA not configured** - Check that KEDA is installed and ScaledObjects exist: `kubectl get scaledobjects -n sie`
- **Prometheus down** - KEDA needs Prometheus for metrics. Check: `kubectl get pods -n monitoring`

### Workers scale up then immediately scale down

**Cause:** Requests stopped before the worker became ready. KEDA sees demand drop to 0 and begins cooldown.

**Fix:** Keep sending requests for the full cold start duration. The SDK does this automatically when you set `wait_for_capacity=True`.

### Models from different bundles not available

**Cause:** Each bundle runs in a separate worker. Your standard models (`default` bundle) may be warm, but a large LLM embedding model (`sglang` bundle) needs its own worker to scale up.

**Fix:** Send requests with `wait_for_capacity=True` and a sufficient timeout. The target bundle's worker will scale up independently.

---

## What's Next

- [Kubernetes in GCP](/docs/deployment/cloud-gcp/) - full GKE deployment setup
- [Monitoring](/docs/deployment/monitoring/) - metrics for tracking autoscaling behavior
- [Bundles](/docs/engine/bundles/) - understanding dependency isolation
