# Kubernetes in GCP
Deploy SIE to GKE with GPU node pools, KEDA autoscaling, and Terraform automation.
## Architecture

SIE runs as a router-worker architecture on Kubernetes:
Components:
- Router - Stateless proxy that routes requests to GPU-specific worker pools
- Worker Pools - StatefulSets grouped by GPU type (L4, A100-40GB, A100-80GB)
- KEDA - Scales worker pools from zero based on queue depth metrics
- Prometheus - Provides metrics for autoscaling decisions
## Router

The router is a stateless FastAPI application that handles GPU-aware routing:
| Feature | Description |
|---|---|
| GPU Routing | Routes requests to the appropriate GPU pool via the `X-SIE-MACHINE-PROFILE` header |
| Pool Routing | Supports tenant isolation via the `X-SIE-Pool` header |
| Model Affinity | Prefers workers that already have the requested model loaded |
| Load Balancing | Distributes requests across healthy workers |
| 202 Responses | Returns `Retry-After` while GPU capacity is provisioning |
The router runs as a Deployment with 2+ replicas for high availability.
```yaml
router:
  replicas: 2
  resources:
    requests:
      cpu: "500m"
      memory: "512Mi"
    limits:
      cpu: "2"
      memory: "2Gi"
```

## Worker Pools
Each GPU type runs as a separate StatefulSet with a dedicated model-cache volume.
| Pool | GPU | VRAM | Use Case |
|---|---|---|---|
| `l4` | NVIDIA L4 | 24GB | Standard inference, best price/performance |
| `a100-40gb` | NVIDIA A100 | 40GB | Large models, high throughput |
| `a100-80gb` | NVIDIA A100 | 80GB | Very large models (7B+ parameters) |
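The table above can be read as a simple sizing heuristic. A sketch of that heuristic as a helper (`pick_gpu` is not part of the SDK, and the 3B/7B cutoffs are illustrative assumptions — real VRAM needs depend on dtype, context length, and batch size):

```python
def pick_gpu(param_count_b: float) -> str:
    """Map a model's parameter count (in billions) to a pool name.

    Thresholds are illustrative assumptions, not SDK behavior.
    """
    if param_count_b >= 7:
        return "a100-80gb"   # very large models (7B+ parameters)
    if param_count_b >= 3:
        return "a100-40gb"   # large models, high throughput
    return "l4"              # standard inference, best price/performance
```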
Worker configuration:
```yaml
workers:
  pools:
    l4:
      enabled: true
      minReplicas: 0   # Scale to zero when idle
      maxReplicas: 10
      gpuType: l4
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      gpu:
        count: 1
        product: NVIDIA-L4
      resources:
        requests:
          cpu: "4"
          memory: "16Gi"
```

Workers use a 300Gi `emptyDir` volume for the model cache. Models load on first request.
## GPU Selection

Specify the target GPU type using the `X-SIE-MACHINE-PROFILE` header or the SDK `gpu` parameter.
### HTTP Header

```bash
curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -d '{"items": [{"text": "Hello world"}]}'
```

### SDK Parameter
```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://sie.example.com")

# Route to the L4 pool
result = client.encode(
    "BAAI/bge-m3",
    Item(text="Hello world"),
    gpu="l4",
)

# Route to the A100 pool for large models
result = client.encode(
    "intfloat/e5-mistral-7b-instruct",
    Item(text="Hello world"),
    gpu="a100-40gb",
)
```

### Available GPU Types
| GPU Type | Header Value | Machine Type |
|---|---|---|
| NVIDIA L4 | `l4` | `g2-standard-8` |
| NVIDIA A100 40GB | `a100-40gb` | `a2-highgpu-1g` |
| NVIDIA A100 80GB | `a100-80gb` | `a2-ultragpu-1g` |
## Resource Pools

Resource pools provide tenant isolation by reserving dedicated workers.
### Create a Pool via SDK

Create a pool via the SDK (the pool itself is provisioned lazily, on the first request that targets it):
```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

# Client with dedicated pool (2 L4 workers reserved)
client = SIEClient("http://sie.example.com")
client.create_pool("tenant-abc", {"l4": 2})

# First request creates the pool, subsequent requests reuse it
result = client.encode(
    "BAAI/bge-m3",
    Item(text="Hello world"),
    gpu="tenant-abc/l4",  # pool_name/gpu_type
)

# Check pool status
info = client.get_pool("tenant-abc")
print(f"Pool {info['name']}: {info['status']['state']}")

# Explicit cleanup (optional - pools are GC'd after inactivity)
client.delete_pool("tenant-abc")
```

### Route to Pool via HTTP
Use the `X-SIE-Pool` header:

```bash
curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -H "X-SIE-Pool: tenant-abc" \
  -d '{"items": [{"text": "Hello world"}]}'
```

The SDK handles lease renewal automatically. Pools are garbage-collected after inactivity.
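The SDK's `gpu` argument and these two headers carry the same routing information. A sketch of the mapping (a hypothetical helper for illustration, not the SDK's actual code):

```python
def gpu_to_headers(gpu: str) -> dict:
    """Translate an SDK-style gpu spec into router headers.

    'l4'            -> machine profile only (shared pool)
    'tenant-abc/l4' -> dedicated pool + machine profile
    """
    if "/" in gpu:
        pool, profile = gpu.split("/", 1)  # pool_name/gpu_type
        return {"X-SIE-Pool": pool, "X-SIE-MACHINE-PROFILE": profile}
    return {"X-SIE-MACHINE-PROFILE": gpu}
```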
## KEDA Autoscaling

KEDA scales worker pools based on queue-depth metrics from Prometheus.
### Scale-from-Zero

When no workers are running and a request arrives:
1. Router returns `202 Accepted` with a `Retry-After: 120` header
2. Router records a pending-demand metric
3. KEDA detects queue depth above the activation threshold
4. GKE provisions a GPU node (60-120 seconds)
5. Worker pod starts and registers with the router
6. Client retries and the request succeeds
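Clients that call the HTTP API directly must implement the retry half of this flow themselves (the SDK's `wait_for_capacity` option does it for you). A minimal sketch with the HTTP call injected as a callable — `encode_with_retry` is illustrative, not part of the SDK:

```python
import time

def encode_with_retry(post, max_wait_s=600):
    """Poll until GPU capacity is available.

    `post` is any zero-argument callable returning
    (status_code, headers, body) -- e.g. a thin wrapper around
    your HTTP client. Sketch only; not part of the SIE SDK.
    """
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        status, headers, body = post()
        if status != 202:
            return body
        # Honor the router's Retry-After hint, capped at the deadline
        hint = int(headers.get("Retry-After", "120"))
        time.sleep(min(hint, max(0.0, deadline - time.monotonic())))
    raise TimeoutError("GPU capacity did not become available in time")
```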
### Configuration

```yaml
autoscaling:
  enabled: true
  prometheusAddress: http://prometheus-operated.monitoring.svc:9090
  pollingInterval: 15          # Check metrics every 15s
  cooldownPeriod: 900          # Wait 15 min before scaling to zero
  scaleDownStabilization: 300  # 5 min stabilization window
  queueDepthThreshold: 10      # Scale up at 10 pending requests/pod
  queueDepthActivation: 2      # Activate from zero at 2 requests
  fallbackReplicas: 2          # Fallback if Prometheus unavailable
```

## Cold Start Expectations
When scaling from zero, expect these timelines:
| Phase | Duration | What Happens |
|---|---|---|
| Node provisioning | 2-5 min | GKE finds a GPU node (spot may take longer) |
| Container startup | 20-40s | Pull image, start process |
| Model loading | 10-120s | Load weights to GPU (from cache or HuggingFace) |
Total: 3-7 minutes from first request to first response. See Scale-from-Zero for the full flow and troubleshooting.
## Cost Optimization

GPU nodes scale to zero during idle periods. Configure the cooldown based on your traffic patterns:
- Consistent traffic: Lower cooldown (300s) for responsive scaling
- Bursty traffic: Higher cooldown (900s) to avoid thrashing
- Dev/test: Use spot instances for 60-70% cost savings
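For example, tuning the cooldown in the Helm values uses the same `autoscaling` keys shown in the KEDA configuration section:

```yaml
autoscaling:
  enabled: true
  cooldownPeriod: 300   # consistent traffic; raise to 900 for bursty workloads
```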
## Terraform Setup

The `examples/quickstart` directory provisions a complete GKE cluster with an L4 spot GPU pool via the published `superlinked/sie/google` Terraform registry module.
### Prerequisites

- GCP project with billing enabled
- GPU quota for `nvidia-l4` in your region (check with `gcloud compute regions describe REGION`)
- Required APIs enabled: `container.googleapis.com`, `compute.googleapis.com`, `artifactregistry.googleapis.com`, `iam.googleapis.com`
- Authenticated: `gcloud auth application-default login`
### Initialize

```bash
cd deploy/terraform/gcp/examples/quickstart

# Set project ID
export TF_VAR_project_id="your-project-id"

# Initialize Terraform
terraform init
```

### Plan and Apply
```bash
# Review changes
terraform plan

# Deploy cluster (15-20 minutes)
terraform apply
```

### Configure kubectl
```bash
# Get credentials
$(terraform output -raw kubectl_command)

# Verify cluster
kubectl get nodes
```

### Variables
Key configuration options for the `superlinked/sie/google` module:

| Variable | Default | Description |
|---|---|---|
| `project_id` | (required) | GCP project ID |
| `region` | (required) | GKE cluster region |
| `cluster_name` | `sie-cluster` | Name of the GKE cluster |
| `cpu_node_pool` | `e2-standard-4` | System/CPU pool sizing |
| `gpu_node_pools` | L4 pool | List of GPU node pool configurations |
| `enable_workload_identity` | `true` | Bind a GCP SA to a K8s SA for secure GCS access |
| `enable_node_auto_provisioning` | `false` | Let GKE provision node pools automatically |
| `create_artifact_registry` | `true` | Provision an Artifact Registry for custom images |
| `enable_cloud_logging` | `true` | Stream cluster logs to Cloud Logging |
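A minimal invocation that accepts the defaults above — only the required variables set, with placeholder values:

```hcl
module "sie_gke" {
  source  = "superlinked/sie/google"
  version = "0.1.10"

  project_id = "my-project"   # required
  region     = "us-central1"  # required
}
```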
### Example: Production Multi-GPU

```hcl
module "sie_gke" {
  source  = "superlinked/sie/google"
  version = "0.1.10"

  project_id   = "my-project"
  region       = "us-central1"
  cluster_name = "sie-prod"

  gpu_node_pools = [
    {
      name           = "l4-pool"
      machine_type   = "g2-standard-8"
      gpu_type       = "nvidia-l4"
      gpu_count      = 1
      min_node_count = 1   # Keep 1 warm
      max_node_count = 20
      spot           = false
    },
    {
      name           = "a100-pool"
      machine_type   = "a2-highgpu-1g"
      gpu_type       = "nvidia-tesla-a100"
      gpu_count      = 1
      min_node_count = 0
      max_node_count = 10
      spot           = true
    }
  ]

  enable_workload_identity = true
  sie_namespace            = "sie"
  sie_service_account_name = "sie-server"
}
```

## Helm Installation
Deploy SIE to an existing GKE cluster using Helm. The chart bundles KEDA, kube-prometheus-stack, and DCGM Exporter as sub-chart dependencies — no manual installs required.
### Prerequisites

- GKE cluster with GPU node pools (the Terraform setup above creates this)
- `HF_TOKEN` exported if you need gated models. Optional for the `BAAI/bge-m3` smoke test — in that case, omit both `--set hfToken.create=true` and `--set hfToken.value=...` entirely (leaving `HF_TOKEN` unset with the flags present creates an empty-token secret that will fail later on any gated-model request).
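One way to keep a single install command for both cases is to build the token flags conditionally, so an unset `HF_TOKEN` never produces an empty secret (a shell sketch; `HF_FLAGS` is our own variable, not a chart convention):

```bash
# Pass the hfToken flags only when HF_TOKEN is actually set
HF_FLAGS=""
if [ -n "${HF_TOKEN:-}" ]; then
  HF_FLAGS="--set hfToken.create=true --set hfToken.value=$HF_TOKEN"
fi
# Append $HF_FLAGS (possibly empty) to the helm upgrade command
echo "extra helm flags: $HF_FLAGS"
```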
### Install

The quickstart reads the Workload Identity service account from the Terraform output and wires it into the chart via `--set`:

```bash
# Bind the K8s SA to the GCP SA for Workload Identity
WI_SA=$(terraform output -raw workload_identity_sa)

helm upgrade --install sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --version 0.1.10 \
  -f helm-values.yaml \
  -n sie --create-namespace \
  --set "serviceAccount.annotations.iam\.gke\.io/gcp-service-account=$WI_SA" \
  --set hfToken.create=true \
  --set hfToken.value="$HF_TOKEN"

# Wait for rollout
kubectl -n sie get pods -w
```

### Custom Values
```yaml
router:
  replicas: 3

workers:
  common:
    bundle: default
    cacheVolumeSize: 100Gi
    clusterCache:
      enabled: true
      url: gs://my-bucket/models
  pools:
    l4:
      enabled: true
      minReplicas: 1
      maxReplicas: 20

autoscaling:
  enabled: true
  cooldownPeriod: 300

ingress:
  enabled: true
  host: sie.example.com
  tls:
    enabled: true
    secretName: sie-tls

auth:
  enabled: true
  oauth2Proxy:
    oidcIssuerUrl: https://auth.example.com/realms/sie

serviceMonitor:
  enabled: true
```

### Upgrade
Section titled “Upgrade”helm upgrade sie oci://ghcr.io/superlinked/charts/sie-cluster \ --version 0.1.10 \ -n sie \ -f helm-values.yamlVerify
```bash
# Check pods
kubectl get pods -n sie

# Check router logs
kubectl logs -n sie -l app.kubernetes.io/component=router

# Port-forward and run a smoke test against the router
kubectl -n sie port-forward svc/sie-router 8080:8080 &

python3 -c "
from sie_sdk import SIEClient

client = SIEClient('http://localhost:8080')
result = client.encode('BAAI/bge-m3', {'text': 'hello world'}, gpu='l4',
                       wait_for_capacity=True, provision_timeout_s=600)
print(result['dense'].shape)  # (1024,)
"
```

The first request after scale-from-zero takes ~5–10 minutes (node provisioning + image pull + model loading). See Scale-from-Zero for the full flow.
### Cleanup

```bash
helm uninstall sie -n sie
terraform destroy
```

## Access + Auth
Section titled “Access + Auth”- Ingress controller: use ingress-nginx for public or private access.
- Public vs private: set ingress-nginx service annotations for internal LBs on GKE.
- Auth options:
  - OIDC (oauth2-proxy) with external IdP or Dex.
  - Static token (router-level) for OSS/self-hosted without IdP.
  - No auth + private ingress (internal LB).
```hcl
module "sie_gke" {
  # Turnkey ingress controller
  install_ingress_nginx = true

  # Private LB example
  ingress_nginx_service_annotations = {
    "cloud.google.com/load-balancer-type" = "Internal"
  }

  # Router ingress + auth
  sie_ingress_enabled         = true
  sie_ingress_host            = "sie.example.com"
  sie_ingress_tls_enabled     = true
  sie_ingress_tls_secret_name = "sie-tls"

  sie_auth_enabled         = true
  sie_auth_oidc_issuer_url = "https://auth.example.com/realms/sie"
  sie_auth_secret_name     = "oauth2-proxy"

  # Static token mode (alternative to OIDC)
  sie_router_auth_mode        = "static"
  sie_router_auth_secret_name = "sie-router-auth"
}
```

Debug-only access via port-forward is still possible, but production paths should use ingress.
## What’s Next

- Upgrade Runbook - pre-upgrade checklist, rolling updates, and rollback
- Scale-from-Zero - understanding the 202 flow and cold starts
- Kubernetes in AWS - equivalent EKS deployment
- Monitoring & Observability - metrics, logging, and dashboards