Kubernetes in GCP

Deploy SIE to GKE with GPU node pools, KEDA autoscaling, and Terraform automation.

There are two install paths for GKE. Confirm the items under the path you plan to take before running any commands.

Path A. Terraform and Helm (module provisions the cluster)

  1. GCP project with billing enabled.

  2. IAM permissions on the project sufficient to create VPC, GKE, IAM, and Artifact Registry resources. roles/owner works; for a least-privilege setup combine roles/container.admin, roles/compute.admin, roles/iam.serviceAccountAdmin, roles/resourcemanager.projectIamAdmin, and roles/artifactregistry.admin.

  3. GPU quota for nvidia-l4 in your region. The dev-l4-spot example uses spot, so check PREEMPTIBLE_NVIDIA_L4_GPUS. A quota of at least 5 covers the example’s max of 5 nodes × 1 GPU.

    gcloud compute regions describe REGION \
    --format='table(quotas.filter(metric:NVIDIA))'
  4. Required GCP APIs enabled:

    gcloud services enable \
    container.googleapis.com \
    compute.googleapis.com \
    artifactregistry.googleapis.com \
    iam.googleapis.com
  5. Local tooling: Terraform ≥ 1.14, gcloud CLI, kubectl, and helm ≥ 3.13.

  6. Authenticated:

    gcloud auth application-default login

Path B. Helm into an existing GKE cluster

  1. Cluster meets the generic Kubernetes Cluster Prerequisites (Kubernetes version, GPU device plugin, ingress controller, network egress).
  2. GPU node pool with the nvidia-l4, nvidia-tesla-a100, or nvidia-a100-80gb accelerator and the cloud.google.com/gke-accelerator node label. The chart’s pool defaults match GKE-managed GPU pool labels.
  3. Workload Identity enabled on the cluster, with a GCP service account that can read your model-cache GCS bucket. The chart’s Kubernetes ServiceAccount is named sie-server and must be annotated with iam.gke.io/gcp-service-account=<your-gsa-email>.
  4. Artifact Registry decision. Let the chart’s images pull from public GHCR (default), or mirror to a private Artifact Registry repo and override image.repository per component.
  5. kubectl authenticated against the target cluster (gcloud container clusters get-credentials ...).

SIE runs as a gateway/config/worker architecture on Kubernetes:

Figure: GKE cluster architecture with Gateway, Config service, L4 and A100 worker pools, KEDA, and Prometheus.

Components:

  • Gateway - Stateless Rust inference edge that routes requests to GPU-specific worker pools through NATS JetStream
  • Config service - Single-writer control plane for runtime model configuration
  • Worker Pools - StatefulSets grouped by GPU type (L4, A100-40GB, A100-80GB)
  • KEDA - Scales worker pools from zero based on queue depth metrics
  • Prometheus - Provides metrics for autoscaling decisions

The gateway is a stateless Rust service that handles GPU-aware routing:

| Feature | Description |
| --- | --- |
| GPU Routing | Routes requests to appropriate GPU pool via X-SIE-MACHINE-PROFILE header |
| Pool Routing | Supports tenant isolation via X-SIE-Pool header |
| Queue Routing | Publishes work to the selected pool’s NATS JetStream queue |
| Config Reads | Mirrors model and bundle state from sie-config |
| 202 Responses | Returns Retry-After when GPU capacity is provisioning |

The gateway runs as a Deployment with 2+ replicas for high availability.

gateway:
  replicas: 2
  resources:
    requests:
      cpu: "500m"
      memory: "512Mi"
    limits:
      cpu: "2"
      memory: "2Gi"

Each GPU type runs as a separate StatefulSet with persistent storage for model caching.

| Pool | GPU | VRAM | Use Case |
| --- | --- | --- | --- |
| l4 | NVIDIA L4 | 24GB | Standard inference, best price/performance |
| a100-40gb | NVIDIA A100 | 40GB | Large models, high throughput |
| a100-80gb | NVIDIA A100 | 80GB | Very large models (7B+ parameters) |

Worker configuration:

workers:
  pools:
    l4:
      enabled: true
      minReplicas: 0 # Scale to zero when idle
      maxReplicas: 10
      gpuType: l4
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      gpu:
        count: 1
        product: NVIDIA-L4
      resources:
        requests:
          cpu: "4"
          memory: "16Gi"

Workers use a 300Gi emptyDir volume for model cache. Models load on first request.


Specify the target GPU type using the X-SIE-MACHINE-PROFILE header or SDK parameter.

curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -d '{"items": [{"text": "Hello world"}]}'

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://sie.example.com")

# Route to L4 pool
result = client.encode(
    "BAAI/bge-m3",
    Item(text="Hello world"),
    gpu="l4"
)

# Route to A100 pool for large models
result = client.encode(
    "intfloat/e5-mistral-7b-instruct",
    Item(text="Hello world"),
    gpu="a100-40gb"
)

| GPU Type | Header Value | Machine Type |
| --- | --- | --- |
| NVIDIA L4 | l4 | g2-standard-8 |
| NVIDIA A100 40GB | a100-40gb | a2-highgpu-1g |
| NVIDIA A100 80GB | a100-80gb | a2-ultragpu-1g |

Resource pools provide tenant isolation by reserving dedicated workers.

Declare a pool explicitly with the SDK; the pool itself is created lazily on the first request:

from sie_sdk import SIEClient
from sie_sdk.types import Item

# Client with dedicated pool (2 L4 workers reserved)
client = SIEClient("http://sie.example.com")
client.create_pool("tenant-abc", {"l4": 2})

# First request creates the pool, subsequent requests reuse it
result = client.encode(
    "BAAI/bge-m3",
    Item(text="Hello world"),
    gpu="tenant-abc/l4" # pool_name/gpu_type
)

# Check pool status
info = client.get_pool("tenant-abc")
print(f"Pool {info['name']}: {info['status']['state']}")

# Explicit cleanup (optional - pools are GC'd after inactivity)
client.delete_pool("tenant-abc")

Over plain HTTP, select the pool with the X-SIE-Pool header:

curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -H "X-SIE-Pool: tenant-abc" \
  -d '{"items": [{"text": "Hello world"}]}'
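The SDK selector gpu="tenant-abc/l4" and the two headers in the curl call appear to describe the same target. As a minimal sketch, assuming that correspondence holds, the helper below splits a pool_name/gpu_type selector into the two routing headers and assembles the request pieces using only the standard library. The names selector_to_headers and build_encode_request are hypothetical and not part of the SIE SDK; nothing is sent over the network.

```python
import json

# Header values listed in the GPU type table on this page.
KNOWN_PROFILES = {"l4", "a100-40gb", "a100-80gb"}


def selector_to_headers(selector: str) -> dict[str, str]:
    """Split 'gpu_type' or 'pool_name/gpu_type' into routing headers."""
    pool, _, profile = selector.rpartition("/")
    if profile not in KNOWN_PROFILES:
        raise ValueError(f"unknown machine profile: {profile!r}")
    headers = {"X-SIE-MACHINE-PROFILE": profile}
    if pool:
        # Assumption: the pool part of the selector maps to X-SIE-Pool.
        headers["X-SIE-Pool"] = pool
    return headers


def build_encode_request(base_url: str, model: str, text: str, selector: str):
    """Return (url, headers, body) for POST /v1/encode/<model>."""
    url = f"{base_url.rstrip('/')}/v1/encode/{model}"
    headers = {"Content-Type": "application/json", **selector_to_headers(selector)}
    body = json.dumps({"items": [{"text": text}]}).encode("utf-8")
    return url, headers, body
```

The returned tuple can be handed to any HTTP client. Rejecting unknown profiles locally gives a clearer error than a failed routing attempt at the gateway.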

The SDK handles lease renewal automatically. Pools are garbage collected after inactivity.
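Because explicit cleanup is optional, it is easy to forget in short-lived jobs. One way to pair creation with deletion is a context manager around the create_pool and delete_pool calls shown in the SDK example. This is a sketch only: tenant_pool is a hypothetical helper, and it assumes the client exposes those two methods with the argument shapes used on this page.

```python
from contextlib import contextmanager


@contextmanager
def tenant_pool(client, name: str, reservation: dict[str, int], gpu_type: str):
    """Create a pool, yield its 'pool_name/gpu_type' selector, then delete it."""
    client.create_pool(name, reservation)
    try:
        yield f"{name}/{gpu_type}"
    finally:
        # Explicit cleanup; idle pools are also garbage collected eventually.
        client.delete_pool(name)


# Usage, with a real SIEClient instance named `client`:
#
#     with tenant_pool(client, "tenant-abc", {"l4": 2}, "l4") as gpu:
#         client.encode("BAAI/bge-m3", Item(text="Hello world"), gpu=gpu)
```

The finally clause runs even when a request inside the block raises, so the reservation is released without waiting for garbage collection.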


KEDA scales worker pools based on queue depth metrics from Prometheus.

When no workers are running and a request arrives:

  1. Gateway returns 202 Accepted with Retry-After: 120 header
  2. Gateway records pending demand metric
  3. KEDA detects queue depth > activation threshold
  4. GKE provisions GPU node (60-120 seconds)
  5. Worker pod starts and registers with the gateway
  6. Client retries and request succeeds

autoscaling:
  enabled: true
  prometheusAddress: http://prometheus-operated.monitoring.svc:9090
  pollingInterval: 15          # Check metrics every 15s
  cooldownPeriod: 900          # Wait 15 min before scaling to zero
  scaleDownStabilization: 300  # 5 min stabilization window
  queueDepthThreshold: 10      # Scale up at 10 pending requests/pod
  queueDepthActivation: 2      # Activate from zero at 2 requests
  fallbackReplicas: 2          # Fallback if Prometheus unavailable
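To build intuition for how queueDepthThreshold and queueDepthActivation interact, the following sketch computes a replica target from a queue depth. It is a simplified model, not KEDA's implementation: it assumes the threshold acts as a per-pod average target (replicas = ceil(depth / threshold)) and that a pool at zero only activates when depth exceeds the activation value. It ignores polling, stabilization windows, and the cooldown period. The function name desired_replicas is hypothetical.

```python
import math


def desired_replicas(queue_depth: int, current: int, min_replicas: int = 0,
                     max_replicas: int = 10, threshold: int = 10,
                     activation: int = 2) -> int:
    """Simplified replica target: ceil(depth / threshold), clamped to pool bounds."""
    if current == 0 and queue_depth <= activation:
        # At or below the activation value, a scaled-to-zero pool stays at its floor.
        return min_replicas
    target = math.ceil(queue_depth / threshold)
    # An active pool keeps at least one replica; scaling back to zero is
    # governed by the cooldown period, which this sketch does not model.
    return max(min_replicas, 1, min(target, max_replicas))


if __name__ == "__main__":
    for depth, current in [(2, 0), (3, 0), (25, 1), (1000, 4)]:
        print(depth, current, desired_replicas(depth, current))
```

With the defaults above, a depth of 25 with one running worker yields a target of 3, and very deep queues are capped at maxReplicas.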

When scaling from zero, expect these timelines:

| Phase | Duration | What Happens |
| --- | --- | --- |
| Node provisioning | 2-5 min | GKE finds a GPU node (spot may take longer) |
| Container startup | 20-40s | Pull image, start process |
| Model loading | 10-120s | Load weights to GPU (from cache or HuggingFace) |

Total: 3-7 minutes from first request to first response. See Scale-from-Zero for the full flow and troubleshooting.
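Clients that call the HTTP API directly need to ride out this window themselves. The sketch below shows one way to do that: keep retrying while the gateway answers 202, sleep for the Retry-After value, and give up after a fixed budget. It is an illustration under assumptions, not the SDK's logic: retry_until_ready is a hypothetical name, send stands in for whatever HTTP call you use and must return (status, headers, body), and the 600-second budget simply mirrors the provision_timeout_s value used in the smoke test on this page.

```python
import time


def retry_until_ready(send, budget_s: float = 600.0, default_retry_s: float = 120.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Call send() until it stops returning 202 or the time budget runs out."""
    deadline = clock() + budget_s
    while True:
        status, headers, body = send()
        if status != 202:
            return status, body
        # Honor Retry-After when it is a number of seconds; otherwise
        # fall back to the 120 s value the gateway is documented to send.
        try:
            delay = float(headers.get("Retry-After", default_retry_s))
        except ValueError:
            delay = default_retry_s
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("GPU capacity was not ready within the budget")
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
```

Injecting sleep and clock keeps the loop testable without real waiting. A budget shorter than the expected scale-from-zero time will raise TimeoutError on most cold starts.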

GPU nodes scale to zero during idle periods. Configure cooldown based on your traffic patterns:

  • Consistent traffic: Lower cooldown (300s) for responsive scaling
  • Bursty traffic: Higher cooldown (900s) to avoid thrashing
  • Dev/test: Use spot instances for 60-70% cost savings

The examples/dev-l4-spot example in superlinked/terraform-google-sie provisions a complete GKE cluster with an L4 spot GPU pool via the published superlinked/sie/google Terraform registry module.

See Path A. Terraform and Helm in the Prerequisites section at the top of this page.

git clone https://github.com/superlinked/terraform-google-sie.git
cd terraform-google-sie/examples/dev-l4-spot
# Set project ID
export TF_VAR_project_id="your-project-id"
# Initialize Terraform
terraform init
# Review changes
terraform plan
# Deploy cluster (15-20 minutes)
terraform apply
# Get credentials
$(terraform output -raw kubectl_command)
# Verify cluster
kubectl get nodes

Key configuration options for the superlinked/sie/google module:

| Variable | Default | Description |
| --- | --- | --- |
| project_id | (required) | GCP project ID |
| region | us-central1 | GKE cluster region |
| cluster_name | sie-dev | Name of the GKE cluster |
| gpu_node_pools | L4 pool | List of GPU node pool configurations |
| create_artifact_registry | true | Provision an Artifact Registry for custom images |
| deployer_service_account | "" | Email of the SA running Terraform (optional, for CI/CD) |

module "sie_gke" {
  source  = "superlinked/sie/google"
  version = "0.3.4"

  project_id   = "my-project"
  region       = "us-central1"
  cluster_name = "sie-prod"

  gpu_node_pools = [
    {
      name           = "l4-pool"
      machine_type   = "g2-standard-8"
      gpu_type       = "nvidia-l4"
      gpu_count      = 1
      min_node_count = 1 # Keep 1 warm
      max_node_count = 20
      spot           = false
    },
    {
      name           = "a100-pool"
      machine_type   = "a2-highgpu-1g"
      gpu_type       = "nvidia-tesla-a100"
      gpu_count      = 1
      min_node_count = 0
      max_node_count = 10
      spot           = true
    }
  ]
}

Deploy SIE to an existing GKE cluster using Helm. The chart packages KEDA, kube-prometheus-stack, DCGM Exporter, Loki, and Alloy as optional sub-charts; they default to install: false. The smoke test below works with just the core services (gateway, config, worker, NATS). To enable the KEDA-based autoscaling and the observability stack described elsewhere on this page, add the following to the install command:

--set keda.install=true \
--set autoscaling.enabled=true \
--set kube-prometheus-stack.install=true \
--set dcgm-exporter.install=true

See Path B. Helm into an existing GKE cluster at the top of this page. For gated models, export HF_TOKEN first; it is optional for the BAAI/bge-m3 smoke test. If you do not need a token, omit both --set hfToken.create=true and --set hfToken.value=... entirely: leaving the flags in place with HF_TOKEN unset creates an empty-token secret that fails later on any gated-model request.

Extract the Workload Identity service-account email from the Terraform output and wire it into the chart via --set. The example also enables the L4 worker pool explicitly, because the chart’s worker pools default to enabled: false.

# The `workload_identity_annotation` output is the full `key=email` pair;
# strip the prefix to get just the SA email for the --set value.
WI_SA=$(terraform output -raw workload_identity_annotation | cut -d= -f2)

helm upgrade --install sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --version 0.3.4 \
  -n sie --create-namespace \
  --set "serviceAccount.annotations.iam\.gke\.io/gcp-service-account=$WI_SA" \
  --set workers.pools.l4.enabled=true \
  --set workers.pools.l4.minReplicas=1 \
  --set hfToken.create=true \
  --set hfToken.value="$HF_TOKEN"

# Wait for rollout
kubectl -n sie get pods -w

minReplicas: 1 keeps one L4 worker always running, which is the simplest path to a working smoke test without KEDA installed. For true scale-from-zero, additionally pass --set keda.install=true --set autoscaling.enabled=true and set minReplicas: 0.

# custom-values.yaml
gateway:
  replicas: 3
workers:
  common:
    bundle: default
    cacheVolumeSize: 100Gi
    clusterCache:
      enabled: true
      url: gs://my-bucket/models
  pools:
    l4:
      enabled: true
      minReplicas: 1
      maxReplicas: 20
autoscaling:
  enabled: true
  cooldownPeriod: 300
ingress:
  enabled: true
  host: sie.example.com
  tls:
    enabled: true
    secretName: sie-tls
auth:
  enabled: true
  oauth2Proxy:
    oidcIssuerUrl: https://auth.example.com/realms/sie
serviceMonitor:
  enabled: true

helm upgrade sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --version 0.3.4 \
  -n sie

# Check pods
kubectl get pods -n sie

# Check gateway logs
kubectl logs -n sie -l app.kubernetes.io/component=gateway

# Port-forward the gateway and run a smoke test
kubectl -n sie port-forward svc/sie-sie-cluster-gateway 8080:8080 &

# Install the Python SDK (requires Python 3.12; see the SDK README for newer/older Python notes)
pip install sie-sdk

python3 -c "
from sie_sdk import SIEClient
client = SIEClient('http://localhost:8080')
result = client.encode('BAAI/bge-m3', {'text': 'hello world'},
                       gpu='l4', wait_for_capacity=True, provision_timeout_s=600)
print(result['dense'].shape) # (1024,)
"

The first request after scale-from-zero takes ~5–10 minutes (node provisioning + image pull + model loading). See Scale-from-Zero for the full flow.

helm uninstall sie -n sie
terraform destroy

  • Ingress controller: use ingress-nginx for public or private access.
  • Public vs private: set ingress-nginx service annotations for internal LBs on GKE.
  • Auth options:
    • OIDC (oauth2-proxy) with external IdP or Dex.
    • Static token (gateway-level) for OSS/self-hosted without IdP.
    • No auth + private ingress (internal LB).

# Static token mode for self-hosted clusters
kubectl create secret generic sie-auth-tokens -n sie \
  --from-literal=SIE_AUTH_TOKEN="key1,key2,key3"

helm upgrade sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --version 0.3.4 \
  -n sie \
  --set gateway.auth.mode=static \
  --set gateway.auth.tokenSecretName=sie-auth-tokens
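The placeholder value "key1,key2,key3" suggests the secret holds a comma-separated list of accepted tokens. Assuming that format, a minimal sketch for generating strong random tokens with the standard library looks like this. make_token_list is a hypothetical helper; the token length is a choice made here, not a requirement stated by the chart.

```python
import secrets


def make_token_list(count: int = 3, nbytes: int = 32) -> str:
    """Return `count` random URL-safe tokens joined by commas."""
    if count < 1:
        raise ValueError("count must be at least 1")
    # token_urlsafe emits only [A-Za-z0-9_-], so a token never contains a comma.
    return ",".join(secrets.token_urlsafe(nbytes) for _ in range(count))


if __name__ == "__main__":
    # Print a value suitable for --from-literal=SIE_AUTH_TOKEN=...
    print(make_token_list())
```

Issuing several tokens at once makes rotation easier: hand out one per client, then drop a single entry from the list when that client's access should end.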

Debug-only access via port-forward is still possible, but production paths should use ingress.

