# Kubernetes in AWS

Deploy SIE to Amazon EKS with GPU node pools, KEDA autoscaling, and Terraform automation.
## Architecture

The architecture mirrors the GCP deployment: a router-worker setup with KEDA autoscaling.
Components:
- EKS Cluster with managed node groups for GPU instances
- NVIDIA Device Plugin for GPU scheduling
- IRSA (IAM Roles for Service Accounts) for S3 access
- KEDA for autoscaling based on queue depth metrics
- Prometheus + Grafana + DCGM Exporter for observability
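To illustrate how KEDA drives the worker replica count from queue depth, here is a hypothetical `ScaledObject` sketch. The resource names, metric name, and Prometheus query are assumptions for illustration, not the sie-cluster chart's actual configuration:

```yaml
# Illustrative sketch only — names and the query are placeholders.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sie-worker-l4
  namespace: sie
spec:
  scaleTargetRef:
    name: sie-worker-l4        # hypothetical worker Deployment
  minReplicaCount: 0           # enables scale-from-zero
  maxReplicaCount: 5           # matches the GPU node group's max_size
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(sie_queue_depth{gpu="l4"})   # hypothetical metric
        threshold: "10"
```

With `minReplicaCount: 0`, KEDA scales the Deployment to zero replicas when the queue is empty, which is what makes the cold-start flow described under Smoke Test possible.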
## Terraform Setup

The examples/quickstart directory consumes the published `superlinked/sie/aws` Terraform registry module — the same module used in production deployments, pinned to a known-good version.
## Prerequisites

- AWS account with appropriate permissions
- EC2 quota for `g6.xlarge` (NVIDIA L4) in your target region (default: `eu-central-1`)
- Terraform >= 1.14 and AWS CLI v2 configured
## Deploy

```sh
cd deploy/terraform/aws/examples/quickstart

# Initialize and apply (creates an EKS cluster — ~15-20 min)
terraform init
terraform apply
```

The quickstart main.tf pins the module version:
```hcl
module "sie_eks" {
  source  = "superlinked/sie/aws"
  version = "0.1.10"

  aws_region   = var.aws_region
  project_name = var.project_name

  gpu_node_groups = [
    {
      name          = "l4-spot"
      instance_type = "g6.xlarge"
      capacity_type = "SPOT"
      min_size      = 0
      max_size      = 5
      labels = {
        "sie.superlinked.com/gpu-type" = "nvidia-l4"
      }
    },
  ]
}
```

## What Gets Created

The Terraform module provisions:
| Resource | Purpose |
|---|---|
| EKS Cluster | Kubernetes control plane |
| GPU Node Group | Auto-scaling g6.xlarge L4 spot instances (0–5 nodes) |
| NVIDIA Device Plugin | GPU scheduling in Kubernetes |
| IRSA Role | Workload identity for SIE pods (no static AWS credentials) |
| ECR Repositories | Private registries for custom sie-server/sie-router images |
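IRSA avoids static credentials by federating the cluster's OIDC issuer into an IAM role trust policy, so only the annotated service account can assume the role. A minimal sketch of such a trust policy — the account ID, OIDC provider ID, and service account name are placeholders, and the module creates its own equivalent:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.eu-central-1.amazonaws.com/id/EXAMPLE"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.eu-central-1.amazonaws.com/id/EXAMPLE:sub": "system:serviceaccount:sie:sie"
        }
      }
    }
  ]
}
```

The `:sub` condition is what scopes the role to a single namespace/service-account pair; this is the mechanism behind the `eks.amazonaws.com/role-arn` annotation set during Helm installation.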
## Helm Installation

Once the cluster is up, configure kubectl and install the sie-cluster chart. The chart bundles KEDA, kube-prometheus-stack, and DCGM Exporter as sub-charts — no manual installs required.
```sh
# Configure kubectl from the terraform output
$(terraform output -raw kubectl_config_command)

# Install SIE (pulls the chart from GHCR, applies the quickstart values,
# wires up IRSA from the terraform output)
IRSA_ARN=$(terraform output -raw sie_irsa_role_arn)
helm upgrade --install sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --version 0.1.10 \
  -f helm-values.yaml \
  -n sie --create-namespace \
  --set "serviceAccount.annotations.eks\.amazonaws\.com/role-arn=$IRSA_ARN" \
  --set hfToken.create=true \
  --set hfToken.value="$HF_TOKEN"

# Wait for rollout
kubectl -n sie get pods -w
```

Set HF_TOKEN beforehand if you need gated models. For the smoke test below (BAAI/bge-m3) it is optional — in that case, omit both `--set hfToken.create=true` and `--set hfToken.value=...` entirely (leaving HF_TOKEN unset with the flags present creates an empty-token secret that will fail later on any gated-model request).
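The quickstart's helm-values.yaml is not reproduced on this page; a hypothetical minimal values file, built only from keys that appear elsewhere in this guide, might look like the following (the structure is an assumption about the chart's values schema):

```yaml
# Hypothetical helm-values.yaml sketch — verify against the chart's
# actual values schema before use.
workers:
  common:
    clusterCache:
      enabled: true
      url: s3://my-bucket/models   # placeholder bucket

ingress:
  enabled: true
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
```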
## Smoke Test

```sh
kubectl -n sie port-forward svc/sie-router 8080:8080 &

python3 -c "
from sie_sdk import SIEClient

client = SIEClient('http://localhost:8080')
result = client.encode('BAAI/bge-m3', {'text': 'hello world'},
                       gpu='l4', wait_for_capacity=True, provision_timeout_s=600)
print(result['dense'].shape)  # (1024,)
"
```

The first request after scale-from-zero takes ~5–10 minutes (node provisioning + image pull + model loading). See Scale-from-Zero for the full flow.
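The behaviour implied by `wait_for_capacity`/`provision_timeout_s` — keep retrying while capacity is provisioned, then give up — can be approximated with a generic polling loop. This sketch is independent of the SDK; `fake_encode` is a stand-in for a request that fails until capacity exists:

```python
import time

def wait_until(predicate, timeout_s=600, interval_s=5):
    """Poll `predicate` until it returns a truthy value or `timeout_s` elapses.

    Returns the predicate's value; raises TimeoutError on expiry.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s}s")
        time.sleep(interval_s)

# Example: a stand-in "encode" call that succeeds on the third poll,
# the way a real request succeeds once a GPU node has joined.
attempts = {"n": 0}
def fake_encode():
    attempts["n"] += 1
    return {"dense_dim": 1024} if attempts["n"] >= 3 else None

result = wait_until(fake_encode, timeout_s=10, interval_s=0.01)
print(result["dense_dim"])  # 1024
```

A fixed poll interval is fine for cold starts measured in minutes; exponential backoff matters more when many clients poll simultaneously.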
## Cleanup

```sh
helm uninstall sie -n sie
terraform destroy
```

## Differences from GCP

| Feature | GCP (GKE) | AWS (EKS) |
|---|---|---|
| GPU scheduling | Native GKE support | NVIDIA Device Plugin required |
| IAM for pods | Workload Identity | IRSA |
| Model cache storage | GCS (gs://) | S3 (s3://) |
| Node provisioning | GKE Autopilot / NAP | Karpenter or Cluster Autoscaler |
| Spot instances | Spot VMs | Spot Instances |
## S3 for Model Cache

Configure the cluster cache to use S3:

```yaml
workers:
  common:
    clusterCache:
      enabled: true
      url: s3://my-bucket/models
```

IRSA handles authentication automatically: no access keys needed in the pod.
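For IRSA to grant the pods access to the cache bucket, the role needs S3 permissions scoped to that bucket. A minimal policy sketch — the bucket name is a placeholder, and the Terraform module may already attach an equivalent policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ]
    }
  ]
}
```

Note that `s3:ListBucket` applies to the bucket ARN while `GetObject`/`PutObject` apply to object ARNs, which is why both resource forms are listed.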
## Security Considerations

The default Terraform configuration exposes the API endpoint publicly. For production:
- Restrict ingress to your VPC CIDR or specific IP ranges
- Enable authentication via oauth2-proxy or static tokens
- Use a private load balancer for internal-only access:
```yaml
ingress:
  enabled: true
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
```

## Docker on AWS (Alternative)

For simpler deployments, run SIE directly on a GPU EC2 instance:
```sh
# On a g6.xlarge (NVIDIA L4) instance
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

docker run --gpus all -p 8080:8080 \
  -v ~/.cache/huggingface:/app/.cache/huggingface \
  ghcr.io/superlinked/sie-server:default
```

This is simpler than EKS and suitable for single-instance production workloads.
## What’s Next

- Upgrade Runbook - pre-upgrade checklist, rolling updates, and rollback
- Scale-from-Zero - understanding the 202 flow and cold starts
- Monitoring - metrics, alerts, and dashboards
- Kubernetes in GCP - equivalent GKE deployment