
Kubernetes in AWS

Deploy SIE to Amazon EKS with GPU node pools, KEDA autoscaling, and Terraform automation.

The architecture mirrors the GCP deployment, using the same router-worker setup with KEDA autoscaling:

*EKS cluster architecture with Router, L4 and A100 worker pools, KEDA, and Prometheus*

Components:

  • EKS Cluster with managed node groups for GPU instances
  • NVIDIA Device Plugin for GPU scheduling
  • IRSA (IAM Roles for Service Accounts) for S3 access
  • KEDA for autoscaling based on queue depth metrics
  • Prometheus + Grafana + DCGM Exporter for observability
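
KEDA scales the worker Deployments from queue depth rather than CPU. The chart manages its own ScaledObjects, but the shape is roughly the following (a sketch only — the Deployment name, Prometheus address, metric name, and threshold are illustrative, not the chart's actual values):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sie-worker-l4
spec:
  scaleTargetRef:
    name: sie-worker-l4          # hypothetical worker Deployment
  minReplicaCount: 0             # enables scale-from-zero
  maxReplicaCount: 5
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring:9090
        query: sum(sie_queue_depth{pool="l4"})   # illustrative metric
        threshold: "10"
```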

The examples/quickstart directory consumes the published superlinked/sie/aws Terraform registry module — the same module used in production deployments, pinned to a known-good version.

Prerequisites:

  1. AWS account with appropriate permissions
  2. EC2 quota for g6.xlarge (NVIDIA L4) in your target region (default: eu-central-1)
  3. Terraform >= 1.14 and AWS CLI v2 configured
```sh
cd deploy/terraform/aws/examples/quickstart

# Initialize and apply (creates an EKS cluster — ~15-20 min)
terraform init
terraform apply
```

The quickstart main.tf pins the module version:

```hcl
module "sie_eks" {
  source  = "superlinked/sie/aws"
  version = "0.1.10"

  aws_region   = var.aws_region
  project_name = var.project_name

  gpu_node_groups = [
    {
      name          = "l4-spot"
      instance_type = "g6.xlarge"
      capacity_type = "SPOT"
      min_size      = 0
      max_size      = 5
      labels = {
        "sie.superlinked.com/gpu-type" = "nvidia-l4"
      }
    },
  ]
}
```

The Terraform module provisions:

| Resource | Purpose |
| --- | --- |
| EKS Cluster | Kubernetes control plane |
| GPU Node Group | Auto-scaling g6.xlarge L4 spot instances (0–5 nodes) |
| NVIDIA Device Plugin | GPU scheduling in Kubernetes |
| IRSA Role | Workload identity for SIE pods (no static AWS credentials) |
| ECR Repositories | Private registries for custom sie-server/sie-router images |
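
With the device plugin installed, GPU scheduling uses the standard Kubernetes extended-resource mechanism: a pod requests `nvidia.com/gpu` and pins itself to the right pool via the node label from the Terraform config. The chart generates this for the worker pods; a minimal hand-written equivalent, useful as a smoke test for the node group itself, would look like (pod name is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                          # illustrative name
spec:
  nodeSelector:
    sie.superlinked.com/gpu-type: nvidia-l4     # label set by the node group
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1                     # claim one L4 via the device plugin
  restartPolicy: Never
```

If this pod schedules and `nvidia-smi` prints an L4, the node group, device plugin, and labels are all wired correctly.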

Once the cluster is up, configure kubectl and install the sie-cluster chart. The chart bundles KEDA, kube-prometheus-stack, and DCGM Exporter as sub-charts — no manual installs required.

```sh
# Configure kubectl from the terraform output
$(terraform output -raw kubectl_config_command)

# Install SIE (pulls the chart from GHCR, applies the quickstart values,
# wires up IRSA from the terraform output)
IRSA_ARN=$(terraform output -raw sie_irsa_role_arn)
helm upgrade --install sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --version 0.1.10 \
  -f helm-values.yaml \
  -n sie --create-namespace \
  --set "serviceAccount.annotations.eks\.amazonaws\.com/role-arn=$IRSA_ARN" \
  --set hfToken.create=true \
  --set hfToken.value="$HF_TOKEN"

# Wait for rollout
kubectl -n sie get pods -w
```

Set HF_TOKEN beforehand if you need gated models. For the smoke test below (BAAI/bge-m3) it is optional — in that case, omit both --set hfToken.create=true and --set hfToken.value=... entirely (leaving HF_TOKEN unset with the flags present creates an empty-token secret that will fail later on any gated-model request).
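
The `--set` flags can also live in your values file, which avoids shell-escaping the dotted annotation key. A sketch, assuming the value paths mirror the flags above (the role ARN is a placeholder — use your own `terraform output`):

```yaml
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/sie-irsa   # placeholder
hfToken:
  create: true
  value: hf_xxx   # only for gated models; omit the whole hfToken block otherwise
```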

```sh
kubectl -n sie port-forward svc/sie-router 8080:8080 &

python3 -c "
from sie_sdk import SIEClient

client = SIEClient('http://localhost:8080')
result = client.encode('BAAI/bge-m3', {'text': 'hello world'},
                       gpu='l4', wait_for_capacity=True, provision_timeout_s=600)
print(result['dense'].shape)  # (1024,)
"
```

The first request after scale-from-zero takes ~5–10 minutes (node provisioning + image pull + model loading). See Scale-from-Zero for the full flow.

```sh
# Remove the Helm release first so AWS load balancers created by the
# chart are cleaned up before terraform tears down the VPC
helm uninstall sie -n sie
terraform destroy
```

| Feature | GCP (GKE) | AWS (EKS) |
| --- | --- | --- |
| GPU scheduling | Native GKE support | NVIDIA Device Plugin required |
| IAM for pods | Workload Identity | IRSA |
| Model cache storage | GCS (gs://) | S3 (s3://) |
| Node provisioning | GKE Autopilot / NAP | Karpenter or Cluster Autoscaler |
| Spot instances | Spot VMs | Spot Instances |
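
If you use Karpenter rather than the Cluster Autoscaler on the EKS side, the L4 spot pool from the Terraform example translates to a NodePool along these lines (a sketch against the Karpenter v1 API; the EC2NodeClass name is illustrative and must exist in your cluster):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: l4-spot
spec:
  template:
    metadata:
      labels:
        sie.superlinked.com/gpu-type: nvidia-l4
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                    # illustrative
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g6.xlarge"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
  limits:
    nvidia.com/gpu: 5                    # caps the pool at five L4s
```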

Configure the cluster cache to use S3:

```yaml
workers:
  common:
    clusterCache:
      enabled: true
      url: s3://my-bucket/models
```

IRSA handles authentication automatically; no access keys are needed in the pod.


The default Terraform configuration exposes the API endpoint publicly. For production:

  • Restrict ingress to your VPC CIDR or specific IP ranges
  • Enable authentication via oauth2-proxy or static tokens
  • Use a private load balancer for internal-only access:

```yaml
ingress:
  enabled: true
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
```

For simpler deployments, run SIE directly on a GPU EC2 instance:

```sh
# On a g6.xlarge (NVIDIA L4) instance
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

docker run --gpus all -p 8080:8080 \
  -v ~/.cache/huggingface:/app/.cache/huggingface \
  ghcr.io/superlinked/sie-server:default
```

This is simpler than EKS and suitable for single-instance production workloads.

