
How to deploy SIE

SIE deploys as a single container with no external dependencies. There are two deployment paths: Docker for simplicity, and Kubernetes for scaling and high availability. Both use the same image. There is no separate production build.


Use Docker if:

  • You are running on a single server or VM
  • You are in development or running a low-traffic service
  • You want the simplest possible setup

Use Kubernetes if:

  • You need horizontal scaling or autoscaling to zero
  • You need high availability across multiple nodes
  • You are deploying on GCP or AWS with GPU node pools

| | Docker | Kubernetes |
|---|---|---|
| Setup time | Minutes | Hours |
| Scaling | Manual | Automatic |
| High availability | No | Yes |
| Scale-to-zero | No | Yes |
| Best for | Dev, single-server | Production, high traffic |

See Kubernetes on GCP and Kubernetes on AWS for cloud-specific guides.


The fastest way to run SIE is a single docker run:

# CPU only
docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default
# With GPU (recommended)
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default

The server starts on port 8080. Models load on first request with no pre-configuration needed.
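A small wrapper can choose between the two image tags above. The sketch below is a convenience under stated assumptions, not part of SIE: `pick_sie_image` is a hypothetical helper name, and it treats a working `nvidia-smi` on the host as a proxy for GPU availability. It does not check that the NVIDIA Container Toolkit is installed, which `--gpus all` also needs.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with SIE): print the image tag for
# "gpu", "cpu", or auto-detect when no argument is given.
pick_sie_image() {
  mode="${1:-auto}"
  if [ "$mode" = "auto" ]; then
    # Assumption: a working nvidia-smi on the host means a usable NVIDIA GPU.
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
      mode="gpu"
    else
      mode="cpu"
    fi
  fi
  case "$mode" in
    gpu) echo "ghcr.io/superlinked/sie-server:latest-cuda12-default" ;;
    cpu) echo "ghcr.io/superlinked/sie-server:latest-cpu-default" ;;
    *)   echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}
```

On a CPU-only host this could be used as `docker run -p 8080:8080 "$(pick_sie_image cpu)"`; on a GPU host, add `--gpus all` and pass `gpu` or no argument.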

# Persistent model cache (avoids re-downloading on restart)
docker run --gpus all \
  -p 8080:8080 \
  -v ~/.cache/sie:/root/.cache/sie \
  ghcr.io/superlinked/sie-server:default

# Custom port
docker run --gpus all -p 3000:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default

# Specific models only (faster startup)
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default \
  sie-server serve -m BAAI/bge-m3,BAAI/bge-reranker-v2-m3

# Persistent model cache (skip re-downloads)
docker run --gpus all -p 8080:8080 \
  -v ~/.cache/huggingface:/app/.cache/huggingface \
  ghcr.io/superlinked/sie-server:latest-cuda12-default

# Different bundle (e.g. SGLang backend for large LLM embeddings)
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-sglang

See the full Docker deployment guide.


| Component | Minimum | Recommended |
|---|---|---|
| CPU | 4 cores | 8+ cores |
| RAM | 8GB | 16GB+ |
| GPU | Optional | Any NVIDIA with 16GB+ VRAM |
| Disk | 20GB | 100GB+ for model cache |

| GPU | VRAM | Best for |
|---|---|---|
| T4 | 16GB | Development, light production |
| L4 | 24GB | Standard production (recommended starting point) |
| A100 40GB | 40GB | High-throughput or large model serving |
| A100 80GB | 80GB | 7B+ parameter models |
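The minimum column above (4 cores, 8GB RAM, 20GB disk) can be checked on a Linux host before pulling the image. This is a rough sketch, not an official sizing tool: the function names are hypothetical, the host readers assume `nproc`, `/proc/meminfo`, and POSIX `df`, and RAM is rounded to the nearest GB because the kernel reports slightly less than the installed amount.

```shell
#!/bin/sh
# meets_sie_minimum CORES RAM_GB DISK_GB
# Succeeds when all three values meet the documented minimums (4 / 8 / 20).
meets_sie_minimum() {
  cores="$1"; ram_gb="$2"; disk_gb="$3"
  [ "$cores" -ge 4 ] && [ "$ram_gb" -ge 8 ] && [ "$disk_gb" -ge 20 ]
}

# Host readers (Linux only; hypothetical helper names).
host_cores() { nproc; }

# Total RAM in GB, rounded to the nearest integer.
host_ram_gb() {
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 + 0.5 }' /proc/meminfo
}

# Free space in GB (truncated) on the filesystem holding PATH (default: .).
host_disk_gb() {
  df -Pk "${1:-.}" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}
```

A typical call would be `meets_sie_minimum "$(host_cores)" "$(host_ram_gb)" "$(host_disk_gb)" && echo "host meets the minimum"`. Pass the directory you plan to mount as the model cache to `host_disk_gb` so the check covers the right filesystem.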

See Hardware and Capacity for full sizing guidance.


Move from Docker to Kubernetes when you need:

  • Autoscaling to handle traffic spikes by spinning up additional workers
  • Scale-to-zero to save costs by scaling down during idle periods
  • High availability with multiple replicas to survive node failures
  • Multi-region deployment to serve users in different geographies

Note: Kubernetes clusters with scale-to-zero have cold start times of 5 to 7 minutes. Use wait_for_capacity=True in the Python SDK (or waitForCapacity: true in TypeScript) to handle this gracefully. See Scale-from-Zero and Autoscaling.
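For shell scripts or smoke tests that call the service without the SDK, the same idea can be approximated with a retry loop whose deadline exceeds the cold start window. This is a generic sketch, not how the SDK flag is implemented: `retry_until_deadline` is a hypothetical helper, and the command it retries is whatever request your script already makes.

```shell
#!/bin/sh
# retry_until_deadline TIMEOUT_S INTERVAL_S CMD [ARGS...]
# Re-runs CMD until it succeeds or TIMEOUT_S seconds have elapsed.
# Returns 0 on the first success, 1 if the deadline passes first.
retry_until_deadline() {
  timeout_s="$1"; interval_s="$2"; shift 2
  start=$(date +%s)
  while :; do
    if "$@"; then
      return 0
    fi
    now=$(date +%s)
    if [ $((now - start)) -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
  done
}
```

With a 5 to 7 minute cold start, a deadline of around 480 seconds leaves some margin, for example `retry_until_deadline 480 15 curl -fsS -o /dev/null "$SIE_URL"`, where `SIE_URL` is a placeholder for the endpoint your script calls.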


These requirements apply to any Kubernetes install path. The Terraform examples for GCP and AWS provision a cluster that satisfies all of them. Operators using helm install against an existing cluster must confirm each item first.

  • Kubernetes 1.29 or newer. The AWS Terraform example pins to 1.35; the GCP example follows the cluster’s release channel. Older versions are untested.
  • Worker nodes with NVIDIA GPUs (L4, A100 40GB, or A100 80GB). CPU-only worker pools exist for local testing but are not a supported production target.
  • NVIDIA device plugin installed and exposing nvidia.com/gpu as a schedulable resource. GKE ships this on GPU node pools automatically; EKS does not.
  • Node disk ≥ 350Gi per GPU node. Workers cache models in a 300Gi emptyDir (no PVC, no storage class needed for the cache itself).
  • Ingress controller. The chart defaults to ingressClassName: nginx. Install ingress-nginx if you plan to expose the gateway publicly. Port-forward works for smoke tests and internal-only setups.
  • cert-manager (optional). Required only if you want the chart to issue Let’s Encrypt certificates via HTTP-01. BYO TLS via a kubernetes.io/tls Secret is also supported and is the default.
  • Storage class. Only matters if you enable the sie-config PVC (1Gi, default off). The cluster default class is fine.
  • KEDA, Prometheus, Loki, Alloy, DCGM Exporter. Packaged as optional sub-charts (keda.install=true, kube-prometheus-stack.install=true, etc.). Skip them for a minimal smoke test; enable for autoscaling and observability.
  • Workload Identity (GCP) or IRSA (AWS) bound to a service account named sie-server in the SIE release namespace. This is how worker pods read the model cache bucket (GCS or S3) without static credentials. The Terraform examples create and bind this for you.
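When installing against an existing cluster, a few of the items above can be spot-checked with `kubectl`. The following is a partial preflight sketch under assumptions: `k8s_version_ok` and `sie_preflight` are hypothetical names, the version check uses each node's kubelet version as a proxy for the cluster version, and only the Kubernetes version, GPU resource, and default ingress class are covered.

```shell
#!/bin/sh
# k8s_version_ok VERSION
# Succeeds if VERSION (e.g. "v1.29.4-gke.100") is 1.29 or newer.
k8s_version_ok() {
  v="${1#v}"                 # drop the leading "v"
  major="${v%%.*}"           # text before the first dot, e.g. "1"
  rest="${v#*.}"             # text after the first dot, e.g. "29.4-gke.100"
  minor="${rest%%[!0-9]*}"   # leading digits of the rest, e.g. "29"
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 29 ]; }
}

# sie_preflight: print warnings and current state for manual review.
sie_preflight() {
  # Warn about any node whose kubelet is older than 1.29.
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' |
  while read -r node version; do
    k8s_version_ok "$version" || echo "WARN: $node runs $version (older than 1.29)"
  done

  # GPU nodes should show a number under GPU; "<none>" means the
  # NVIDIA device plugin is not exposing nvidia.com/gpu on that node.
  kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

  # The chart defaults to ingressClassName: nginx.
  kubectl get ingressclass nginx || echo "WARN: no ingress class named nginx"
}
```

Run `sie_preflight` with your kubeconfig pointed at the target cluster. The remaining items (node disk, cert-manager, identity binding) still need to be confirmed by hand.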

The cluster must reach:

  • ghcr.io for chart images (sie-gateway, sie-server, sie-config) and the OCI chart itself
  • huggingface.co for model weights on first request (unless you pre-populate a cluster cache bucket via sie-admin cache weights sync)
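Outbound access to both hosts can be probed with `curl`. This is a minimal sketch with hypothetical helper names; it only confirms that an HTTPS connection completes (any HTTP status counts as reachable), and it is meaningful only when run from inside the cluster network, for example from a debug pod, since workstation egress rules often differ.

```shell
#!/bin/sh
# check_reachable HOST
# Succeeds if an HTTPS request to HOST completes within 10 seconds.
# Any HTTP response (including 401 or 404) counts as reachable.
check_reachable() {
  curl -sS -o /dev/null --max-time 10 "https://$1/" 2>/dev/null
}

# check_sie_egress: report on the two hosts listed above.
check_sie_egress() {
  for host in ghcr.io huggingface.co; do
    if check_reachable "$host"; then
      echo "ok   $host"
    else
      echo "FAIL $host"
    fi
  done
}
```

Calling `check_sie_egress` prints one line per host.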

Air-gapped environments must mirror both registries and configure workers.common.clusterCache.url to a pre-populated S3 or GCS bucket.
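The cache setting is an ordinary chart value, so it can be passed to Helm with `--set`. The sketch below builds that flag and rejects obviously wrong input. It assumes the value takes an `s3://` or `gs://` URL, which this page does not spell out; `cluster_cache_flag` is a hypothetical helper, and the bucket names shown are placeholders.

```shell
#!/bin/sh
# cluster_cache_flag URL
# Print the --set flag for workers.common.clusterCache.url.
# Assumption: the chart expects an s3:// or gs:// URL here.
cluster_cache_flag() {
  case "$1" in
    s3://*|gs://*)
      printf '%s\n' "--set workers.common.clusterCache.url=$1"
      ;;
    *)
      echo "expected an s3:// or gs:// URL, got: $1" >&2
      return 1
      ;;
  esac
}
```

It could be combined with an install as `helm upgrade --install "$RELEASE" "$CHART_REF" $(cluster_cache_flag s3://example-bucket/sie-cache)`, where `RELEASE` and `CHART_REF` are placeholders for your release name and the mirrored chart location.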

  • HF_TOKEN required for gated HuggingFace models (e.g. google/embeddinggemma-300m, naver/splade-v3). Optional for the BAAI/bge-m3 smoke test.

For cloud-account-level requirements (GCP project, GPU quotas, IAM roles, API enablement), see the Prerequisites section on the GCP or AWS page.


Can SIE run without a GPU? Yes. SIE runs on CPU and works well for development and low-traffic workloads. For production inference at scale, a GPU is strongly recommended, especially for batch encoding. See Hardware and Capacity.

How do I monitor a SIE deployment? SIE exposes Prometheus metrics and structured logs. See Monitoring and Observability for dashboards, alerting, and log configuration.

How do I tune SIE for better performance? The main levers are batch size, worker concurrency, and model preloading. See Performance Tuning for a step-by-step guide.

How do I upgrade SIE without downtime? See the Upgrade Runbook for rolling upgrade procedures on both Docker and Kubernetes.

Is there a managed cloud option? Superlinked offers managed SIE deployments for teams that do not want to manage infrastructure themselves. Contact us to learn more.
