How to deploy SIE

SIE has two deployment paths. Use Docker for a single sie-server with no external SIE services. Use Kubernetes when you need the clustered runtime: sie-gateway, sie-config, NATS JetStream, and GPU worker pods. Each worker pod runs the SIE server sidecar beside the Python sie-server adapter container.

Which Deployment Path Should I Use?

Use Docker if:

You are running on a single server or VM
You are in development or running a low-traffic service
You want the simplest possible setup

Use Kubernetes if:

You need horizontal scaling or autoscaling to zero
You need high availability across multiple nodes
You are deploying on GCP or AWS with GPU node pools

	Docker	Kubernetes
Setup time	Minutes	Hours
Scaling	Manual	Automatic
High availability	No	Yes
Scale-to-zero	No	Yes
Best for	Dev, single-server	Production, high traffic

See Kubernetes on GCP and Kubernetes on AWS for cloud-specific guides.

Getting Started With Docker

The fastest way to run SIE is a single docker run:

# CPU only
docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default

# With GPU (recommended)
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default

The server starts on port 8080. Models load on first request with no pre-configuration needed.

Common Options

# Persistent model cache (avoids re-downloading on restart)
docker run --gpus all \
  -p 8080:8080 \
  -v ~/.cache/sie:/root/.cache/sie \
  ghcr.io/superlinked/sie-server:latest-cuda12-default

# Custom port
docker run --gpus all -p 3000:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default

# Specific models only (faster startup)
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default \
  sie-server serve -m BAAI/bge-m3,BAAI/bge-reranker-v2-m3

# Persistent Hugging Face cache (skip re-downloading weights)
docker run --gpus all -p 8080:8080 \
  -v ~/.cache/huggingface:/app/.cache/huggingface \
  ghcr.io/superlinked/sie-server:latest-cuda12-default

# Different bundle (e.g. SGLang backend for large LLM embeddings)
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-sglang
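
The options above are ordinary Docker flags, so they can presumably be combined in one invocation. As a minimal sketch, the hypothetical helper below (sie_run_cmd is not part of SIE) prints a command that combines a custom host port, the cache mount from the first example, and an optional model list. It assumes the GPU image and the sie-server serve -m form shown above, and it only prints the command; nothing is executed.

```shell
# sie_run_cmd [HOST_PORT] [MODELS]
# Hypothetical helper, not part of SIE: prints a docker run command that
# combines the options shown above. Nothing is executed.
sie_run_cmd() {
  host_port=${1:-8080}   # host side of the mapping; the container port stays 8080
  models=${2:-}          # optional comma-separated model list
  image="ghcr.io/superlinked/sie-server:latest-cuda12-default"
  cmd="docker run --gpus all -p ${host_port}:8080"
  cmd="$cmd -v $HOME/.cache/sie:/root/.cache/sie $image"
  if [ -n "$models" ]; then
    # Restrict the server to the listed models, as in the example above.
    cmd="$cmd sie-server serve -m $models"
  fi
  printf '%s\n' "$cmd"
}

# Example: review the command first, then run it by piping to sh.
# sie_run_cmd 3000 BAAI/bge-m3,BAAI/bge-reranker-v2-m3
```

Printing instead of executing keeps the helper easy to inspect. The printed command is not quoted, so it will break if $HOME contains spaces.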

See the full Docker deployment guide.

What Hardware Does SIE Need?

Minimum and Recommended Specs

Component	Minimum	Recommended
CPU	4 cores	8+ cores
RAM	8GB	16GB+
GPU	Optional	Any NVIDIA with 16GB+ VRAM
Disk	20GB	100GB+ for model cache

GPU Recommendations by Workload

GPU	VRAM	Best for
T4	16GB	Development, light production
L4	24GB	Standard production (recommended starting point)
A100 40GB	40GB	High-throughput or large model serving
A100 80GB	80GB	7B+ parameter models

See Hardware and Capacity for full sizing guidance.

When Should I Move to Kubernetes?

Move from Docker to Kubernetes when you need:

Autoscaling to handle traffic spikes by spinning up additional workers
Scale-to-zero to save costs by scaling down during idle periods
High availability with multiple replicas to survive node failures
Multi-region deployment to serve users in different geographies

Note: A cluster that has scaled to zero has a cold start time of 5 to 7 minutes. Set wait_for_capacity=True in the Python SDK (or waitForCapacity: true in TypeScript) so that the client waits for capacity during the cold start. See Scale-from-Zero and Autoscaling.
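
For callers that do not go through the SDK, a cold start can be absorbed by retrying until a deadline. The sketch below is a generic shell retry loop, not the SDK's implementation. The function name retry_for, the SIE_GATEWAY_URL variable, and the 420-second budget (7 minutes) in the usage comment are illustrative assumptions. The usage comment also assumes the gateway refuses connections or returns an error status until a worker is ready.

```shell
# retry_for DEADLINE_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds, or until roughly DEADLINE_SECONDS have
# been spent sleeping between attempts. Returns 0 on success, 1 on timeout.
retry_for() {
  deadline=$1
  interval=$2
  shift 2
  waited=0
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$waited" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Example (placeholder URL): allow up to 7 minutes for workers to appear.
# retry_for 420 15 curl --silent --fail --output /dev/null "$SIE_GATEWAY_URL"
```

The deadline counts only the time spent sleeping, so slow attempts can stretch the total wall-clock time beyond it.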

Kubernetes Cluster Prerequisites

These requirements apply to any Kubernetes install path. The Terraform examples for GCP and AWS provision a cluster that satisfies all of them. Operators using helm install against an existing cluster must confirm each item first.

Cluster

Kubernetes 1.29 or newer. The AWS Terraform example pins to 1.35; the GCP example follows the cluster’s release channel. Older versions are untested.
Worker nodes with NVIDIA GPUs (L4, A100 40GB, or A100 80GB). CPU-only worker pools exist for local testing but are not a supported production target.
NVIDIA device plugin installed and exposing nvidia.com/gpu as a schedulable resource. GKE ships this on GPU node pools automatically; EKS does not.
At least 350Gi of node disk per GPU node. Workers cache models in a 300Gi emptyDir, so the cache itself needs no PVC and no storage class.
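
On an existing cluster, the version, GPU resource, and node disk items can be spot-checked with kubectl. The sketch below is a rough preflight, not an official SIE tool. Both function names are hypothetical, and each node's kubelet version is used as a proxy for the cluster version. Allocatable GPU and ephemeral-storage figures are only printed, for manual comparison against the 350Gi requirement.

```shell
# min_k8s_ok VERSION
# Succeeds when VERSION (e.g. "v1.29.3-gke.100") is 1.29 or newer.
min_k8s_ok() {
  v=${1#v}                  # strip the leading "v"
  major=${v%%.*}            # text before the first dot
  rest=${v#*.}              # text after the first dot
  minor=${rest%%.*}         # text before the next dot
  minor=${minor%%[!0-9]*}   # drop suffixes such as "+" or "-gke"
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 29 ]; }
}

# sie_preflight: requires kubectl access to the target cluster.
sie_preflight() {
  # Flag any node whose kubelet is older than 1.29.
  for v in $(kubectl get nodes \
      -o jsonpath='{.items[*].status.nodeInfo.kubeletVersion}'); do
    min_k8s_ok "$v" || echo "node version $v is older than 1.29"
  done
  # GPU shows <none> on nodes where nvidia.com/gpu is not exposed.
  kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu,DISK:.status.allocatable.ephemeral-storage'
}
```

A node that shows no GPU value usually points to a missing NVIDIA device plugin, which is the item EKS operators have to install themselves.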

In-cluster components

Ingress controller. The chart defaults to ingressClassName: nginx. Install ingress-nginx if you plan to expose the gateway publicly. Port-forward works for smoke tests and internal-only setups.
cert-manager (optional). Required only if you want the chart to issue Let’s Encrypt certificates via HTTP-01. The default is bring-your-own TLS via a kubernetes.io/tls Secret.
Storage class. Only matters if you enable the sie-config PVC (1Gi, default off). The cluster default class is fine.
KEDA, Prometheus, Loki, Alloy, DCGM Exporter. Packaged as optional sub-charts (keda.install=true, kube-prometheus-stack.install=true, etc.). Skip them for a minimal smoke test; enable them for autoscaling and observability.
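
The sub-chart toggles are plain Helm values, so they can be passed with --set. As a small sketch, the hypothetical helper below turns a list of sub-chart names into the matching flags. Only the two value keys named above are used in the example, since the keys for the remaining sub-charts are not given here. The release name and chart reference are placeholders.

```shell
# subchart_flags NAME [NAME...]
# Prints one "--set NAME.install=true" flag per sub-chart name given.
subchart_flags() {
  flags=""
  for name in "$@"; do
    # Add a separating space only when flags is already non-empty.
    flags="${flags:+$flags }--set ${name}.install=true"
  done
  printf '%s\n' "$flags"
}

# Example; <release> and <chart-ref> are placeholders for your own values:
# helm upgrade --install <release> <chart-ref> $(subchart_flags keda kube-prometheus-stack)
```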

Cluster identity

Workload Identity (GCP) or IRSA (AWS) bound to a service account named sie-server in the SIE release namespace. This is how worker pods read the model cache bucket (GCS or S3) without static credentials. The Terraform examples create and bind this for you.

Network egress

The cluster must reach:

ghcr.io for chart images (sie-gateway, sie-server, sie-server-sidecar, sie-config) and the OCI chart itself
huggingface.co for model weights on first request (unless you pre-populate a cluster cache bucket via sie-admin cache weights sync)

Air-gapped environments must mirror both sources and point workers.common.clusterCache.url at a pre-populated S3 or GCS bucket.
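
Egress to the two hosts listed above can be spot-checked with curl. The sketch below is an assumption-based check, not an SIE tool. Both function names are hypothetical, and the URLs are simply the registry API root and the Hugging Face home page. The result is only meaningful when run from inside the cluster (for example, from a temporary pod), because a workstation may have different network access.

```shell
# can_reach URL
# Succeeds if any HTTP response arrives within 10 seconds. Status codes are
# ignored (an unauthenticated registry request may well return 401); only
# DNS, TCP, or TLS failures and timeouts count as unreachable.
can_reach() {
  curl --silent --output /dev/null --max-time 10 "$1"
}

# check_sie_egress: reports reachability of the two hosts named above.
check_sie_egress() {
  status=0
  for url in https://ghcr.io/v2/ https://huggingface.co/; do
    if can_reach "$url"; then
      echo "ok   $url"
    else
      echo "FAIL $url"
      status=1
    fi
  done
  return "$status"
}
```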

Tokens and secrets

HF_TOKEN is required for gated Hugging Face models (e.g. google/embeddinggemma-300m, naver/splade-v3). It is optional for the BAAI/bge-m3 smoke test.

For cloud-account-level requirements (GCP project, GPU quotas, IAM roles, API enablement), see the Prerequisites section on the GCP or AWS page.

Frequently Asked Questions

Can SIE run without a GPU? Yes. SIE runs on CPU and works well for development and low-traffic workloads. For production inference at scale, a GPU is strongly recommended, especially for batch encoding. See Hardware and Capacity.

How do I monitor a SIE deployment? SIE exposes Prometheus metrics and structured logs. See Monitoring and Observability for dashboards, alerting, and log configuration.

How do I tune SIE for better performance? The main levers are batch size, worker concurrency, and model preloading. See Performance Tuning for a step-by-step guide.

How do I upgrade SIE without downtime? See the Upgrade Runbook for rolling upgrade procedures on both Docker and Kubernetes.

Is there a managed cloud option? Superlinked offers managed SIE deployments for teams that do not want to manage infrastructure themselves. Contact us to learn more.