How to deploy SIE
SIE deploys as a single container with no external dependencies. There are two deployment paths: Docker for simplicity, and Kubernetes for scaling and high availability. Both use the same image. There is no separate production build.
Which Deployment Path Should I Use?
Use Docker if:
- You are running on a single server or VM
- You are in development or running a low-traffic service
- You want the simplest possible setup
Use Kubernetes if:
- You need horizontal scaling or autoscaling to zero
- You need high availability across multiple nodes
- You are deploying on GCP or AWS with GPU node pools
| | Docker | Kubernetes |
|---|---|---|
| Setup time | Minutes | Hours |
| Scaling | Manual | Automatic |
| High availability | No | Yes |
| Scale-to-zero | No | Yes |
| Best for | Dev, single-server | Production, high traffic |
See Kubernetes on GCP and Kubernetes on AWS for cloud-specific guides.
Getting Started With Docker
The fastest way to run SIE is a single docker run:

    # CPU only
    docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default

    # With GPU (recommended)
    docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default

The server starts on port 8080. Models load on first request with no pre-configuration needed.
Common Options

    # Persistent model cache (avoids re-downloading on restart)
    docker run --gpus all \
      -p 8080:8080 \
      -v ~/.cache/sie:/root/.cache/sie \
      ghcr.io/superlinked/sie-server:default

    # Custom port
    docker run --gpus all -p 3000:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default

    # Specific models only (faster startup)
    docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default \
      sie-server serve -m BAAI/bge-m3,BAAI/bge-reranker-v2-m3

    # Persistent model cache (skip re-downloads)
    docker run --gpus all -p 8080:8080 \
      -v ~/.cache/huggingface:/app/.cache/huggingface \
      ghcr.io/superlinked/sie-server:latest-cuda12-default

    # Different bundle (e.g. SGLang backend for large LLM embeddings)
    docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-sglang

See the full Docker deployment guide.
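The options above can be combined. As a minimal sketch, the helper below prints a `docker run` command for a given accelerator, bundle, and host port. `sie_run_cmd` is a hypothetical name, not part of SIE, and the sketch assumes image tags follow the `latest-<accelerator>-<bundle>` pattern seen in the examples above. Not every combination is guaranteed to exist as a published tag. It only prints the command, so the result can be reviewed before it is run.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a docker command for SIE.
# Assumes tags follow latest-<accelerator>-<bundle>, as in the examples above.
sie_run_cmd() {
  accel="$1"              # "cpu" or "cuda12"
  bundle="${2:-default}"  # e.g. "default" or "sglang"
  host_port="${3:-8080}"  # host port mapped to the container's port 8080

  # Only GPU images need the --gpus flag.
  gpu_flag=""
  if [ "$accel" != "cpu" ]; then
    gpu_flag="--gpus all "
  fi

  echo "docker run ${gpu_flag}-p ${host_port}:8080 ghcr.io/superlinked/sie-server:latest-${accel}-${bundle}"
}

sie_run_cmd cpu
sie_run_cmd cuda12 sglang 3000
```

Printing instead of executing keeps the helper safe to try on a machine without Docker or a GPU.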
What Hardware Does SIE Need?
Minimum Specs
| Component | Minimum | Recommended |
|---|---|---|
| CPU | 4 cores | 8+ cores |
| RAM | 8GB | 16GB+ |
| GPU | Optional | Any NVIDIA with 16GB+ VRAM |
| Disk | 20GB | 100GB+ for model cache |
GPU Recommendations by Workload
| GPU | VRAM | Best for |
|---|---|---|
| T4 | 16GB | Development, light production |
| L4 | 24GB | Standard production (recommended starting point) |
| A100 40GB | 40GB | High-throughput or large model serving |
| A100 80GB | 80GB | 7B+ parameter models |
See Hardware and Capacity for full sizing guidance.
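As a rough illustration of reading the table by VRAM alone, the sketch below maps an estimated memory need (a whole number of GB that you supply) to the smallest listed GPU that fits it. `pick_gpu` is a hypothetical helper, and it deliberately ignores throughput, so treat the result as a starting point and not as sizing advice.

```shell
#!/bin/sh
# Hypothetical helper: map an estimated VRAM need (whole GB) to the
# smallest GPU in the table above with at least that much memory.
# The estimate is an input you provide; nothing is measured here.
pick_gpu() {
  need_gb="$1"
  if [ "$need_gb" -le 16 ]; then
    echo "T4"
  elif [ "$need_gb" -le 24 ]; then
    echo "L4"
  elif [ "$need_gb" -le 40 ]; then
    echo "A100 40GB"
  elif [ "$need_gb" -le 80 ]; then
    echo "A100 80GB"
  else
    echo "no single GPU in the table fits ${need_gb}GB" >&2
    return 1
  fi
}

pick_gpu 20
```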
When Should I Move to Kubernetes?
Move from Docker to Kubernetes when you need:
- Autoscaling to handle traffic spikes by spinning up additional workers
- Scale-to-zero to save costs by scaling down during idle periods
- High availability with multiple replicas to survive node failures
- Multi-region deployment to serve users in different geographies
Note: Kubernetes clusters with scale-to-zero have cold start times of 5 to 7 minutes. Use wait_for_capacity=True in the Python SDK (or waitForCapacity: true in TypeScript) to handle this gracefully. See Scale-from-Zero and Autoscaling.
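For clients that call SIE over plain HTTP instead of through an SDK, a similar effect can be approximated with a retry loop. The sketch below is a generic shell helper, not a description of how the SDK implements `wait_for_capacity`. `retry_until` and the `SIE_URL` variable in the usage comment are hypothetical names, and the 600-second budget is an assumption chosen to exceed the 5 to 7 minute cold start.

```shell
#!/bin/sh
# Retry a command until it succeeds or a time budget is used up.
# Usage: retry_until <timeout_seconds> <interval_seconds> <command...>
# The budget is approximate: it counts sleep time only, not the time
# the command itself takes to run.
retry_until() {
  timeout="$1"
  interval="$2"
  shift 2
  elapsed=0
  while ! "$@"; do
    elapsed=$((elapsed + interval))
    if [ "$elapsed" -gt "$timeout" ]; then
      echo "gave up after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example (SIE_URL is a placeholder for an endpoint of your deployment):
#   retry_until 600 15 curl --fail --silent --output /dev/null "$SIE_URL"
```

A fixed interval keeps the sketch short. A production client would more likely use backoff with jitter.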
Kubernetes Cluster Prerequisites
These requirements apply to any Kubernetes install path. The Terraform examples for GCP and AWS provision a cluster that satisfies all of them. Operators using helm install against an existing cluster must confirm each item first.
Cluster
- Kubernetes 1.29 or newer. The AWS Terraform example pins to 1.35; the GCP example follows the cluster’s release channel. Older versions are untested.
- Worker nodes with NVIDIA GPUs (L4, A100 40GB, or A100 80GB). CPU-only worker pools exist for local testing but are not a supported production target.
- NVIDIA device plugin installed and exposing `nvidia.com/gpu` as a schedulable resource. GKE ships this on GPU node pools automatically; EKS does not.
- Node disk ≥ 350Gi per GPU node. Workers cache models in a 300Gi `emptyDir` (no PVC, no storage class needed for the cache itself).
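The version and GPU-resource requirements above can be checked from a workstation with `kubectl`. The sketch below is one way to do it, not an official preflight script. `k8s_version_ok` is a hypothetical helper that parses a version string such as `v1.31.2-gke.100`, and the commented commands assume `kubectl` and `jq` are installed and pointed at the target cluster.

```shell
#!/bin/sh
# Return success if a Kubernetes version string (e.g. "v1.31.2-gke.100")
# is 1.29 or newer.
k8s_version_ok() {
  v="${1#v}"                 # strip the leading "v"
  major="${v%%.*}"           # text before the first dot
  rest="${v#*.}"             # text after the first dot
  minor="${rest%%[!0-9]*}"   # leading digits of the minor component
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 29 ]; }
}

if k8s_version_ok "v1.31.2"; then
  echo "version ok"
fi

# Against a live cluster (requires kubectl and jq):
#   server_version=$(kubectl version -o json | jq -r '.serverVersion.gitVersion')
#   k8s_version_ok "$server_version" || echo "cluster is older than 1.29"
#
# List allocatable GPUs per node; "<none>" means the device plugin is
# not exposing nvidia.com/gpu on that node:
#   kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```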
In-cluster components
- Ingress controller. The chart defaults to `ingressClassName: nginx`. Install ingress-nginx if you plan to expose the gateway publicly. Port-forward works for smoke tests and internal-only setups.
- cert-manager (optional). Required only if you want the chart to issue Let’s Encrypt certificates via HTTP-01. BYO TLS via a `kubernetes.io/tls` Secret is also supported and is the default.
- Storage class. Only matters if you enable the `sie-config` PVC (1Gi, default off). The cluster default class is fine.
- KEDA, Prometheus, Loki, Alloy, DCGM Exporter. Packaged as optional sub-charts (`keda.install=true`, `kube-prometheus-stack.install=true`, etc.). Skip them for a minimal smoke test; enable for autoscaling and observability.
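The optional sub-charts are switched on with `--set <name>.install=true` values at install time. As a small sketch, the hypothetical helper `subchart_flags` below builds those flags from a list of sub-chart names. Only `keda` and `kube-prometheus-stack` are named in this guide, so check the chart's values for the exact keys of the other components.

```shell
#!/bin/sh
# Hypothetical helper: print "--set <name>.install=true" for each
# sub-chart name given, separated by single spaces.
subchart_flags() {
  flags=""
  for name in "$@"; do
    # ${flags:+ } adds a separating space only after the first flag.
    flags="${flags}${flags:+ }--set ${name}.install=true"
  done
  echo "$flags"
}

subchart_flags keda kube-prometheus-stack

# Example use with Helm (RELEASE and CHART_REF are placeholders for
# your release name and the chart reference):
#   helm upgrade --install "$RELEASE" "$CHART_REF" \
#     $(subchart_flags keda kube-prometheus-stack)
```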
Cluster identity
- Workload Identity (GCP) or IRSA (AWS) bound to a service account named `sie-server` in the SIE release namespace. This is how worker pods read the model cache bucket (GCS or S3) without static credentials. The Terraform examples create and bind this for you.
Network egress
The cluster must reach:
- `ghcr.io` for chart images (`sie-gateway`, `sie-server`, `sie-config`) and the OCI chart itself
- `huggingface.co` for model weights on first request (unless you pre-populate a cluster cache bucket via `sie-admin cache weights sync`)
Air-gapped environments must mirror both registries and point `workers.common.clusterCache.url` at a pre-populated S3 or GCS bucket.
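Before installing, it can help to confirm that both hosts are reachable. The sketch below is a generic connectivity check with `curl`, using the hypothetical helper names `probe` and `check_egress`. It treats any HTTP response as success, because the goal is only to test egress, not authentication. Running it from a pod inside the cluster gives a more representative result than running it from a workstation.

```shell
#!/bin/sh
# Probe one host over HTTPS. Any HTTP response (even 401 or 404) counts
# as reachable; only connection-level failures return non-zero.
probe() {
  curl --silent --head --max-time 10 --output /dev/null "https://$1"
}

# Check every host listed above; return non-zero if any is unreachable.
check_egress() {
  failed=0
  for host in ghcr.io huggingface.co; do
    if probe "$host"; then
      echo "ok      $host"
    else
      echo "FAILED  $host"
      failed=1
    fi
  done
  return "$failed"
}

# Usage: call check_egress and inspect its exit status.
```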
Tokens and secrets
- `HF_TOKEN` required for gated HuggingFace models (e.g. `google/embeddinggemma-300m`, `naver/splade-v3`). Optional for the `BAAI/bge-m3` smoke test.
For cloud-account-level requirements (GCP project, GPU quotas, IAM roles, API enablement), see the Prerequisites section on the GCP or AWS page.
Frequently Asked Questions
Can SIE run without a GPU? Yes. SIE runs on CPU and works well for development and low-traffic workloads. For production inference at scale, a GPU is strongly recommended, especially for batch encoding. See Hardware and Capacity.
How do I monitor a SIE deployment? SIE exposes Prometheus metrics and structured logs. See Monitoring and Observability for dashboards, alerting, and log configuration.
How do I tune SIE for better performance? The main levers are batch size, worker concurrency, and model preloading. See Performance Tuning for a step-by-step guide.
How do I upgrade SIE without downtime? See the Upgrade Runbook for rolling upgrade procedures on both Docker and Kubernetes.
Is there a managed cloud option? Superlinked offers managed SIE deployments for teams that do not want to manage infrastructure themselves. Contact us to learn more.