How to deploy SIE

SIE deploys as a single container with no external dependencies. There are two deployment paths: Docker for simplicity, and Kubernetes for scaling and high availability. Both use the same image. There is no separate production build.


Use Docker if:

  • You are running on a single server or VM
  • You are in development or running a low-traffic service
  • You want the simplest possible setup

Use Kubernetes if:

  • You need horizontal scaling or autoscaling to zero
  • You need high availability across multiple nodes
  • You are deploying on GCP or AWS with GPU node pools

| | Docker | Kubernetes |
| --- | --- | --- |
| Setup time | Minutes | Hours |
| Scaling | Manual | Automatic |
| High availability | No | Yes |
| Scale-to-zero | No | Yes |
| Best for | Dev, single-server | Production, high traffic |

See Kubernetes on GCP and Kubernetes on AWS for cloud-specific guides.


The fastest way to run SIE is a single docker run:

# CPU only
docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default
# With GPU (recommended)
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default

The server starts on port 8080. Models load on first request with no pre-configuration needed.

# Persist the SIE model cache on the host (avoids re-downloading on restart)
docker run --gpus all \
  -p 8080:8080 \
  -v ~/.cache/sie:/root/.cache/sie \
  ghcr.io/superlinked/sie-server:latest-cuda12-default

# Custom host port (the container still listens on 8080)
docker run --gpus all -p 3000:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default

# Specific models only (faster startup)
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default \
  sie-server serve -m BAAI/bge-m3,BAAI/bge-reranker-v2-m3

# Persist the Hugging Face cache on the host (skips re-downloads)
docker run --gpus all -p 8080:8080 \
  -v ~/.cache/huggingface:/app/.cache/huggingface \
  ghcr.io/superlinked/sie-server:latest-cuda12-default

# Different bundle (e.g. SGLang backend for large LLM embeddings)
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-sglang
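
The commands above differ only in the image tag, the GPU flag, and the host port. As a rough sketch of how those pieces fit together, the helper below prints (but does not run) a matching command. The function name `sie_docker_cmd` is hypothetical, and the tag pattern `latest-<variant>-<bundle>` is inferred from the tags shown on this page; not every combination is guaranteed to be published.

```shell
# Hypothetical helper: print (do not run) a docker run command for SIE.
# Usage: sie_docker_cmd <cpu|cuda12> <bundle> <host_port>
sie_docker_cmd() {
  variant=$1
  bundle=$2
  host_port=$3

  # Assumption: tags follow latest-<variant>-<bundle>, as in the examples above.
  image="ghcr.io/superlinked/sie-server:latest-${variant}-${bundle}"

  # Only request GPUs for CUDA images.
  gpu_flag=""
  case "$variant" in
    cuda*) gpu_flag="--gpus all " ;;
  esac

  # The container listens on 8080; only the host side of the mapping changes.
  echo "docker run ${gpu_flag}-p ${host_port}:8080 ${image}"
}

sie_docker_cmd cuda12 default 3000
```

Printing the command first makes it easy to review before running it, or to paste into a script once the tag has been confirmed to exist.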

See the full Docker deployment guide.


| Component | Minimum | Recommended |
| --- | --- | --- |
| CPU | 4 cores | 8+ cores |
| RAM | 8 GB | 16 GB+ |
| GPU | Optional | Any NVIDIA GPU with 16 GB+ VRAM |
| Disk | 20 GB | 100 GB+ for model cache |

| GPU | VRAM | Best for |
| --- | --- | --- |
| T4 | 16 GB | Development, light production |
| L4 | 24 GB | Standard production (recommended starting point) |
| A100 40GB | 40 GB | High-throughput or large model serving |
| A100 80GB | 80 GB | 7B+ parameter models |
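
One simple way to read the GPU table is to pick the smallest card whose VRAM covers what your models need. The sketch below encodes only the VRAM column; the function name `pick_gpu` is hypothetical, and the required VRAM figure is something you supply yourself. It ignores throughput and cost, so treat the result as a starting point rather than a sizing recommendation.

```shell
# Hypothetical helper: print the smallest GPU from the table above
# whose VRAM (in GB) is at least the requested amount.
# Usage: pick_gpu <required_vram_gb>
pick_gpu() {
  need=$1
  if [ "$need" -le 16 ]; then
    echo "T4"
  elif [ "$need" -le 24 ]; then
    echo "L4"
  elif [ "$need" -le 40 ]; then
    echo "A100 40GB"
  elif [ "$need" -le 80 ]; then
    echo "A100 80GB"
  else
    # Nothing in the table is large enough for this requirement.
    echo "no single GPU in the table has ${need} GB of VRAM" >&2
    return 1
  fi
}
```

For example, `pick_gpu 20` prints `L4`, which is also the table's recommended starting point for standard production.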

See Hardware and Capacity for full sizing guidance.


Move from Docker to Kubernetes when you need:

  • Autoscaling to handle traffic spikes by spinning up additional workers
  • Scale-to-zero to save costs by scaling down during idle periods
  • High availability with multiple replicas to survive node failures
  • Multi-region deployment to serve users in different geographies

Note: Kubernetes clusters with scale-to-zero have cold start times of 5 to 7 minutes. Use wait_for_capacity=True in the Python SDK (or waitForCapacity: true in TypeScript) to handle this gracefully. See Scale-from-Zero and Autoscaling.
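
For callers that do not use the SDKs (a shell script or a CI job, for example), the same idea can be approximated with a retry loop that keeps trying until a deadline passes. This is a generic sketch, not how the SDK implements `wait_for_capacity`. The function name `retry_until` is hypothetical, and the probe command is whatever request you would normally send.

```shell
# Hypothetical helper: run a command repeatedly until it succeeds
# or the timeout (in seconds) has elapsed.
# Usage: retry_until <timeout_seconds> <interval_seconds> <command> [args...]
retry_until() {
  timeout=$1
  interval=$2
  shift 2
  start=$(date +%s)   # seconds since epoch; supported by GNU, BSD and BusyBox date
  while :; do
    # Success: stop retrying immediately.
    if "$@"; then
      return 0
    fi
    # Failure: give up once the deadline has been reached.
    now=$(date +%s)
    if [ $((now - start)) -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Example (assumptions: SIE_URL is your endpoint, and any successful HTTP
# response counts as "ready"); 420 seconds matches the 7-minute upper bound:
#   retry_until 420 15 curl -fsS -o /dev/null "$SIE_URL"
```

A fixed interval keeps the sketch short. For many concurrent clients, adding jitter or backoff between attempts would reduce load on a cluster that is still scaling up.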


Can SIE run without a GPU?
Yes. SIE runs on CPU and works well for development and low-traffic workloads. For production inference at scale, a GPU is strongly recommended, especially for batch encoding. See Hardware and Capacity.

How do I monitor a SIE deployment?
SIE exposes Prometheus metrics and structured logs. See Monitoring and Observability for dashboards, alerting, and log configuration.

How do I tune SIE for better performance?
The main levers are batch size, worker concurrency, and model preloading. See Performance Tuning for a step-by-step guide.

How do I upgrade SIE without downtime?
See the Upgrade Runbook for rolling upgrade procedures on both Docker and Kubernetes.

Is there a managed cloud option?
Superlinked offers managed SIE deployments for teams that do not want to manage infrastructure themselves. Contact us to learn more.
