Docker

Quick Start

# CPU only
docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default

# With GPU (recommended for production)
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default

Verify the server is running:

curl http://localhost:8080/healthz
# {"status":"ok"}
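
The server may need some time after start before it answers, for example while models load. The helper below is a minimal sketch of a readiness poll against the /healthz endpoint shown above. The function name wait_for_sie and its retry defaults are illustrative assumptions, not part of the server or its tooling.

```shell
# Poll a health endpoint until it answers with an HTTP success status, or give up.
# Usage: wait_for_sie URL [ATTEMPTS] [DELAY_SECONDS]
# (hypothetical helper; name and defaults are assumptions)
wait_for_sie() {
  wfs_url=$1
  wfs_attempts=${2:-30}
  wfs_delay=${3:-2}
  wfs_i=0
  while [ "$wfs_i" -lt "$wfs_attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -s silences progress output.
    if curl -fs --max-time 5 "$wfs_url" >/dev/null 2>&1; then
      return 0
    fi
    wfs_i=$((wfs_i + 1))
    sleep "$wfs_delay"
  done
  return 1
}
```

A typical call would be `wait_for_sie http://localhost:8080/healthz 30 2 && echo "server is ready"`. The function returns 1 if the endpoint never responds within the given number of attempts.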

Image Tags

Image tags follow the format {version}-{platform}-{bundle}. The floating latest version prefix points at the most recent release.

By Platform

Tag	Base	Use Case
`latest-cuda12-default`	CUDA 12	Production with modern NVIDIA GPUs
`latest-cpu-default`	Ubuntu 22.04	Development, ARM64, no GPU

To pin a release, replace latest with a version number, for example v0.2.0-cuda12-default.

By Bundle

Each platform publishes the bundles below. See Bundles for the models each one includes.

Tag	Purpose
`latest-cuda12-default`	All standard models: dense, sparse, ColBERT, vision, extraction, cross-encoders
`latest-cuda12-sglang`	Large LLM embeddings (4B+ params) served through SGLang

Both platforms use the same naming pattern: latest-cpu-default, latest-cpu-sglang, latest-cuda12-default, latest-cuda12-sglang.
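
As a small illustration of the {version}-{platform}-{bundle} format, the sketch below assembles a full image reference from its three parts. The function name sie_image is a hypothetical helper introduced here; the registry path and the example values come from this page.

```shell
# Build a full image reference from the three tag components.
# Usage: sie_image VERSION PLATFORM BUNDLE
# (hypothetical helper, not part of the SIE tooling)
sie_image() {
  printf 'ghcr.io/superlinked/sie-server:%s-%s-%s\n' "$1" "$2" "$3"
}

sie_image latest cpu default
sie_image v0.2.0 cuda12 default
```

The result can be passed straight to Docker, for example `docker pull "$(sie_image v0.2.0 cuda12 default)"`.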

GPU Configuration

Single GPU

# --gpus all exposes every GPU on the host; on a single-GPU machine that is the one device
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default

Specific GPU

# Use GPU 0 only
docker run --gpus '"device=0"' -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default

# Use GPUs 0 and 1
docker run --gpus '"device=0,1"' -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default
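
The nested quoting in the commands above is needed because Docker parses the --gpus value as comma-separated fields, so a device list containing commas has to reach Docker wrapped in literal double quotes. The sketch below builds that value from a list of GPU indices. The function name gpus_value is a hypothetical helper.

```shell
# Build the value for --gpus from a list of GPU indices.
# The double quotes in the output are part of the value that Docker receives.
# Usage: gpus_value 0 1
# (hypothetical helper, not part of Docker or SIE)
gpus_value() {
  # Join all arguments with commas inside a subshell so IFS is not changed globally.
  gv_ids=$(IFS=,; printf '%s' "$*")
  printf '"device=%s"\n' "$gv_ids"
}
```

Used as `docker run --gpus "$(gpus_value 0 1)" -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default`, this passes the same argument as the hand-written `'"device=0,1"'` form.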

NVIDIA Container Toolkit

The --gpus flag requires the NVIDIA Container Toolkit on the host. Install it first:

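# Ubuntu/Debian
# Add NVIDIA's signing key and the libnvidia-container apt repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# Register the NVIDIA runtime with Docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker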

Environment Variables

Configure the server with environment variables. All variables use the SIE_ prefix.

Core Settings

Variable	Default	Description
`SIE_DEVICE`	`auto`	Compute device: `auto` (detect GPU), `cpu`, `cuda`, `cuda:0`, `mps`
`SIE_MODELS_DIR`	`/app/models`	Path to model configs
`SIE_MODEL_FILTER`	(all)	Comma-separated list of models to load

Batching

Variable	Default	Description
`SIE_MAX_BATCH_REQUESTS`	`64`	Maximum requests per batch
`SIE_MAX_BATCH_WAIT_MS`	`10`	Maximum time in milliseconds to wait for a batch to fill
`SIE_MAX_CONCURRENT_REQUESTS`	`512`	Maximum number of requests allowed in the queue

Memory

Variable	Default	Description
`SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT`	`85`	VRAM usage, in percent, that triggers LRU eviction
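
To get a feel for what the threshold means in absolute terms, the sketch below converts a percentage into MiB for a given card size. It assumes the percentage is measured against the GPU's total VRAM, which this page does not state explicitly. The 24 GiB card and the function name eviction_threshold_mib are illustrative.

```shell
# Convert a percentage threshold into MiB for a GPU of a given size.
# Assumption: the threshold applies to total VRAM.
# Usage: eviction_threshold_mib TOTAL_MIB PERCENT
# (hypothetical helper, not part of SIE)
eviction_threshold_mib() {
  echo $(( $1 * $2 / 100 ))
}

# Hypothetical 24 GiB card (24576 MiB) at the default threshold of 85
eviction_threshold_mib 24576 85
```

Under that assumption, eviction on such a card would start at roughly 20889 MiB of used VRAM. Current usage can be compared against this figure with `nvidia-smi --query-gpu=memory.used,memory.total --format=csv`.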

Observability

Variable	Default	Description
`SIE_LOG_JSON`	`false`	Emit logs in JSON format
`SIE_TRACING_ENABLED`	`false`	Enable OpenTelemetry tracing
`SIE_GPU_TYPE`	(auto)	Override the detected GPU type reported in metrics

Example

docker run --gpus all -p 8080:8080 \
  -e SIE_DEVICE=cuda \
  -e SIE_MAX_BATCH_REQUESTS=128 \
  -e SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT=85 \
  -e SIE_LOG_JSON=true \
  ghcr.io/superlinked/sie-server:latest-cuda12-default
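
When the list of settings grows, Docker's --env-file flag can replace the repeated -e options. The sketch below writes the same four settings to a file and checks that every line uses the SIE_ prefix. The file name sie.env is an arbitrary choice made for this example.

```shell
# Write the settings from the example above into an env file.
# (sie.env is a hypothetical file name)
cat > sie.env <<'EOF'
SIE_DEVICE=cuda
SIE_MAX_BATCH_REQUESTS=128
SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT=85
SIE_LOG_JSON=true
EOF

# Sanity check: warn about any line that is not a SIE_-prefixed assignment.
if grep -qv '^SIE_[A-Z_]*=' sie.env; then
  echo "sie.env contains a line without the SIE_ prefix" >&2
fi

# Then start the container with the file instead of individual -e flags:
# docker run --gpus all -p 8080:8080 --env-file sie.env \
#   ghcr.io/superlinked/sie-server:latest-cuda12-default
```

Keeping the settings in a file also makes it easier to review configuration changes in version control.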

Volume Mounts

HuggingFace Cache

Mount a persistent volume for model weights so they are not re-downloaded each time the container restarts.

docker run --gpus all -p 8080:8080 \
  -v ~/.cache/huggingface:/app/.cache/huggingface \
  ghcr.io/superlinked/sie-server:latest-cuda12-default

The container sets HF_HOME=/app/.cache/huggingface by default, so the mount target must match that path.

Custom Model Configs

Add your own model configs by mounting a directory at /app/models, the default value of SIE_MODELS_DIR:

docker run --gpus all -p 8080:8080 \
  -v /path/to/my-models:/app/models \
  ghcr.io/superlinked/sie-server:latest-cuda12-default

Read-Only Root Filesystem

For security-hardened deployments, run with a read-only root filesystem and explicit writable mounts:

docker run --gpus all -p 8080:8080 \
  --read-only \
  -v hf-cache:/app/.cache/huggingface \
  --tmpfs /tmp:size=1G \
  ghcr.io/superlinked/sie-server:latest-cuda12-default

Docker Compose

Single Service

# docker-compose.yml
services:
  sie:
    image: ghcr.io/superlinked/sie-server:latest-cuda12-default
    ports:
      - "8080:8080"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - hf-cache:/app/.cache/huggingface
    environment:
      - SIE_DEVICE=cuda
      - SIE_MAX_BATCH_REQUESTS=128
    healthcheck:
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8080/healthz')"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

volumes:
  hf-cache:

Multi-Bundle Setup

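Run multiple bundles side by side when you need the SGLang backend alongside the default models. In this example each service is pinned to its own GPU and published on its own host port (8080 and 8081):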

# docker-compose.yml
services:
  sie-default:
    image: ghcr.io/superlinked/sie-server:latest-cuda12-default
    ports:
      - "8080:8080"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    volumes:
      - hf-cache:/app/.cache/huggingface
    environment:
      - SIE_DEVICE=cuda

  sie-sglang:
    image: ghcr.io/superlinked/sie-server:latest-cuda12-sglang
    ports:
      - "8081:8080"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    volumes:
      - hf-cache:/app/.cache/huggingface
    environment:
      - SIE_DEVICE=cuda

volumes:
  hf-cache:

Start the services in the background:

docker compose up -d
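
After the multi-bundle stack is up, both services should answer on their host ports. The sketch below probes /healthz on each port passed to it. The function name check_sie_services is a hypothetical helper, and the ports in the usage note are the ones published in the compose file above.

```shell
# Probe each service's health endpoint and report the result per port.
# The return status is the number of endpoints that did not answer.
# Usage: check_sie_services PORT [PORT...]
# (hypothetical helper, not part of the SIE tooling)
check_sie_services() {
  css_failed=0
  for css_port in "$@"; do
    if curl -fs --max-time 5 "http://localhost:${css_port}/healthz" >/dev/null 2>&1; then
      echo "port ${css_port}: ok"
    else
      echo "port ${css_port}: not ready"
      css_failed=$((css_failed + 1))
    fi
  done
  return "$css_failed"
}
```

For the multi-bundle setup, call it as `check_sie_services 8080 8081`. If a service reports not ready, `docker compose ps` and `docker compose logs` show the container state and output.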

What’s Next

Bundles - dependency isolation for conflicting models
Kubernetes in GCP - production deployment with Helm
Kubernetes in AWS - EKS deployment with Terraform
Troubleshooting - common issues and solutions