Skip to content
Why did we open-source our inference engine? Read the post

Docker

# CPU only
docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default
# With GPU (recommended for production)
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default

Verify the server is running:

curl http://localhost:8080/healthz
# {"status":"ok"}

Images follow the format {version}-{platform}-{bundle}. The floating latest prefix points at the most recent release.

TagBaseUse Case
latest-cuda12-defaultCUDA 12Production with modern NVIDIA GPUs
latest-cuda11-defaultCUDA 11Older NVIDIA GPUs
latest-cpu-defaultUbuntu 22.04Development, ARM64, no GPU

Pinned releases use the version prefix, for example v0.2.0-cuda12-default.

Each platform publishes the bundles below. See Bundles for the models each one includes.

TagPurpose
latest-cuda12-defaultAll standard models: dense, sparse, ColBERT, vision, extraction, cross-encoders
latest-cuda12-sglangLarge LLM embeddings (4B+ params) served through SGLang

CPU and CUDA 11 images follow the same pattern: latest-cpu-default, latest-cpu-sglang, latest-cuda11-default, etc.


docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default
# Use GPU 0 only
docker run --gpus '"device=0"' -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default
# Use GPUs 0 and 1
docker run --gpus '"device=0,1"' -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default

The --gpus flag requires NVIDIA Container Toolkit. Install it first:

# Ubuntu/Debian
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Configure the server with environment variables. All variables use the SIE_ prefix.

VariableDefaultDescription
SIE_DEVICEautoCompute device: auto (detect GPU), cpu, cuda, cuda:0, mps
SIE_MODELS_DIR/app/modelsPath to model configs
SIE_MODEL_FILTER(all)Comma-separated list of models to load
VariableDefaultDescription
SIE_MAX_BATCH_REQUESTS64Maximum requests per batch
SIE_MAX_BATCH_WAIT_MS10Max wait time for batch to fill
SIE_MAX_CONCURRENT_REQUESTS512Queue size limit
VariableDefaultDescription
SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT85VRAM percent that triggers LRU eviction
VariableDefaultDescription
SIE_LOG_JSONfalseUse JSON log format
SIE_TRACING_ENABLEDfalseEnable OpenTelemetry tracing
SIE_GPU_TYPE(auto)Override GPU type for metrics
docker run --gpus all -p 8080:8080 \
-e SIE_DEVICE=cuda \
-e SIE_MAX_BATCH_REQUESTS=128 \
-e SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT=85 \
-e SIE_LOG_JSON=true \
ghcr.io/superlinked/sie-server:latest-cuda12-default

Mount a persistent volume for model weights. This avoids re-downloading on restarts.

docker run --gpus all -p 8080:8080 \
-v ~/.cache/huggingface:/app/.cache/huggingface \
ghcr.io/superlinked/sie-server:latest-cuda12-default

The container uses HF_HOME=/app/.cache/huggingface by default.

Add your own model configs by mounting a directory:

docker run --gpus all -p 8080:8080 \
-v /path/to/my-models:/app/models \
ghcr.io/superlinked/sie-server:latest-cuda12-default

For security-hardened deployments, use read-only root with explicit writable mounts:

docker run --gpus all -p 8080:8080 \
--read-only \
-v hf-cache:/app/.cache/huggingface \
--tmpfs /tmp:size=1G \
ghcr.io/superlinked/sie-server:latest-cuda12-default

# docker-compose.yml
services:
sie:
image: ghcr.io/superlinked/sie-server:latest-cuda12-default
ports:
- "8080:8080"
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
volumes:
- hf-cache:/app/.cache/huggingface
environment:
- SIE_DEVICE=cuda
- SIE_MAX_BATCH_REQUESTS=128
healthcheck:
test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8080/healthz')"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
volumes:
hf-cache:

Run multiple bundles side by side when you need the SGLang backend alongside the default models:

# docker-compose.yml
services:
sie-default:
image: ghcr.io/superlinked/sie-server:latest-cuda12-default
ports:
- "8080:8080"
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ["0"]
capabilities: [gpu]
volumes:
- hf-cache:/app/.cache/huggingface
environment:
- SIE_DEVICE=cuda
sie-sglang:
image: ghcr.io/superlinked/sie-server:latest-cuda12-sglang
ports:
- "8081:8080"
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ["1"]
capabilities: [gpu]
volumes:
- hf-cache:/app/.cache/huggingface
environment:
- SIE_DEVICE=cuda
volumes:
hf-cache:

Start with:

docker compose up -d

Contact us

Tell us about your use case and we'll get back to you shortly.