Offline / Air-Gapped Deployment

Bring SIE up in a cluster with no public internet access. The worker pods normally pull model weights from HuggingFace and container images from GHCR; both of those need to come from inside your network instead.

This guide covers a typical air-gapped flow:

  1. Snapshot model weights on a workstation that has internet access.
  2. Mirror the snapshot to private S3-compatible storage reachable from the cluster.
  3. Configure the chart to read weights from that store and skip HuggingFace.
  4. Mirror the SIE container images to a private registry.
  5. Verify first inference with no egress.

The same pattern works for “restricted egress” clusters that allow private object storage but block public HuggingFace.

Use the hf CLI from the huggingface_hub package (huggingface-cli is the deprecated alias of the same tool and now prints a deprecation warning):

export HF_HUB_CACHE=./offline-weights
# One model
hf download BAAI/bge-m3 --cache-dir ./offline-weights
# A bundle's worth of models, repeated for each model in the bundle
hf download intfloat/e5-base-v2 --cache-dir ./offline-weights
hf download mixedbread-ai/mxbai-rerank-large-v1 --cache-dir ./offline-weights

The result is a directory in HuggingFace cache layout (./offline-weights/models--BAAI--bge-m3/snapshots/<sha>/...) that the chart can mount as HF_HUB_CACHE. The layout stores each file once under blobs/ and exposes it through a symlink under snapshots/<sha>/. Sync tools that follow symlinks upload both paths, so the mirrored size can be roughly 2× the model’s raw byte count. This is expected and does not indicate a broken snapshot.
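
Before mirroring, it can help to confirm that every model you expect actually has a snapshot in the cache. The sketch below is one way to do that with the Python standard library. repo_dir_name and snapshot_inventory are hypothetical helper names, not part of huggingface_hub, and the only assumption about the cache is the models--<org>--<name>/snapshots/<sha>/ layout shown above.

```python
from pathlib import Path


def repo_dir_name(model_id: str) -> str:
    """Map a model id like 'BAAI/bge-m3' to its cache folder name."""
    return "models--" + model_id.replace("/", "--")


def snapshot_inventory(cache_dir: str, model_id: str) -> dict[str, dict[str, int]]:
    """Return {revision: {"files": n, "bytes": total}} for one cached model.

    Sizes follow symlinks, so they reflect the real file contents behind
    each snapshot entry. Broken symlinks are not counted.
    """
    snapshots = Path(cache_dir) / repo_dir_name(model_id) / "snapshots"
    inventory: dict[str, dict[str, int]] = {}
    if not snapshots.is_dir():
        return inventory  # model was never downloaded into this cache
    for revision in sorted(p for p in snapshots.iterdir() if p.is_dir()):
        # is_file() follows symlinks, so snapshot links to blobs are included
        files = [f for f in revision.rglob("*") if f.is_file()]
        inventory[revision.name] = {
            "files": len(files),
            "bytes": sum(f.stat().st_size for f in files),
        }
    return inventory
```

Calling snapshot_inventory("./offline-weights", "BAAI/bge-m3") returns one entry per downloaded revision. An empty dict means the model is missing from the cache and would also be missing from the mirror.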

For gated models, set HF_TOKEN before running the downloads.

Push the snapshot to S3-compatible storage that the cluster can reach. AWS S3, GCS, MinIO, and Ceph all work; the chart treats them the same.

# AWS S3
aws s3 sync ./offline-weights s3://sie-models-private/weights/
# MinIO (in-cluster or on-prem)
mc mirror ./offline-weights minio/sie-models-private/weights/
# GCS
gsutil -m rsync -r ./offline-weights gs://sie-models-private/weights/

Whatever you choose, the URL handed to the chart in the next step must be reachable from worker pods.

Point the chart’s workers.common.clusterCache at the mirrored bucket. Workers will read weights from there instead of HuggingFace.

# values-offline.yaml
workers:
  common:
    clusterCache:
      enabled: true
      url: s3://sie-models-private/weights/  # or gs:// for GCS
    # Disable HuggingFace fallback so workers fail fast if the cache is incomplete
    hfCache:
      home: /models/huggingface
      tokenSecret: ""
# Skip HF token wiring entirely in air-gapped clusters
hfToken:
  create: false

For S3, the workers authenticate via IRSA (EKS) or static credentials supplied through extraEnv. For GCS, they use Workload Identity (GKE). For MinIO or other S3-compatibles, mount credentials via a secret and pass them through workers.common.extraEnv.

The chart pulls these public SIE images from GHCR by default:

| Image | Where it’s set | Tag form |
| --- | --- | --- |
| ghcr.io/superlinked/sie-server | workers.common.image.repository | vX.Y.Z-{platform}-{bundle} (e.g. v0.3.4-cuda12-default) |
| ghcr.io/superlinked/sie-gateway | gateway.image.repository | plain vX.Y.Z |
| ghcr.io/superlinked/sie-config | config.image.repository | plain vX.Y.Z |

sie-server is only published with -{platform}-{bundle} suffixes — ghcr.io/superlinked/sie-server:v0.3.4 (plain) does not exist, and the chart’s worker template assembles the full tag from workers.common.image.tag + -${platform}-${bundle} at install time.
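
The tag rule above is easy to get wrong when scripting a mirror. As a minimal sketch of that rule (worker_image_ref is an illustrative helper, not the chart's template code), the full reference can be assembled like this:

```python
def worker_image_ref(repository: str, tag: str, platform: str, bundle: str) -> str:
    """Build the full sie-server reference: <repository>:<tag>-<platform>-<bundle>."""
    for name, value in (("tag", tag), ("platform", platform), ("bundle", bundle)):
        if not value:
            # A missing part would produce a tag that is not published
            raise ValueError(f"{name} must not be empty")
    return f"{repository}:{tag}-{platform}-{bundle}"


print(worker_image_ref("ghcr.io/superlinked/sie-server", "v0.3.4", "cuda12", "default"))
# ghcr.io/superlinked/sie-server:v0.3.4-cuda12-default
```

The same function applied to a private repository name gives the reference the worker pods will request, which is the exact tag that has to exist in your registry.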

The chart also pulls NATS images via the bundled nats sub-chart (always installed). For a truly air-gapped cluster — one where the cluster host has no public egress, not just the sie namespace — these must be mirrored too:

| Image | Source |
| --- | --- |
| nats:2.12.6-alpine | docker.io / nats.io |
| natsio/nats-server-config-reloader:0.21.1 | docker.io |
| natsio/nats-box:0.19.3 | docker.io |

If you enable optional sub-charts (keda.install=true, kube-prometheus-stack.install=true, dcgm-exporter.install=true, loki.install=true, alloy.install=true), each pulls additional images. To extract the full set for your config, run:

helm template oci://ghcr.io/superlinked/charts/sie-cluster --version 0.3.4 -f values-offline.yaml | grep -oE 'image:.*' | sort -u
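
If you want the image list as clean references rather than raw image: lines (for example, to feed a mirroring script), the same extraction can be sketched in Python. This is an illustrative helper that assumes the rendered manifests write each image as a single-line image: value, which is the usual form in helm template output. extract_images is a hypothetical name.

```python
import re

# Matches "image: ref", "- image: ref", and quoted refs; ignores commented
# lines and keys that merely start with "image" (e.g. imagePullPolicy).
IMAGE_LINE = re.compile(
    r"""^[ \t]*(?:-[ \t]+)?image:[ \t]*["']?([^\s"'#]+)""",
    re.MULTILINE,
)


def extract_images(rendered: str) -> list[str]:
    """Return the sorted, de-duplicated image references in rendered YAML."""
    return sorted(set(IMAGE_LINE.findall(rendered)))
```

To use it with the command above, pipe the helm template output into a small script that calls extract_images(sys.stdin.read()) and prints one reference per line.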

Mirror the SIE images once:

TAG=v0.3.4
PLATFORM=cuda12 # or `cpu` for a CPU-only worker pool
BUNDLE=default
# sie-server: platform/bundle suffix is required — there is no plain `:$TAG` tag
docker pull ghcr.io/superlinked/sie-server:${TAG}-${PLATFORM}-${BUNDLE}
docker tag ghcr.io/superlinked/sie-server:${TAG}-${PLATFORM}-${BUNDLE} \
  private-registry.example.com/sie/sie-server:${TAG}-${PLATFORM}-${BUNDLE}
docker push private-registry.example.com/sie/sie-server:${TAG}-${PLATFORM}-${BUNDLE}
# sie-gateway and sie-config: plain version tag
for img in sie-gateway sie-config; do
  docker pull ghcr.io/superlinked/$img:$TAG
  docker tag ghcr.io/superlinked/$img:$TAG private-registry.example.com/sie/$img:$TAG
  docker push private-registry.example.com/sie/$img:$TAG
done
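
When the image list grows (NATS, optional sub-charts), writing each pull/tag/push by hand gets error-prone. The sketch below generates one copy command per image instead of running anything itself. It assumes a naming scheme where every image lands under a single destination prefix and keeps only its final name:tag, as the SIE images do in the commands above. That scheme, and the helper names mirror_target and mirror_commands, are illustrative assumptions. Two source images with the same final name would collide under it.

```python
def mirror_target(source: str, dest_prefix: str) -> str:
    """Rewrite an image reference so it lives under dest_prefix.

    Only the last path component (name:tag or name@digest) is kept.
    """
    name_and_tag = source.rsplit("/", 1)[-1]
    return f"{dest_prefix.rstrip('/')}/{name_and_tag}"


def mirror_commands(sources: list[str], dest_prefix: str) -> list[str]:
    """Return one 'crane copy <src> <dst>' command line per source image."""
    return [f"crane copy {src} {mirror_target(src, dest_prefix)}" for src in sources]


for line in mirror_commands(
    ["ghcr.io/superlinked/sie-gateway:v0.3.4", "natsio/nats-box:0.19.3"],
    "private-registry.example.com/sie",
):
    print(line)
```

Review the printed commands before running them. If you relocate sub-chart images this way, the sub-chart's own image settings must also point at the new location.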

Note on architecture mismatch: docker pull on a host whose architecture differs from the cluster nodes’ (e.g. an arm64 Mac mirroring images for an amd64 EKS cluster) silently pulls the host’s platform unless you pass --platform, and the subsequent docker push publishes only the platform that was pulled. Worker pods scheduled on nodes of a different architecture then fail to pull the image with the error no match for platform in manifest. For arch-safe mirroring, use crane (for example, brew install crane on macOS). It copies every platform in the image without going through the host’s container runtime:

crane copy ghcr.io/superlinked/sie-server:${TAG}-${PLATFORM}-${BUNDLE} \
  private-registry.example.com/sie/sie-server:${TAG}-${PLATFORM}-${BUNDLE}
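
To check which platforms actually arrived in the private registry, you can inspect the mirrored manifest. crane manifest <image> prints the raw manifest JSON. The sketch below reads that JSON and lists the platforms, assuming the reference resolves to an OCI image index or Docker manifest list (both carry a manifests array with a platform object per entry). index_platforms is a hypothetical helper name.

```python
import json


def index_platforms(manifest_json: str) -> list[str]:
    """List 'os/arch' platforms declared in an image index.

    Returns an empty list when the JSON is a single-platform manifest
    (no 'manifests' array). Entries with an 'unknown' platform, which
    some build tools use for attestations, are skipped.
    """
    doc = json.loads(manifest_json)
    platforms = set()
    for entry in doc.get("manifests", []):
        platform = entry.get("platform") or {}
        os_name = platform.get("os")
        arch = platform.get("architecture")
        if os_name and arch and os_name != "unknown":
            platforms.add(f"{os_name}/{arch}")
    return sorted(platforms)
```

If the cluster nodes are amd64, the result for each mirrored image should include linux/amd64. An empty list means the reference is a single-platform manifest, which is the typical outcome of a docker pull followed by docker push.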

Then point the chart at your registry. Note workers.common.image.tag stays as the plain version — the chart appends -{platform}-{bundle} automatically:

# values-offline.yaml (continued)
gateway:
  image:
    repository: private-registry.example.com/sie/sie-gateway
    tag: v0.3.4
config:
  image:
    repository: private-registry.example.com/sie/sie-config
    tag: v0.3.4
workers:
  common:
    image:
      repository: private-registry.example.com/sie/sie-server
      tag: v0.3.4  # chart appends -${platform}-${bundle} at install time
    platform: cuda12  # or "cpu"
    bundle: default
global:
  imagePullSecrets:
    - name: regcred

If your registry needs auth, create the regcred Docker secret in the sie namespace before installing the chart:

kubectl create secret docker-registry regcred \
  --docker-server=private-registry.example.com \
  --docker-username=... \
  --docker-password=... \
  -n sie

Install the chart with the offline values. The cluster itself needs no internet egress, but the machine running helm must still reach ghcr.io to fetch the chart:

helm upgrade --install sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --version 0.3.4 \
  -f values-offline.yaml \
  -n sie --create-namespace

For a fully air-gapped install, mirror the chart as well (recommended): pull it once on a machine with internet access, then install from the local .tgz:

helm pull oci://ghcr.io/superlinked/charts/sie-cluster --version 0.3.4
# Move sie-cluster-0.3.4.tgz onto the air-gapped workstation, then:
helm upgrade --install sie ./sie-cluster-0.3.4.tgz \
  -f values-offline.yaml \
  -n sie --create-namespace

Verify first inference. Install the SDK and run the GPU or CPU smoke test depending on your worker pool:

kubectl -n sie port-forward svc/sie-sie-cluster-gateway 8080:8080 &
# Install the Python SDK (requires Python 3.12 — see the SDK README for newer/older Python notes)
pip install sie-sdk

For a GPU worker pool (workers.common.platform: cuda12, workers.pools.l4.enabled: true):

python3 -c "
from sie_sdk import SIEClient
client = SIEClient('http://localhost:8080')
result = client.encode('BAAI/bge-m3', {'text': 'hello world'},
                       gpu='l4', wait_for_capacity=True, provision_timeout_s=600)
print(result['dense'].shape) # (1024,)
"

For a CPU worker pool (workers.common.platform: cpu, workers.pools.cpu.enabled: true, useful for local clusters or small offline deployments without a GPU):

python3 -c "
from sie_sdk import SIEClient
client = SIEClient('http://localhost:8080')
result = client.encode('BAAI/bge-m3', {'text': 'hello world'},
                       gpu='cpu', wait_for_capacity=True, provision_timeout_s=600)
print(result['dense'].shape) # (1024,)
"

The first request still pays the cold-start cost, but the weight load now comes from your private store rather than HuggingFace. CPU inference will be substantially slower than GPU for the same model.
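
The smoke test shows that inference works, but not that egress is actually blocked. One way to check that half is a plain TCP probe run from a pod or host inside the cluster network. This is a minimal sketch using only the standard library. can_connect is a hypothetical helper, and the hostnames in the usage note are placeholders for your own endpoints.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    DNS failures, refused connections, and timeouts all count as unreachable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

In an air-gapped cluster you would expect can_connect("huggingface.co", 443) to return False and the same call against your private object store and registry endpoints to return True. A TCP probe only tests reachability, not credentials or bucket permissions.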

| Symptom | Likely cause |
| --- | --- |
| Worker pod stuck in Init with 403 Forbidden from S3/GCS | IRSA/Workload Identity missing the bucket-read permission |
| ImagePullBackOff on a worker pod | Registry credentials missing, or imagePullSecrets not wired |
| Worker logs show OSError: Couldn't reach huggingface.co | clusterCache URL typo or bucket missing the requested model |
| Chart install hangs on dependency download | Sub-charts (KEDA, kube-prometheus-stack, DCGM) trying to fetch from public Artifact Hub. Use helm pull with --untar and install the local copy. |
