Bring SIE up in a cluster with no public internet access. The worker pods normally pull model weights from HuggingFace and container images from GHCR; both of those need to come from inside your network instead.
This guide covers a typical air-gapped flow:
1. Snapshot model weights on a workstation that has internet access.
2. Mirror the snapshot to private S3-compatible storage reachable from the cluster.
3. Configure the chart to read weights from that store and skip HuggingFace.
4. Mirror the SIE container images to a private registry.
5. Verify first inference with no egress.
The same pattern works for “restricted egress” clusters that allow private object storage but block public HuggingFace.
The result is a directory in HuggingFace cache layout (./offline-weights/models--BAAI--bge-m3/snapshots/<sha>/...) that the chart can mount as HF_HUB_CACHE. The cache layout stores both blob files and snapshot symlinks, so the on-disk and mirrored sizes will be roughly 2× the model’s raw byte count — expected, not duplication.
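The snapshot and mirror steps can be sketched as follows. This is a minimal sketch under stated assumptions, not necessarily the exact commands the chart authors use: it assumes the `huggingface-cli` tool (from the `huggingface_hub` package) and the AWS CLI are installed on the workstation, and it reuses the model ID and bucket URL that appear elsewhere in this guide. The `run` helper is a hypothetical dry-run wrapper: it only prints each command unless `DRY_RUN=0` is set.

```shell
# Sketch: snapshot one model into HuggingFace cache layout, then mirror it.
MODEL_ID="BAAI/bge-m3"
CACHE_DIR="./offline-weights"
BUCKET_URL="s3://sie-models-private/weights/"   # assumed to match clusterCache.url

# Hypothetical helper: print the command by default, execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# HuggingFace names cache directories models--<org>--<name>.
SNAPSHOT_DIR="$CACHE_DIR/models--$(printf '%s' "$MODEL_ID" | sed 's|/|--|g')"
echo "snapshot directory: $SNAPSHOT_DIR"

# 1. Download into the cache layout (needs internet access).
run huggingface-cli download "$MODEL_ID" --cache-dir "$CACHE_DIR"

# 2. Mirror to private object storage. `aws s3 sync` follows symlinks by
#    default, so the snapshot symlinks are uploaded as regular objects.
run aws s3 sync "$CACHE_DIR" "$BUCKET_URL"
```

For MinIO or another S3-compatible store, the AWS CLI accepts an `--endpoint-url` option on the `sync` command to target a non-AWS endpoint.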
Point the chart’s workers.common.clusterCache at the mirrored bucket. Workers will read weights from there instead of HuggingFace.
# values-offline.yaml
workers:
  common:
    clusterCache:
      enabled: true
      url: s3://sie-models-private/weights/  # or gs:// for GCS
    # Disable HuggingFace fallback so workers fail fast if the cache is incomplete
    hfCache:
      home: /models/huggingface
      tokenSecret: ""
# Skip HF token wiring entirely in air-gapped clusters
hfToken:
  create: false
For S3, the workers authenticate via IRSA (EKS) or static credentials supplied through extraEnv. For GCS, they use Workload Identity (GKE). For MinIO or other S3-compatibles, mount credentials via a secret and pass them through workers.common.extraEnv.
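For the static-credentials case, one way to get the keys into the cluster is a Kubernetes secret built from an env file. The sketch below is an assumption-heavy illustration: the secret name `sie-s3-credentials`, the `sie` namespace, and the env file path are all hypothetical, and the exact shape of `workers.common.extraEnv` is not shown here, so check the chart's values reference before wiring the secret in. The `run` helper only prints the command unless `DRY_RUN=0` is set.

```shell
# Sketch: store S3-compatible credentials in a secret (names are hypothetical).
NAMESPACE="sie"                    # assumed install namespace
SECRET_NAME="sie-s3-credentials"   # hypothetical secret name
ENV_FILE="./s3-credentials.env"    # lines such as AWS_ACCESS_KEY_ID=... and
                                   # AWS_SECRET_ACCESS_KEY=...

# Hypothetical helper: print the command by default, execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Reading from a file keeps the keys out of shell history.
run kubectl create secret generic "$SECRET_NAME" \
  --namespace "$NAMESPACE" \
  --from-env-file="$ENV_FILE"
```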
sie-server is only published with -{platform}-{bundle} suffixes — ghcr.io/superlinked/sie-server:v0.3.4 (plain) does not exist, and the chart’s worker template assembles the full tag from workers.common.image.tag + -${platform}-${bundle} at install time.
The chart also pulls NATS images via the bundled nats sub-chart (always installed). For a truly air-gapped cluster — one where the cluster host has no public egress, not just the sie namespace — these must be mirrored too:
| Image | Source |
| --- | --- |
| nats:2.12.6-alpine | docker.io / nats.io |
| natsio/nats-server-config-reloader:0.21.1 | docker.io |
| natsio/nats-box:0.19.3 | docker.io |
If you enable optional sub-charts (keda.install=true, kube-prometheus-stack.install=true, dcgm-exporter.install=true, loki.install=true, alloy.install=true), each pulls additional images. Run helm template oci://ghcr.io/superlinked/charts/sie-cluster --version 0.3.4 -f values-offline.yaml | grep -oE 'image:.*' | sort -u to extract the full set for your config.
Mirror the SIE images once:
TAG=v0.3.4
PLATFORM=cuda12  # or `cpu` for a CPU-only worker pool
BUNDLE=default
# sie-server: platform/bundle suffix is required — there is no plain `:$TAG` tag
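A pull, tag, and push sequence for the suffixed sie-server tag might look like the sketch below. The destination registry `registry.internal.example.com/sie` and the `NODE_ARCH` value are hypothetical placeholders; only the sie-server image is shown, and any other images reported by the `helm template` command above would follow the same pattern. The `run` helper prints each command unless `DRY_RUN=0` is set.

```shell
# Sketch: mirror one sie-server tag with docker (registry name is hypothetical).
TAG=v0.3.4
PLATFORM=cuda12
BUNDLE=default
DEST_REGISTRY="registry.internal.example.com/sie"   # hypothetical private registry
NODE_ARCH="linux/amd64"                              # assumed cluster node architecture

# Hypothetical helper: print the command by default, execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# The published tag always carries the -{platform}-{bundle} suffix.
FULL_TAG="${TAG}-${PLATFORM}-${BUNDLE}"
SRC="ghcr.io/superlinked/sie-server:${FULL_TAG}"
DEST="${DEST_REGISTRY}/sie-server:${FULL_TAG}"

# Pin the platform explicitly so the host architecture does not decide it.
run docker pull --platform "$NODE_ARCH" "$SRC"
run docker tag "$SRC" "$DEST"
run docker push "$DEST"
```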
Note on architecture mismatch: docker pull on a host whose architecture differs from the cluster nodes’ (e.g. an arm64 Mac mirroring images for an amd64 EKS cluster) silently pulls the host’s platform unless you pass --platform, and the subsequent docker push publishes only the platform that was pulled. Worker pods scheduled on nodes of a different architecture then fail with no match for platform in manifest. For arch-safe mirroring, use crane (brew install crane), which copies all platforms without going through the host’s container runtime:
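A crane-based mirror loop could be sketched like this. It relies on `crane copy`, which copies a multi-platform index as-is. The destination registry is a hypothetical placeholder, and the `mirror_ref` helper and the repository paths it produces are illustrative assumptions: the chart may expect different repository names in your registry. The `run` helper prints each command unless `DRY_RUN=0` is set.

```shell
# Sketch: copy images with all platforms intact (registry name is hypothetical).
DEST_REGISTRY="registry.internal.example.com"   # hypothetical private registry

# Hypothetical helper: print the command by default, execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Map a source reference to its mirrored location by swapping the registry
# host. Docker Hub short names have no host component and are kept whole.
mirror_ref() {
  src=$1
  case "$src" in
    */*) first=${src%%/*} ;;
    *)   first="" ;;
  esac
  case "$first" in
    *.*|*:*|localhost) path=${src#*/} ;;   # first component is a registry host
    *)                 path=$src ;;        # Docker Hub short name
  esac
  printf '%s/%s\n' "$DEST_REGISTRY" "$path"
}

for src in \
  ghcr.io/superlinked/sie-server:v0.3.4-cuda12-default \
  nats:2.12.6-alpine \
  natsio/nats-server-config-reloader:0.21.1 \
  natsio/nats-box:0.19.3
do
  run crane copy "$src" "$(mirror_ref "$src")"
done
```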
Then point the chart at your registry. Note workers.common.image.tag stays as the plain version — the chart appends -{platform}-{bundle} automatically:
If you also mirrored the chart itself (recommended for fully air-gapped), pull it once with helm pull oci://ghcr.io/superlinked/charts/sie-cluster --version 0.3.4 and install from the local .tgz:
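The pull-then-install sequence can be sketched as follows. The release name `sie` and the `sie` namespace are assumptions, not values confirmed by the chart; the archive name follows Helm's `<chart>-<version>.tgz` convention for `helm pull`. The `run` helper prints each command unless `DRY_RUN=0` is set.

```shell
# Sketch: fetch the chart once, then install from the local archive.
CHART_REF="oci://ghcr.io/superlinked/charts/sie-cluster"
CHART_VERSION="0.3.4"
CHART_TGZ="sie-cluster-${CHART_VERSION}.tgz"   # file name written by helm pull
RELEASE="sie"                                  # hypothetical release name
NAMESPACE="sie"                                # assumed install namespace

# Hypothetical helper: print the command by default, execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# On the internet-connected workstation:
run helm pull "$CHART_REF" --version "$CHART_VERSION"

# Inside the air-gapped environment, after copying the .tgz and values file:
run helm install "$RELEASE" "./$CHART_TGZ" \
  --namespace "$NAMESPACE" --create-namespace \
  -f values-offline.yaml
```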
For a CPU worker pool (workers.common.platform: cpu, workers.pools.cpu.enabled: true, useful for local clusters or small offline deployments without a GPU):
python3 -c "
from sie_sdk import SIEClient
client = SIEClient('http://localhost:8080')
result = client.encode('BAAI/bge-m3', {'text': 'hello world'},
The first request still pays the cold-start cost, but the weight load now comes from your private store rather than HuggingFace. CPU inference will be substantially slower than GPU for the same model.