Open source inference for agents
Build agents with small open models
One cluster to power your agent
Browse all models
Embed, match, and rerank to retrieve the right context.
PDFs, Office files, and scans become clean markdown.
Schema-valid JSON, extracted or generated.
A safety verdict with a probability you threshold.
Plan steps and call tools with an open LLM, streaming included.
Start building with examples
Run it yourself, or let us run it for you
Self-host
Docker on a laptop, Helm on your cluster, Terraform on AWS, GCP, and Azure. Air-gapped installs run from mirrored model snapshots.
Read the quickstart
Managed
We carry the GPU quotas, the autoscaling, the model tuning, and the upgrades. Zero data retention: requests process in flight, nothing persists.
Free hosted capacity for selected projects.
Agent plugin
Drops into your agent stack and routes document work (parsing, extraction, summarization, question answering, image description) off your frontier-model bill. It can also redact sensitive data before it reaches a third-party model.
How does SIE work?
Engine docs
SIE is a Kubernetes inference cluster: a stateless gateway publishes work to one queue, and worker pods pull it, form full batches, and share GPUs across many models.
Worker pools, model stacking & auto-scaling
Cluster-wide queue maximizes GPU utilization
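With a conventional router, each request is committed to a worker blind, before anyone knows how that worker's queue will fill, so queues run uneven and mixed request sizes pack poorly on each worker.
With SIE, each server sidecar pulls from one pool queue, so mixed sizes pack cleanly into full batches.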
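The difference between blind routing and a shared pool queue can be illustrated with a toy simulation. This is a minimal sketch, not SIE's scheduler: the round-robin router, the greedy first-fit packing, the token budget per batch, and every function name below are assumptions made for illustration.

```python
from typing import List


def first_fit(sizes: List[int], capacity: int) -> List[List[int]]:
    """Greedy first-fit: put each request in the first batch that has room."""
    batches: List[List[int]] = []
    for size in sizes:
        for batch in batches:
            if sum(batch) + size <= capacity:
                batch.append(size)
                break
        else:
            # No open batch had room, so start a new one.
            batches.append([size])
    return batches


def routed(sizes: List[int], workers: int, capacity: int) -> List[List[int]]:
    """Hypothetical blind router: round-robin each request to a worker,
    then let every worker pack only its own queue."""
    queues = [sizes[i::workers] for i in range(workers)]
    return [batch for queue in queues for batch in first_fit(queue, capacity)]


def pooled(sizes: List[int], capacity: int) -> List[List[int]]:
    """Hypothetical shared pool queue: batches are packed from all requests."""
    return first_fit(sizes, capacity)


def utilization(batches: List[List[int]], capacity: int) -> float:
    """Fraction of the total batch budget that carries real work."""
    if not batches:
        return 0.0
    return sum(sum(batch) for batch in batches) / (len(batches) * capacity)


if __name__ == "__main__":
    # Alternating large and small requests, measured in tokens (made-up data).
    sizes = [6, 2, 6, 2, 6, 2, 6, 2]
    capacity = 8  # assumed token budget per batch

    for name, batches in (
        ("routed", routed(sizes, workers=2, capacity=capacity)),
        ("pooled", pooled(sizes, capacity)),
    ):
        print(name, len(batches), "batches, utilization",
              utilization(batches, capacity))
```

With this input, round-robin sends every large request to one worker and every small request to the other, so the large requests cannot be paired with small ones. Packing from a single pool lets each large request share a batch with a small one.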
SIE is the only full-stack inference solution for agents
| | Agent workloads | Install & ops | Model updates | Cloud setup |
|---|---|---|---|---|
| SIE | ✓✓✓ LLM, OCR, vision, embeddings, rerank, policy | ✓✓✓ Helm chart; KEDA scale-from-zero | ✓✓✓ profile hot reload, no restarts | ✓✓✓ Terraform for AWS, GCP, Azure |
| NVIDIA Dynamo + Triton | ✓ LLM plus general model serving | ✓✓✓ operator, Helm, CRDs | ✓ model repository | ✓ cloud guides |
| llm-d + vLLM | ✓ LLM serving through vLLM | ✓✓✓ K8s scheduler + routing | ✓ K8s rollout | ✓ reference deploys |
| llama-swap | ✓ LLMs via OpenAI-compatible servers | ✗ single-host proxy, no orchestration | ✓✓✓ swaps model servers on demand | ✗ write your own IaC |
| Runtimes / backends | Wraps: SGLang · vLLM · TensorRT-LLM · TEI. SIE wraps whichever proves best for a given model. Native: PyTorch (Python) · Candle (Rust). Its own backends for maximum performance, and for models others can't serve. | | | |