Why did we open-source our inference engine? Read the post

Build agents with small open models

Cheaper
50x
gte-multilingual vs text-embedding-3
all-in on AWS EKS (L4)
Faster
2.7x
bge-m3 vs Cohere rerank-3.5
MTEB AskUbuntu
Smart
96%
Qwen3.6-27B vs GPT-5.1
AA Intelligence Index
Yours
100%
in your cloud, or air-gapped

One cluster to power your agent

Browse all models
contract_agent.py
from agents import (
    Agent, Runner, FileSearchTool,
    function_tool, input_guardrail, set_default_openai_client,
)
from openai import AsyncOpenAI
from pydantic import BaseModel

# one line sends every model call to your own cluster
set_default_openai_client(AsyncOpenAI(base_url="https://sie.internal/v1"))

class Contract(BaseModel):
    party: str
    renewal_date: str

@function_tool
async def parse_document(url: str) -> str:  # (2) PDF / scan to markdown
    return await sie_ocr(url)  # sie_ocr: your SIE OCR call, not defined here

@input_guardrail  # (4)
async def safety(ctx, agent, text): ...

agent = Agent(
    name="Contract reviewer",  # Agent requires a name; this one is illustrative
    tools=[FileSearchTool(max_num_results=3), parse_document],  # (1)
    output_type=Contract, input_guardrails=[safety],  # (3)
)
result = await Runner.run(agent, "Review the Acme MSA renewal.")  # (5)
1
Search

Embed, match, and rerank to retrieve the right context.

bge-m3 · splade-v3 · colbertv2 · qwen3-reranker
2
Document to markdown

PDFs, Office files, and scans become clean markdown.

glm-ocr · mineru · paddleocr-vl · docling
3
Structured output

Schema-valid JSON, extracted or generated.

gliner2 · nuner-zero · qwen3.6-27b
4
Guard content

A safety verdict with a probability you threshold.

granite-guardian-2b
5
Run the agent loop

Plan steps and call tools with an open LLM, streaming included.

qwen3.6-27b · qwen3.6-4b
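
Step 4 returns a probability rather than a hard yes/no, so the caller chooses the cutoff. A minimal sketch of that thresholding follows; the `Verdict` record and the `should_block` helper are hypothetical names for illustration, not part of SIE's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Hypothetical shape of a guard response: a label plus its probability."""
    label: str          # e.g. "unsafe"
    probability: float  # confidence that the label applies, in [0, 1]


def should_block(verdict: Verdict, threshold: float = 0.5) -> bool:
    """Block when the unsafe probability meets or exceeds the caller's threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return verdict.label == "unsafe" and verdict.probability >= threshold


# The same borderline verdict is blocked or allowed depending on the cutoff.
borderline = Verdict(label="unsafe", probability=0.62)
strict = should_block(borderline, threshold=0.3)   # stricter deployment
lenient = should_block(borderline, threshold=0.9)  # more permissive deployment
```

Lowering the threshold trades more false positives for fewer missed cases; where to set it depends on the workload.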

Run it yourself, or let us run it for you

Always free

Self-host

Docker on a laptop, Helm on your cluster, Terraform on AWS, GCP, and Azure. Air-gapped installs run from mirrored model snapshots.

Read the quickstart
Upcoming

Managed

We carry the GPU quotas, the autoscaling, the model tuning, and the upgrades. Zero data retention: requests process in flight, nothing persists.

Free hosted capacity for selected projects.

Upcoming

Agent plugin

Drops into your agent stack and routes document work (parsing, extraction, summarization, question answering, image description) off your frontier-model bill. It can also redact sensitive data before it reaches a third-party model.
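
Since the plugin is still upcoming, its interface is not public. The sketch below only illustrates the two ideas in the description: sending the listed document tasks to a local cluster, and redacting text before it leaves for a third-party model. The names `LOCAL_TASKS`, `pick_backend`, and `redact` are assumptions, and the email regex is a deliberately simple stand-in for real sensitive-data detection.

```python
import re

# Task names taken from the description above; the set itself is an assumption.
LOCAL_TASKS = {
    "parsing", "extraction", "summarization",
    "question answering", "image description",
}

# Simple email pattern, used here only as an example of sensitive data.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def pick_backend(task: str) -> str:
    """Send document work to the local cluster, everything else to the frontier model."""
    return "local" if task in LOCAL_TASKS else "frontier"


def redact(text: str) -> str:
    """Replace email addresses before the text reaches a third-party model."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


backend = pick_backend("summarization")
outbound = redact("Contact jane.doe@acme.com about the renewal.")
```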

How does SIE work?

Engine docs

SIE is a Kubernetes inference cluster: a stateless gateway publishes work to one queue, and worker pods pull it, form full batches, and share GPUs across many models.
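
The publish-and-pull pattern described here can be sketched in a few lines of asyncio. This is a single-process toy, not SIE's implementation: an in-memory `asyncio.Queue` stands in for the cluster queue, coroutines stand in for worker pods, and the batch size of 4 is arbitrary.

```python
import asyncio


async def worker(queue: asyncio.Queue, batch_size: int, batches: list) -> None:
    """Pull from the shared queue and group items into batches of up to batch_size."""
    while True:
        first = await queue.get()          # wait until at least one item exists
        batch = [first]
        while len(batch) < batch_size and not queue.empty():
            batch.append(queue.get_nowait())
        batches.append(batch)              # stand-in for one GPU forward pass
        for _ in batch:
            queue.task_done()
        await asyncio.sleep(0)             # yield so other workers can pull


async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    batches: list = []
    # The "gateway" keeps no state: it only publishes requests to the queue.
    for request_id in range(10):
        queue.put_nowait(request_id)
    workers = [asyncio.create_task(worker(queue, 4, batches)) for _ in range(2)]
    await queue.join()                     # wait until every request is processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return batches


batches = asyncio.run(main())
```

Because every worker draws from the same queue, a worker never sits idle while another holds a backlog.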

Worker pools, model stacking & auto-scaling

Kubernetes cluster: Amazon EKS · Google GKE · Azure AKS
Gateway: routes each call to its pool

agent-realtime (3 × NVIDIA L4)
  PyTorch: bge-reranker-v2-m3
  Candle: embeddinggemma-300m
  SGLang: Qwen3-0.6B

nightly-pipeline (4 × NVIDIA H100)
  PyTorch: PaddleOCR-VL
  Candle: granite-embedding-r2
  SGLang: gte-Qwen2-7B

eval-suite (2 × NVIDIA RTX PRO 6000)
  PyTorch: colbertv2.0
  Candle: jina-reranker-v2
  SGLang: e5-mistral-7b
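
One way to picture "routes each call to its pool" is a lookup from model name to pool. The sketch below copies the pool, GPU, backend, and model names from the diagram; the dictionary layout and the `route` function are assumptions for illustration, not SIE's configuration format.

```python
# Pool layout copied from the diagram; the dict shape itself is an assumption.
POOLS = {
    "agent-realtime": {
        "gpus": "3 x NVIDIA L4",
        "models": {
            "bge-reranker-v2-m3": "PyTorch",
            "embeddinggemma-300m": "Candle",
            "Qwen3-0.6B": "SGLang",
        },
    },
    "nightly-pipeline": {
        "gpus": "4 x NVIDIA H100",
        "models": {
            "PaddleOCR-VL": "PyTorch",
            "granite-embedding-r2": "Candle",
            "gte-Qwen2-7B": "SGLang",
        },
    },
    "eval-suite": {
        "gpus": "2 x NVIDIA RTX PRO 6000",
        "models": {
            "colbertv2.0": "PyTorch",
            "jina-reranker-v2": "Candle",
            "e5-mistral-7b": "SGLang",
        },
    },
}


def route(model: str) -> tuple[str, str]:
    """Return (pool, backend) for a model, as a gateway lookup might."""
    for pool, spec in POOLS.items():
        backend = spec["models"].get(model)
        if backend is not None:
            return pool, backend
    raise KeyError(f"no pool serves model {model!r}")


pool, backend = route("PaddleOCR-VL")
```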

Cluster-wide queue maximizes GPU utilization

ROUTE-THEN-BATCH
Other solutions with worker-local queues: vLLM · SGLang · TEI · llm-d · Dynamo

A router commits each request blind, so queues run uneven and mixed sizes pack poorly per worker.

GPU efficiency: 51%

POOL-THEN-BATCH
SIE, with one cluster-wide queue per pool

Each SIE server sidecar pulls from one pool queue, so mixed sizes pack cleanly into a full batch.

GPU efficiency: 89%
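
The packing argument can be checked on a toy workload. The sketch below is not SIE's scheduler and does not reproduce the 51% and 89% figures: it assumes a batch capacity of 8 units, two workers, a round-robin router, and first-fit-decreasing packing, all chosen for illustration. The same packing routine runs in both cases, so the only difference is whether requests are split across workers before batching.

```python
def pack(sizes: list[int], capacity: int) -> list[list[int]]:
    """First-fit decreasing: place each request in the first batch with room."""
    batches: list[list[int]] = []
    for size in sorted(sizes, reverse=True):
        for batch in batches:
            if sum(batch) + size <= capacity:
                batch.append(size)
                break
        else:
            batches.append([size])  # no existing batch had room
    return batches


def fill_ratio(batches: list[list[int]], capacity: int) -> float:
    """Fraction of total batch capacity that carries real work."""
    used = sum(sum(batch) for batch in batches)
    return used / (len(batches) * capacity)


CAPACITY = 8                            # assumed batch capacity, arbitrary units
NUM_WORKERS = 2
requests = [6, 2, 6, 2, 5, 3, 5, 3]     # mixed request sizes, arrival order

# Route-then-batch: a blind round-robin router splits requests first,
# then each worker packs only what landed in its local queue.
local_queues = [requests[i::NUM_WORKERS] for i in range(NUM_WORKERS)]
routed = [batch for queue in local_queues for batch in pack(queue, CAPACITY)]

# Pool-then-batch: all requests sit in one queue and are packed together.
pooled = pack(requests, CAPACITY)

routed_fill = fill_ratio(routed, CAPACITY)
pooled_fill = fill_ratio(pooled, CAPACITY)
```

On this input the router happens to send all the large requests to one worker and all the small ones to the other, so large requests cannot pair with small ones; the pooled queue can pair them and needs fewer batches.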

SIE is the only full-stack inference solution for agents

Solution | Agent workloads | Install & ops | Model updates | Cloud setup
SIE | ✓✓✓ LLM, OCR, vision, embeddings, rerank, policy | ✓✓✓ Helm chart; KEDA scale-from-zero | ✓✓✓ profile hot reload, no restarts | ✓✓✓ Terraform for AWS, GCP, Azure
NVIDIA Dynamo + Triton | LLM plus general model serving | ✓✓✓ operator, Helm, CRDs | model repository | cloud guides
llm-d + vLLM | LLM serving through vLLM | ✓✓✓ K8s scheduler + routing | K8s rollout | reference deploys
llama-swap | LLMs via OpenAI-compatible servers | single-host proxy, no orchestration | ✓✓✓ swaps model servers on demand | write your own IaC

✓✓✓ bundled support · ✓ narrow or component-level · ✗ you own that layer

Runtimes / backends
Wraps: SGLang · vLLM · TensorRT-LLM · TEI. SIE wraps whichever proves best for a given model.
Native: PyTorch (Python) · Candle (Rust). Its own backends for maximum performance, and for models others can't serve.

Open-source inference for agents

Open-source inference for the models behind your agents. Run it yourself, or let us run it for you.

GitHub 2.1K

Contact us

Tell us about your use case and we'll get back to you shortly.

Apply for an inference grant

Free capacity on our hosted cluster for selected projects. Tell us what you run and we'll reply by email.