Open source inference for agents

Open models end to end, no frontier-lab dependency

Prompts and documents never leave your cloud

Runs on AWS, GCP, Azure, or air-gapped

“Chroma makes context engineering simple, and SIE adds instruction-following rerankers and relationship extractors.”
Jeff Huber, Chroma

“LanceDB centralizes multi-modal datasets, and with SIE you self-host inference for every transformation.”
Chang She, LanceDB

“Modern search composes the best indexing, scoring, and ranking models. With SIE you self-host them all.”
Andre Zayarni, Qdrant

“Weaviate’s Query Agent unlocks natural-language search, and SIE pre-processes your query and data for lower latency.”
Bob Van Luijt, Weaviate

Build agents with small open models

Cheaper

50x

gte-multilingual vs text-embedding-3

all-in on AWS EKS (L4)

Faster

2.7x

bge-m3 vs Cohere rerank-3.5

MTEB AskUbuntu

Smart

96%

Qwen3.6-27B vs GPT-5.1

AA Intelligence Index

Yours

100%

in your cloud, or air-gapped

Apache 2.0 · SOC2 Type 2 · 112 models

Integrations

One cluster to power your agent

Browse all models

contract_agent.py

from agents import (
    Agent, Runner, function_tool, FileSearchTool, input_guardrail,
    set_default_openai_client,
)
from openai import AsyncOpenAI
from pydantic import BaseModel

# one line sends every model call to your own cluster
set_default_openai_client(AsyncOpenAI(base_url="https://sie.internal/v1"))

class Contract(BaseModel):
    party: str
    renewal_date: str

@function_tool
async def parse_document(url):  # (2) PDF / scan to markdown
    return await sie_ocr(url)  # sie_ocr: OCR helper, not shown here

@input_guardrail  # (4) guard content
async def safety(ctx, agent, text): ...

agent = Agent(
    tools=[FileSearchTool(max_num_results=3), parse_document],  # (1) retrieve context
    output_type=Contract, input_guardrails=[safety],  # (3) structured output
)

result = await Runner.run(agent, "Review the Acme MSA renewal.")  # (5) run the agent loop

Embed, match, and rerank to retrieve the right context.

bge-m3 · splade-v3 · colbertv2 · qwen3-reranker

Document to markdown

PDFs, Office files, and scans become clean markdown.

glm-ocr · mineru · paddleocr-vl · docling

Structured output

Schema-valid JSON, extracted or generated.

gliner2 · nuner-zero · qwen3.6-27b

Guard content

A safety verdict with a probability you threshold.

granite-guardian-2b

Run the agent loop

Plan steps and call tools with an open LLM, streaming included.

qwen3.6-27b · qwen3.6-4b
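The "Guard content" step returns a probability that you threshold yourself. A minimal sketch of that decision follows. The `Verdict` shape, the `should_block` helper, and the default threshold of 0.5 are illustrative assumptions, not SIE's actual response schema or recommended setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Hypothetical guard result; field names are assumptions for illustration."""
    label: str          # category the guard model scored, e.g. "harmful"
    probability: float  # model's probability that the label applies, in [0, 1]


def should_block(verdict: Verdict, threshold: float = 0.5) -> bool:
    """Return True when the unsafe probability meets or exceeds the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return verdict.probability >= threshold


# A stricter deployment lowers the threshold and blocks more borderline inputs.
borderline = Verdict(label="harmful", probability=0.3)
print(should_block(borderline))                 # default threshold
print(should_block(borderline, threshold=0.2))  # stricter threshold
```

Keeping the threshold as a caller-supplied parameter lets each agent trade false positives against false negatives without changing the guard model.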

Start building with examples

Build a multimodal wine recommender with OCR

Self-hosted product search in 5 min

Private fine-tuned compliance RAG

Find the best retrieval strategy for your RAG

Find SOTA embedding models by MTEB task

Build a multi-modal product classifier with embeddings

Run it yourself, or let us run it for you

Always free

Self-host

Docker on a laptop, Helm on your cluster, Terraform on AWS, GCP, and Azure. Air-gapped installs run from mirrored model snapshots.

Read the quickstart

Upcoming

Managed

We carry the GPU quotas, the autoscaling, the model tuning, and the upgrades. Zero data retention: requests process in flight, nothing persists.

Free hosted capacity for selected projects.

Upcoming

Agent plugin

Drops into your agent stack and routes document work (parsing, extraction, summarization, question answering, image description) off your frontier-model bill. It can also redact sensitive data before it reaches a third-party model.

How does SIE work?

Engine docs

SIE is a Kubernetes inference cluster: a stateless gateway publishes work to one queue, and worker pods pull it, form full batches, and share GPUs across many models.

Worker pools, model stacking & auto-scaling

Kubernetes cluster Amazon EKS · Google GKE · Azure AKS

Gateway routes each call to its pool

agent-realtime (3 × NVIDIA L4): PyTorch bge-reranker-v2-m3 · Candle embeddinggemma-300m · SGLang Qwen3-0.6B

nightly-pipeline (4 × NVIDIA H100): PyTorch PaddleOCR-VL · Candle granite-embedding-r2 · SGLang gte-Qwen2-7B

eval-suite (2 × NVIDIA RTX PRO 6000): PyTorch colbertv2.0 · Candle jina-reranker-v2 · SGLang e5-mistral-7b

Cluster-wide queue maximizes GPU utilization

ROUTE-THEN-BATCH

vLLM · SGLang · TEI · llm-d · Dynamo

Other solutions with worker-local queues

A router commits each request blind, so queues run uneven and mixed sizes pack poorly per worker.

51%

GPU efficiency

POOL-THEN-BATCH

SIE pool work stream

SIE cluster-wide queue drives GPU utilization

Each SIE server sidecar pulls from one pool queue, so mixed sizes pack cleanly into a full batch.

89%

GPU efficiency
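To see why batching from one shared pool can pack better than batching after routing, here is a toy sketch. It is not SIE's scheduler and it does not reproduce the 51% and 89% figures: round-robin routing, first-fit-decreasing packing, the request sizes, and the batch capacity are all assumptions chosen for illustration.

```python
from typing import List


def first_fit_decreasing(sizes: List[int], capacity: int) -> List[List[int]]:
    """Pack request sizes into batches of at most `capacity`, largest first."""
    batches: List[List[int]] = []
    for size in sorted(sizes, reverse=True):
        if size > capacity:
            raise ValueError("request larger than batch capacity")
        for batch in batches:
            if sum(batch) + size <= capacity:
                batch.append(size)
                break
        else:
            batches.append([size])  # no existing batch had room
    return batches


def route_then_batch(sizes: List[int], workers: int, capacity: int) -> List[List[int]]:
    """Router assigns requests round-robin first; each worker packs only its own queue."""
    queues = [sizes[i::workers] for i in range(workers)]
    return [batch for queue in queues for batch in first_fit_decreasing(queue, capacity)]


def pool_then_batch(sizes: List[int], capacity: int) -> List[List[int]]:
    """All requests sit in one pool; batches are packed from the whole pool."""
    return first_fit_decreasing(sizes, capacity)


def fill_ratio(batches: List[List[int]], capacity: int) -> float:
    """Fraction of total batch capacity actually used."""
    if not batches:
        return 0.0
    return sum(sum(batch) for batch in batches) / (len(batches) * capacity)


requests = [6, 2, 6, 2, 5, 3, 5, 3]  # hypothetical request sizes
routed = route_then_batch(requests, workers=2, capacity=8)
pooled = pool_then_batch(requests, capacity=8)
print(len(routed), round(fill_ratio(routed, 8), 2))
print(len(pooled), round(fill_ratio(pooled, 8), 2))
```

In this example the router happens to send every large request to one worker and every small request to the other, so neither worker can pair a large request with a small one. Packing from the shared pool sees both sizes at once and needs fewer, fuller batches.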

SIE is the only full-stack inference solution for agents

	Agent workloads	Install & ops	Model updates	Cloud setup
SIE	✓✓✓ LLM, OCR, vision, embeddings, rerank, policy	✓✓✓ Helm chart; KEDA scale-from-zero	✓✓✓ profile hot reload, no restarts	✓✓✓ Terraform for AWS, GCP, Azure
NVIDIA Dynamo + Triton	✓ LLM plus general model serving	✓✓✓ operator, Helm, CRDs	✓ model repository	✓ cloud guides
llm-d + vLLM	✓ LLM serving through vLLM	✓✓✓ K8s scheduler + routing	✓ K8s rollout	✓ reference deploys
llama-swap	✓ LLMs via OpenAI-compatible servers	✗ single-host proxy, no orchestration	✓✓✓ swaps model servers on demand	✗ write your own IaC
Runtimes / backends	Wraps: SGLang · vLLM · TensorRT-LLM · TEI. SIE wraps whichever proves best for a given model.	Native: PyTorch (Python) · Candle (Rust). Its own backends for maximum performance, and models others can't serve.