---
title: Self-hosted document processing for AI agents, with SIE
description: Your agents read PDFs, extract fields, embed chunks, and rerank context. SIE runs all of that document inference on your own GPU, so per-token spend stops scaling with agent usage and customer documents never leave your cloud.
canonical_url: https://superlinked.com/blog/self-hosted-document-processing-for-agents
last_updated: 2026-06-10
---

Most agent work is document work. Before an agent can answer, route a ticket, or fill a form, something has to read the PDF, parse the layout, pull the fields, embed the chunks, and rerank the passages that reach the model. Today most teams rent that inference from an outside provider, one token at a time. As agent usage grows, that choice quietly turns into a margin problem and a data-residency problem at the same time.

<BlogSieCta />

SIE (the Superlinked Inference Engine) is an open-source, Apache 2.0 inference server that runs the document side of your agent stack on your own infrastructure. OCR, parsing, structured and entity extraction, embeddings, and reranking all come from one cluster you install in your cloud, behind three primitives: `encode`, `score`, and `extract`. This page ties together every document-processing capability, the models behind each one, the runnable examples, and the source.

**In short:**

- **Cut per-token spend as agent usage grows.** Move high-volume document inference off metered APIs and onto GPUs you already pay for.
- **Keep prompts and documents inside your cloud.** Nothing is sent to a third-party endpoint, which is what compliance teams actually ask about.
- **Run every document tool from one cluster.** OCR, extraction, embeddings, and reranking share a single deployment instead of four vendor integrations.

[Star SIE on GitHub](https://github.com/superlinked/sie) · [Read the OCR docs](/docs/extract/ocr) · [Browse the model catalog](/models)

## The problem: rented inference for agent document work

When your platform calls a managed API for every page it reads and every chunk it ranks, you inherit three problems that get worse as agents do more.

**Cost.** Token pricing scales directly with usage. As agentic features grow, AI-native companies can bleed margin to their inference vendor on work that does not need a frontier model. Intercom reported cutting roughly $250K per month by replacing GPT with a fine-tuned 14B Qwen model for a single Fin AI pipeline task (Fergal Reid, Intercom Chief AI Officer, on the Chain of Thought podcast, 2026).

**Control.** Customers increasingly want control over the infrastructure their AI runs on, and sending prompts and document context to a third-party endpoint gives that up. On 30 January 2025 the Italian data protection authority blocked DeepSeek from processing Italian users' personal data after finding the company's answers to its privacy inquiry insufficient (Italian Garante decision; Bird & Bird summary, January 2025). Document pipelines move exactly the kind of personal and contractual data that triggers those decisions.

**Portability.** Every customer cloud has its own compliance rules, and platforms that sell into enterprises end up maintaining bespoke deployments. About 70% of enterprises now run hybrid cloud environments, averaging 2.4 public clouds (Flexera 2025 State of the Cloud Report). An inference layer that only lives in one provider's API does not travel with you.

Self-hosting the small, task-specific models that do this work wins all three back. The models are ready: there is now a capable open-source model for nearly any document task, with roughly 100,000 models uploaded to Hugging Face every month. Small open-weight models that fit in 96 GB of VRAM or less now handle real agent workloads.
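
As a rough sanity check on the VRAM claim, weight memory is approximately parameter count times bytes per parameter. The sketch below is back-of-the-envelope arithmetic only: it ignores activations, caches, and runtime overhead, and the helper name `weight_gb` is illustrative, not part of SIE.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_gb(params_billion, dtype="bf16"):
    """Approximate weight memory in GB (1 GB = 1e9 bytes); weights only."""
    return params_billion * BYTES_PER_PARAM[dtype]


# A 14B model (the size cited for Intercom above) and a 0.6B embedding model.
print(weight_gb(14))    # 28
print(weight_gb(0.6))   # 1.2
```

Both figures sit well under 96 GB, which is why several such models can plausibly share one card.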

## The cost angle: stop paying per token for work a small model does better

Document inference is high volume and mostly repetitive: the same OCR, the same embedding model, the same reranker, called millions of times. That is the worst possible shape for per-token pricing and the best possible shape for a GPU you control.

SIE is built to raise GPU utilization without forcing every model onto its own dedicated server:

- **Models stay resident on one GPU** when memory allows. Under memory pressure, the least recently used (LRU) models are evicted and reloaded on demand, so one instance serves many document models without pre-loading everything.
- **Concurrent requests are batched before each GPU pass.** SIE pulls from one shared work queue so mixed request sizes pack into full batches, instead of each worker batching only its own local slice.
- **Full batches keep each GPU doing useful work, not just staying busy.** In Superlinked's benchmarks this batch-then-route design reaches about 89% GPU efficiency versus roughly 51% for the route-then-batch pattern. For small-model workloads, that works out to about 1.8x the throughput per GPU (roughly 80% more) at the same latency.
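
To see why batching from one shared queue helps, here is a toy comparison of the two patterns. It is a sketch under made-up assumptions (32 pending requests, 4 workers, a GPU batch size of 16, round-robin routing), not SIE's scheduler, and "fill ratio" is only a rough stand-in for GPU efficiency.

```python
def batch_then_route(queue, batch_size):
    """Pack one shared queue into batches first; each batch then goes to a worker."""
    return [queue[i:i + batch_size] for i in range(0, len(queue), batch_size)]


def route_then_batch(queue, batch_size, workers):
    """Spread requests across workers first; each worker batches only its own slice."""
    batches = []
    for w in range(workers):
        local = queue[w::workers]  # round-robin assignment to worker w
        batches += [local[i:i + batch_size] for i in range(0, len(local), batch_size)]
    return batches


def fill_ratio(batches, batch_size):
    """Fraction of batch slots that carry a real request (a rough efficiency proxy)."""
    return sum(len(b) for b in batches) / (len(batches) * batch_size)


# Hypothetical window: 32 requests arrive, 4 workers, GPU batch size 16.
pending = list(range(32))
shared = fill_ratio(batch_then_route(pending, 16), 16)
per_worker = fill_ratio(route_then_batch(pending, 16, 4), 16)
print(shared, per_worker)  # 1.0 0.5
```

In this toy window the shared queue needs two full GPU passes where the per-worker pattern needs four half-empty ones. The ratio is illustrative and is not meant to reproduce the benchmark figures above.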

You also skip the per-model tuning that usually eats the savings. Stand-alone runtimes like vLLM, SGLang, and TEI make you hunt for the right flags for every model and GPU (`--max-num-batched-tokens`, `--gpu-memory-utilization`, `--dtype`, `--max-model-len`, and so on). A known failure mode: an embedding model can reserve most of a GPU's memory on load while actual utilization during embedding stays low, wasting the card. SIE ships a tuned config file with each of its 85+ models so the profile is already set.

The net effect: the predictable, high-frequency document inference that an agent platform runs constantly moves to fixed-cost hardware, and your inference bill stops tracking your usage curve.
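
The break-even point is simple arithmetic. The sketch below uses entirely hypothetical prices (a GPU at $2,000 per month, a metered API at $0.50 per million tokens) and ignores operations overhead and utilization, so treat it as a template for your own numbers rather than a quote.

```python
def monthly_api_cost(tokens_per_month, usd_per_million_tokens):
    """Metered bill: grows linearly with token volume."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def break_even_tokens(gpu_usd_per_month, usd_per_million_tokens):
    """Monthly token volume above which the fixed GPU cost is the cheaper option."""
    return gpu_usd_per_month / usd_per_million_tokens * 1_000_000


# Hypothetical inputs, not real prices.
GPU_USD_PER_MONTH = 2000
API_USD_PER_MILLION = 0.5

print(break_even_tokens(GPU_USD_PER_MONTH, API_USD_PER_MILLION))  # 4000000000.0
print(monthly_api_cost(10_000_000_000, API_USD_PER_MILLION))      # 5000.0
```

Under these assumed prices, 10 billion tokens a month costs $5,000 metered against a flat $2,000 for the GPU, and the gap widens with every additional token.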

## The security and control angle: documents never leave your cloud

Only self-hosted inference keeps document data inside the boundary your customers and auditors care about. With SIE, the recognition, parsing, extraction, embedding, and reranking all happen on the host running the cluster. Customer documents are never sent to an outside provider, which removes the entire class of objections a third-party endpoint creates.

That control extends to how you operate the models, without weakening the boundary:

- **Change model behavior without rebuilding the platform or sending prompts out.** Add models and profiles through the Config API or a GitOps workflow.
- **Hot-load LoRA domain adapters through named profiles**, so a compliance or legal variant is a config change, not a redeploy and not an external call.
- **Roll out safely.** The gateway waits for worker acknowledgement before routing traffic to a new model.
- **Run air-gapped.** Model-weight snapshots let SIE run from mirrored registries with no public network access, which is the deployment shape regulated customers tend to require.
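
The safe-rollout behavior can be pictured as a small piece of gateway state. This is a conceptual sketch, not SIE's implementation: the `RolloutGate` class, its method names, and the rule that every announced worker must acknowledge are all assumptions made for illustration.

```python
class RolloutGate:
    """Toy gateway state: a model is routable only after its workers acknowledge it."""

    def __init__(self):
        self._pending = {}  # model_id -> workers that have not acknowledged yet
        self._ready = {}    # model_id -> workers that have acknowledged

    def announce(self, model_id, workers):
        """Record a new model and the workers expected to load it."""
        self._pending[model_id] = set(workers)
        self._ready[model_id] = []

    def ack(self, model_id, worker):
        """A worker reports that the model is loaded and ready."""
        self._pending[model_id].discard(worker)
        if worker not in self._ready[model_id]:
            self._ready[model_id].append(worker)

    def routable(self, model_id):
        """True once at least one worker acked and none are still pending."""
        return bool(self._ready.get(model_id)) and not self._pending[model_id]


gate = RolloutGate()
gate.announce("legal-variant", ["worker-a", "worker-b"])  # hypothetical names
before = gate.routable("legal-variant")
gate.ack("legal-variant", "worker-a")
partial = gate.routable("legal-variant")
gate.ack("legal-variant", "worker-b")
after = gate.routable("legal-variant")
print(before, partial, after)  # False False True
```

Traffic for the new model is held back until the last acknowledgement arrives, so a request never lands on a worker that is still loading weights.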

Because the same cluster installs in your SaaS cloud (one cluster, many tenants) or inside a customer's cloud (one cluster, one customer), the security story is identical wherever the platform runs.

## How the pipeline maps to SIE primitives

A document pipeline is a chain, and every link is one of three SIE primitives:

```
Document (PDF / image / scan)
        │  extract  → OCR / parse → Markdown + JSON
        ├─ extract  → entities, tables, fields
        └─ encode   → chunk + embed → your vector DB
                                          │  score → rerank top-k
                                          ▼
                         structured data + ranked context → your agent
```

`extract` handles OCR, document parsing, and entity and structured extraction. `encode` produces the embeddings you search over. `score` reranks retrieved chunks before they reach the model. One server, one SDK, one deployment.

## Capabilities: six building blocks, one SDK call each

Every block below is the same `client.<primitive>(model_id, item)` call. Swap the model identifier to swap the architecture; SIE hot-loads the weights on first use.
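
Loading on first use, combined with the LRU eviction described earlier, behaves like a size-aware cache. The sketch below is a simplified mental model, not SIE's loader: the `ModelCache` class, the model names, and the sizes are hypothetical.

```python
from collections import OrderedDict


class ModelCache:
    """Size-aware LRU cache: load on first use, evict the least recently used."""

    def __init__(self, capacity_gb, loader):
        self.capacity_gb = capacity_gb
        self.loader = loader            # callable: model_id -> (model, size_gb)
        self._resident = OrderedDict()  # model_id -> (model, size_gb), oldest first

    def used_gb(self):
        return sum(size for _, size in self._resident.values())

    def get(self, model_id):
        if model_id in self._resident:
            self._resident.move_to_end(model_id)  # mark as most recently used
            return self._resident[model_id][0]
        model, size_gb = self.loader(model_id)
        # Evict least recently used models until the new one fits.
        while self._resident and self.used_gb() + size_gb > self.capacity_gb:
            self._resident.popitem(last=False)
        self._resident[model_id] = (model, size_gb)
        return model

    def resident(self):
        return list(self._resident)


# Hypothetical models and sizes in GB.
SIZES = {"ocr": 6, "embed": 2, "rerank": 4}
cache = ModelCache(capacity_gb=10, loader=lambda mid: (f"<{mid} weights>", SIZES[mid]))

for model_id in ["ocr", "embed", "ocr", "rerank"]:
    cache.get(model_id)
print(cache.resident())  # ['ocr', 'rerank']
```

When `rerank` arrives, `embed` is the least recently used model, so it is the one evicted to make room.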

### 1. OCR model serving

Serve four dedicated OCR models plus Florence-2 from one endpoint. They turn document images and PDF pages into Markdown with tables, headings, and reading order preserved, on CPU or GPU.

Models: `zai-org/GLM-OCR`, `lightonai/LightOnOCR-2-1B`, `PaddlePaddle/PaddleOCR-VL-1.5`, `docling`, and `microsoft/Florence-2-base` for flat text.

```python
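from pathlib import Path

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Load one page image; "page.png" is a placeholder path
page_bytes = Path("page.png").read_bytes()

# Image to Markdown, layout preserved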
result = client.extract(
    "zai-org/GLM-OCR",
    Item(images=[{"data": page_bytes, "format": "png"}]),
)
markdown = result["entities"][0]["text"]
```

Docs: [OCR](/docs/extract/ocr) · Models: [extract task](/models#task=extract) · Example: [document OCR](/docs/examples/document-ocr)

### 2. Structured extraction

Go straight from a document image to a typed JSON tree with end-to-end document models, or pull tables, formulas, and charts with PaddleOCR-VL task modes. No prompt engineering and no intermediate text step.

Output: nested JSON in `result["data"]` for document models; Markdown tables for task modes.

```python
# End-to-end document model, image to JSON
result = client.extract(
    "naver-clova-ix/donut-base-finetuned-cord-v2",
    Item(images=[{"data": page_bytes, "format": "png"}]),
)
fields = result["data"]

# Table-structure mode on a single image
tables = client.extract(
    "PaddlePaddle/PaddleOCR-VL-1.5",
    Item(images=[{"data": table_img, "format": "png"}]),
    options={"task": "table"},
)
```
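
Downstream code such as form filling or validation often wants flat field paths instead of a nested tree. The helper below is one possible way to flatten the JSON. The sample tree is invented and only loosely shaped like a receipt parse; the real keys depend on the model you call.

```python
def flatten(tree, prefix=""):
    """Turn nested dicts and lists into {"a.b[0].c": value} pairs."""
    flat = {}
    if isinstance(tree, dict):
        for key, value in tree.items():
            flat.update(flatten(value, f"{prefix}.{key}" if prefix else key))
    elif isinstance(tree, list):
        for i, value in enumerate(tree):
            flat.update(flatten(value, f"{prefix}[{i}]"))
    else:
        flat[prefix] = tree  # leaf value
    return flat


# Hypothetical stand-in for result["data"].
sample = {
    "menu": [{"nm": "Latte", "price": "4.50"}],
    "total": {"total_price": "4.50"},
}
flat_fields = flatten(sample)
for path, value in flat_fields.items():
    print(path, "=", value)
```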

Docs: [Extract overview](/docs/extract) · [Task modes](/docs/extract/ocr#paddleocr-vl-15) · Models: [catalog](/models#task=extract)

### 3. Entity extraction

Run zero-shot named entity recognition over recognized text. Declare your own labels at query time with GLiNER, with no fine-tuning. Relations and classification share the same call and the same server.

Models: GLiNER family and NuNER Zero (entities), GLiREL (relations), plus classification.

```python
# Zero-shot NER, labels defined at query time
result = client.extract(
    "urchade/gliner_multi-v2.1",
    Item(text=recognized_markdown),
    labels=["merchant", "total", "date", "organization"],
)
for e in result["entities"]:
    print(e["label"], e["text"], e["score"])
```
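
Zero-shot output usually needs a confidence cut-off before it feeds an agent. The sketch below filters and groups entities using the `label`, `text`, and `score` keys shown above. The sample entities and the 0.5 threshold are illustrative assumptions to tune on your own data.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.5):
    """Keep entities at or above min_score and group their text by label."""
    grouped = defaultdict(list)
    for e in entities:
        if e["score"] >= min_score:
            grouped[e["label"]].append(e["text"])
    return dict(grouped)


# Hypothetical stand-in for result["entities"].
sample_entities = [
    {"label": "merchant", "text": "Blue Door Cafe", "score": 0.93},
    {"label": "total", "text": "18.40", "score": 0.88},
    {"label": "date", "text": "2025-03-14", "score": 0.71},
    {"label": "organization", "text": "Table 7", "score": 0.22},
]
grouped = group_entities(sample_entities)
print(grouped)
```

The low-scoring `organization` hit is dropped, which keeps a likely false positive out of the agent's context.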

Docs: [NER](/docs/extract) · [Relations & classification](/docs/extract/relations) · Models: [catalog](/models#task=extract)

### 4. Document parsing

Parse whole PDF, DOCX, or HTML files in one call with docling. The default profile uses embedded text for born-digital files; switch on the OCR profile for scanned pages. Output is text, Markdown, and a full document tree ready for chunking.

Output type: `Parsed Document` with `text`, `markdown`, and `document` fields.

```python
# Born-digital PDF: fast, no OCR
result = client.extract(
    "docling",
    Item(document={"data": pdf_bytes, "format": "pdf"}),
)
markdown = result["data"]["markdown"]

# Scanned PDF: turn on the OCR profile
scanned = client.extract(
    "docling",
    Item(document={"data": pdf_bytes, "format": "pdf"}),
    options={"profile": "ocr"},
)
```
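
If you do not know in advance whether a file is born-digital or scanned, one option is to try the default profile first and fall back to OCR when little text comes back. This is a heuristic sketch, not an SIE feature: the `parse` callable is a hypothetical wrapper you would write around the `extract` calls above, and the 200-character threshold is arbitrary.

```python
def parse_with_ocr_fallback(parse, pdf_bytes, min_chars=200):
    """Try embedded text first; rerun with the OCR profile if too little text comes back.

    parse(pdf_bytes, options) must return a Markdown string.
    """
    markdown = parse(pdf_bytes, None)
    if len(markdown.strip()) >= min_chars:
        return markdown, "default"
    return parse(pdf_bytes, {"profile": "ocr"}), "ocr"


def fake_parse(pdf_bytes, options):
    """Stand-in parser: pretends the file is a scan with no embedded text layer."""
    if options and options.get("profile") == "ocr":
        return "# Invoice\n" + "Recognized line. " * 20
    return ""


markdown, profile_used = parse_with_ocr_fallback(fake_parse, b"%PDF-scan")
print(profile_used)  # ocr
```

The trade-off is one wasted fast pass on scanned files in exchange for skipping OCR on every born-digital one.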

Docs: [Docling](/docs/extract/ocr#docling-multi-page-documents) · Model: [docling](/models/docling)

### 5. Embeddings for document search

Chunk the parsed Markdown and embed it for semantic search and RAG. Dense, sparse, and multi-vector outputs come from the same models, so you can run hybrid retrieval over your document corpus.

Models: `Qwen/Qwen3-Embedding-0.6B`, `intfloat/multilingual-e5-large`, `google/embeddinggemma-300m`, `BAAI/bge-m3` (dense, sparse, multi-vector).

```python
# Embed document chunks for your vector DB
chunks = [Item(id=f"chunk-{i}", text=t) for i, t in enumerate(doc_chunks)]
vectors = client.encode("Qwen/Qwen3-Embedding-0.6B", chunks)

# Encode the query side asymmetrically
q = client.encode("Qwen/Qwen3-Embedding-0.6B", Item(text="indemnification cap"), is_query=True)
```
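
The `doc_chunks` list above has to come from somewhere. A minimal heading-based chunker, assuming the parser emits ATX-style `#` headings, might look like the sketch below. The function name and the 1,200-character default are illustrative, not an SIE API.

```python
import re


def chunk_markdown(markdown, max_chars=1200):
    """Split on Markdown headings, then split oversized sections on blank lines."""
    # Zero-width split before each heading line (Python 3.7+).
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        current = ""
        for para in section.split("\n\n"):
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append(current)
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append(current)  # a single oversized paragraph is kept whole
    return chunks


doc = "# Title\nIntro.\n\n## Terms\nClause one.\n\nClause two."
doc_chunks = chunk_markdown(doc, max_chars=25)
print(doc_chunks)
```

Splitting on headings keeps each chunk inside one section, which tends to give the embedding model a coherent unit of meaning. Note that this simple pattern also splits on `#` lines inside fenced code.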

Docs: [Encode](/docs/encode) · [Multimodal](/docs/encode/multimodal) · Models: [catalog](/models)

### 6. Reranking for document retrieval

After first-stage retrieval, rerank the top candidates with a cross-encoder so the most relevant passages reach your model. This two-stage pattern lifts precision without reranking your whole corpus.

Models: `mixedbread-ai/mxbai-rerank-large-v2`, `jinaai/jina-reranker-v2-base-multilingual`, `BAAI/bge-reranker-v2-m3`, plus ColBERT multi-vector reranking.

```python
# Rerank retrieved chunks, keep the top 10
query = Item(text="termination clause liability")
result = client.score("mixedbread-ai/mxbai-rerank-large-v2", query, candidate_chunks)
top_ids = [entry["item_id"] for entry in result["scores"][:10]]
```
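
The two-stage pattern itself is independent of any one library. The sketch below shows it with plain Python: a cheap cosine-similarity pass narrows the corpus, then a more expensive scorer reorders only the survivors. The `rerank` callable, the tiny vectors, and the fake scores are placeholders; in practice the first stage is your vector DB and the second is the `score` call above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage(query_vec, index, rerank, first_k=50, final_k=10):
    """index: {chunk_id: vector}; rerank: callable(chunk_ids) -> {chunk_id: score}."""
    # Stage 1: cheap vector similarity over the whole index.
    candidates = sorted(
        index, key=lambda cid: cosine(query_vec, index[cid]), reverse=True
    )[:first_k]
    # Stage 2: expensive scoring over the shortlist only.
    scores = rerank(candidates)
    return sorted(candidates, key=lambda cid: scores[cid], reverse=True)[:final_k]


# Hypothetical index and reranker scores.
index = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
fake_scores = {"a": 0.2, "b": 0.9, "c": 0.99}
seen = []


def fake_rerank(chunk_ids):
    seen.extend(chunk_ids)  # record what the expensive stage was asked to score
    return {cid: fake_scores[cid] for cid in chunk_ids}


ranked = two_stage([1.0, 0.0], index, fake_rerank, first_k=2, final_k=2)
print(ranked)  # ['b', 'a']
```

Chunk `c` never reaches the reranker because the first stage filtered it out, which is where the cost saving comes from.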

Docs: [Score](/docs/score) · [Reranker models](/docs/score/models) · Models: [catalog](/models#task=score)

## Document models, all in one cluster

Filter the full [catalog](/models) by the [extract](/models#task=extract) and [score](/models#task=score) tasks, or by output type. Every model below is served by the same SIE instance and loaded on demand.

| Model | Stage | Output | Best for |
| --- | --- | --- | --- |
| [zai-org/GLM-OCR](/models/zai-org-glm-ocr) | OCR | Text (Markdown) | High-quality multilingual page OCR (CogViT + GLM, bfloat16 only) |
| [lightonai/LightOnOCR-2-1B](/models/lightonai-lightonocr-2-1b) | OCR | Text (Markdown) | Larger recognition VLM (Pixtral encoder + Qwen3 decoder) |
| [PaddlePaddle/PaddleOCR-VL-1.5](/models/paddlepaddle-paddleocr-vl-1-5) | OCR | Text (Markdown) | 109 languages, smallest, table / formula / chart / seal modes |
| [docling](/models/docling) | Parse | Parsed Document | Multi-page PDF / DOCX / HTML, layout-aware, OCR opt-in |
| [urchade/gliner_multi-v2.1](/models/urchade-gliner_multi-v2-1) | Extract | Entities | Zero-shot multilingual NER, labels at query time |
| [urchade/gliner_multi_pii-v1](/models/urchade-gliner_multi_pii-v1) | Extract | Entities | PII detection and redaction over recognized text |
| [jackboyla/glirel-large-v0](/models/jackboyla-glirel-large-v0) | Extract | Relations | Zero-shot relation extraction between entities |
| [Qwen/Qwen3-Embedding-0.6B](/models/qwen-qwen3-embedding-0-6b) | Encode | Dense | Document chunk embeddings |
| [BAAI/bge-m3](/models/baai-bge-m3--encode) | Encode | Dense / Sparse / Multi-Vec | Dense, sparse, and multi-vector in one call |
| [mixedbread-ai/mxbai-rerank-large-v2](/models/mixedbread-ai-mxbai-rerank-large-v2) | Score | Score | Cross-encoder reranking of retrieved chunks |
| [BAAI/bge-reranker-v2-m3](/models/baai-bge-reranker-v2-m3) | Score | Score | Multilingual cross-encoder reranking |

OCR and document models are validated for correctness in CI. Quality and latency benchmarks for the OCR models are still in progress and will be published once the eval-matrix work lands, so the notes above describe input shape and feature differences rather than measured speed. See [Choosing a Model](/docs/choosing) for selection guidance and [Evals](/docs/evals) for how SIE measures quality.

## Where document work fits in the wider agent stack

Document processing is one lane of the agent inference your platform runs, and SIE routes all of those lanes from the same cluster: planning the next step, calling tools, embedding content, reranking context, parsing documents, and extracting image data, with code, SQL, and policy checks on the roadmap. Each job routes to the right model behind the same SDK.

That coverage is the difference from single-purpose runtimes. Tools like vLLM and SGLang serve LLM generation, and TEI serves embeddings and reranking, but each covers one slice, typically needs a restart to change models, and leaves you to write the cloud IaC yourself.

| Stack | Job coverage | K8s install + ops | Model changes | Cloud setup |
| --- | --- | --- | --- | --- |
| **SIE** (agent-platform inference layer) | LLM, OCR, vision, embeddings, rerank; policy checks on the roadmap | Helm chart for Kubernetes | Profile hot reload | Terraform modules |
| vLLM | LLM generation; embeddings on supported models | Helm path available | Restart for base model | Write cloud IaC |
| SGLang | LLM and vision generation | K8s manifests / LWS | Restart for base model | SkyPilot option |
| TEI | Dense and sparse embeddings; rerank | Own orchestration | Restart for model | Write cloud IaC |
| Dynamo + Triton | LLM plus general model serving | Operator, Helm, CRDs | Model repository | Cloud guides |

For document pipelines specifically, that means OCR, parsing, extraction, embeddings, and reranking live in one place instead of being split across an LLM runtime, an embedding server, and a separate document-AI vendor.

## Deployment: the same layer in any cloud

SIE installs as a Kubernetes inference cluster inside the environment your platform already runs in. The same Docker image runs on a laptop and in production, with no separate production mode.

```bash
# Pull and serve on your own box
docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default

# Or deploy the cluster to your VPC
helm install sie oci://ghcr.io/superlinked/charts/sie-cluster

# Use it from anywhere with the SDK
pip install sie-sdk
```

It is built to be portable and easy to operate: a Helm chart for EKS and GKE, Terraform modules for AWS, GCP, and Azure, and model-weight snapshots for air-gapped environments.

Docs: [Deployment](/docs/deployment) · [Air-gapped](/docs/deployment/offline)

## Runnable examples

- [Swap an OCR model with one identifier change](/docs/examples/document-ocr): a three-stage pipeline (recognition, structured extraction, zero-shot NER) where every stage is the same `extract` call. Runs locally with Docker or as a hosted Hugging Face Space.
- [Multimodal wine recommender with OCR](/docs/examples/wine-recommender): pairs preference-based retrieval and reranking with OCR-style label detection in one UI.
- [Private fine-tuned compliance RAG](/docs/examples/regulatory-intelligence-rag): a regulatory RAG pipeline that hot-loads a domain LoRA at request time and reranks plus prunes context in one pass, fully self-hosted.
- [Find the best retrieval strategy for your RAG](/docs/examples/benchmark): an evals-driven study on 1,854 SEC 10-K queries that compares how different document retrieval pipelines score.

## Document processing resource hub

**Docs:** [OCR](/docs/extract/ocr) · [Extract overview](/docs/extract) · [Relations & classification](/docs/extract/relations) · [Vision tasks](/docs/extract/vision) · [Encode (embeddings)](/docs/encode) · [Score (reranking)](/docs/score)

**Examples:** [Document OCR](/docs/examples/document-ocr) · [Wine recommender](/docs/examples/wine-recommender) · [Compliance RAG](/docs/examples/regulatory-intelligence-rag) · [Retrieval benchmark](/docs/examples/benchmark) · [All examples](/docs/examples)

**Model catalog:** [Extract models](/models#task=extract) · [Score models](/models#task=score) · [docling](/models/docling) · [GLM-OCR](/models/zai-org-glm-ocr) · [Full catalog](/models)

**GitHub & SDK:** [superlinked/sie](https://github.com/superlinked/sie) · [document-ocr source](https://github.com/superlinked/sie/tree/main/examples/document-ocr) · [Hugging Face Space](https://huggingface.co/spaces/superlinked/document-ocr) · [Quickstart](/docs/quickstart) · [Python SDK reference](/docs/reference/sdk)

## FAQ

**How does self-hosting document inference cut cost for an agent platform?**
Document inference is high frequency and repetitive, which is the worst fit for per-token pricing. Moving it to GPUs you control turns a usage-linked bill into a fixed hardware cost. SIE keeps frequently used models resident on one GPU, evicts the least recently used (LRU) ones under memory pressure, and batches concurrent requests before each GPU pass, which is how it reaches high GPU efficiency instead of paying for idle capacity.

**Can I run the whole pipeline air-gapped or on-premise?**
Yes. SIE is open source under Apache 2.0 and ships the same Docker image, Helm chart, and Terraform modules from laptop to Kubernetes, with model-weight snapshots for offline and air-gapped deployment. Documents never leave the host running the cluster.

**Can SIE convert PDFs and scanned documents to Markdown on my own hardware?**
Yes. SIE serves four dedicated OCR models (GLM-OCR, LightOnOCR-2-1B, PaddleOCR-VL-1.5, and docling) plus Florence-2 for flat-text OCR. They convert document images and PDFs to Markdown with tables and headings preserved, on CPU or GPU.

**What is the difference between OCR and document parsing in SIE?**
The recognition OCR models take a single image and return Markdown. docling parses a whole multi-page PDF, DOCX, or HTML file in one call, preserving layout and tables, and returns text, Markdown, and a full document tree. Use the recognition models per image and docling for complete files.

**Does SIE handle structured and entity extraction from documents?**
Yes. End-to-end document models emit JSON directly, PaddleOCR-VL task modes pull tables and formulas, and the extract primitive runs zero-shot NER with GLiNER, relation extraction with GLiREL, and text classification. You declare entity labels at query time without fine-tuning.

**How do embeddings and reranking fit into a document pipeline?**
After parsing, you chunk the Markdown and call `encode` to embed it for semantic search, then store the vectors in your database. At query time you retrieve a broad candidate set, then call `score` to rerank the top candidates with a cross-encoder so the most relevant passages reach your model.

## The takeaway

*Run the AI models that power your agents on your own terms.* [Get started on GitHub](https://github.com/superlinked/sie) or [read the docs](/docs).