Inference

What is Self-Hosted Inference?

Self-hosted inference is the practice of running AI model inference on your own infrastructure, whether your own cloud account (AWS, GCP, Azure) or on-premises hardware, rather than sending requests to a third-party managed API. You control the hardware, the models, the configuration, and crucially, where your data goes.

Why does self-hosted inference matter?

Managed model APIs (OpenAI, Cohere, Voyage AI, etc.) are convenient for prototyping, but they introduce three problems at production scale:

1. Cost

Managed APIs charge per token. For embedding workloads, where you may encode millions of documents regularly, per-token pricing becomes the dominant infrastructure cost. Self-hosting on your own GPUs can reduce this by up to 50x.

2. Data privacy

Every request to a managed API sends your data to a third party’s servers. For regulated industries (legal, healthcare, finance, government), this is often a compliance blocker. Self-hosted inference keeps data entirely within your own cloud account.

3. Model control

Managed APIs offer a fixed menu of models. Self-hosted inference lets you run any open-source model (including fine-tuned or LoRA-adapted models) and swap them without changing your integration.

Self-hosted inference vs managed APIs

	Managed API	Self-hosted inference
Pricing	Per token	Pay for your own GPUs
Cost at scale	High	Up to 50x lower
Data location	Third-party servers	Your own cloud
Model selection	Fixed menu	Any open-source model
Setup complexity	None	Requires deployment
SOC2 / compliance	Depends on vendor	You control it

What does self-hosted inference involve?

At minimum, self-hosting an embedding or reranking model requires:

GPU provisioning: selecting and provisioning appropriate GPU instances (e.g. A100, L4)
Model serving: a server that loads the model and exposes an API endpoint
Batching and concurrency: handling multiple requests efficiently to maximise GPU utilisation
Monitoring: tracking latency, throughput, and GPU utilisation
Model management: loading, swapping, and updating models without downtime

This is non-trivial to build well. Tools like SIE handle all of this out of the box.

How does SIE simplify self-hosted inference?

SIE (Superlinked Inference Engine) is an open-source inference server designed specifically for search and document processing workloads. It deploys into your own AWS or GCP account and handles:

GPU cluster management via Terraform + Helm
Support for 85+ SOTA embedding, reranking, and extraction models
LoRA hot-loading (swap adapters without restarting the server)
Automatic batching for GPU efficiency
A simple SDK for encoding and reranking

# Deploy to AWS
terraform apply
helm install sie oci://ghcr.io/superlinked/charts/sie-cluster

# Use from Python
pip install sie-sdk

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("https://your-sie-endpoint")
vectors = [r["dense"] for r in client.encode("BAAI/bge-m3", [Item(text=d) for d in documents])]

Your data stays in your AWS or GCP account. SIE is Apache 2.0 licensed and SOC2 Type 2 certified.

What workloads benefit most from self-hosted inference?

Self-hosted inference is particularly valuable for:

High-volume embedding pipelines: re-indexing large document corpora frequently
Real-time semantic search: low-latency encoding at query time
RAG applications: both indexing and retrieval steps at scale
Regulated data: legal, medical, financial documents that can’t leave your environment
Custom fine-tuned models: running LoRA adapters trained on your domain

Frequently asked questions

Do I need a dedicated ML team to run self-hosted inference? Not with SIE. Deployment is handled via standard DevOps tooling (Terraform, Helm). If you can deploy a Kubernetes application, you can deploy SIE.

What GPUs does SIE support? SIE supports A100-40GB, A100-80GB, L4, and L4-spot instances on AWS and GCP. Spot instances further reduce cost.

Is self-hosted inference more reliable than managed APIs? You control availability, so reliability depends on your infrastructure. SIE’s cluster mode supports horizontal scaling and failover. The trade-off is that you own the ops, but you’re not subject to third-party outages or rate limits.