
What is Self-Hosted Inference?

Self-hosted inference is the practice of running AI model inference on infrastructure you control, either your own cloud account (AWS, GCP, Azure) or on-premises hardware, rather than sending requests to a third-party managed API. You control the hardware, the models, the configuration, and, crucially, where your data goes.


Why does self-hosted inference matter?

Managed model APIs (OpenAI, Cohere, Voyage AI, etc.) are convenient for prototyping, but they introduce three problems at production scale:

1. Cost

Managed APIs charge per token. For embedding workloads, where you may encode millions of documents on a recurring basis, per-token pricing can become the dominant infrastructure cost. Self-hosting on your own GPUs can reduce this cost by up to 50x, with the actual saving depending on request volume and how fully the GPUs are utilised.
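To make the comparison concrete, here is a rough back-of-the-envelope sketch. Every number in it (token volume, API price, GPU throughput, GPU hourly price) is a hypothetical placeholder rather than a quoted price or benchmark, and it assumes the GPU is fully utilised for the duration of the job.

```python
def managed_api_cost(tokens: int, price_per_million_tokens: float) -> float:
    """Dollar cost of encoding `tokens` tokens at a per-token API price."""
    return tokens / 1_000_000 * price_per_million_tokens


def self_hosted_cost(tokens: int, tokens_per_second: float, gpu_price_per_hour: float) -> float:
    """Dollar cost of the GPU time needed to encode `tokens` tokens.

    Assumes the GPU is busy the whole time; idle capacity is not counted.
    """
    gpu_hours = tokens / tokens_per_second / 3600
    return gpu_hours * gpu_price_per_hour


# Hypothetical inputs: replace with your own measurements and price quotes.
tokens = 10_000_000_000  # tokens encoded per month
api = managed_api_cost(tokens, price_per_million_tokens=0.10)
gpu = self_hosted_cost(tokens, tokens_per_second=50_000, gpu_price_per_hour=1.00)

print(f"managed API: ${api:,.2f}  self-hosted: ${gpu:,.2f}  ratio: {api / gpu:.1f}x")
```

The ratio is very sensitive to throughput and utilisation, so it is worth measuring tokens per second on your own model and hardware before relying on any estimate.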

2. Data privacy

Every request to a managed API sends your data to a third party’s servers. For regulated industries (legal, healthcare, finance, government), this is often a compliance blocker. Self-hosted inference keeps data entirely within your own cloud account or on-premises environment.

3. Model control

Managed APIs offer a fixed menu of models. Self-hosted inference lets you run any open-source model (including fine-tuned or LoRA-adapted models) and swap them without changing your integration.


Self-hosted inference vs managed APIs

                     Managed API            Self-hosted inference
Pricing              Per token              Pay for your own GPUs
Cost at scale        High                   Up to 50x lower
Data location        Third-party servers    Your own cloud
Model selection      Fixed menu             Any open-source model
Setup complexity     None                   Requires deployment
SOC 2 / compliance   Depends on vendor      You control it

What does self-hosted inference involve?

At minimum, self-hosting an embedding or reranking model requires:

  • GPU provisioning: selecting and provisioning appropriate GPU instances (e.g. A100, L4)
  • Model serving: a server that loads the model and exposes an API endpoint
  • Batching and concurrency: handling multiple requests efficiently to maximise GPU utilisation
  • Monitoring: tracking latency, throughput, and GPU utilisation
  • Model management: loading, swapping, and updating models without downtime

This is non-trivial to build well. Tools like SIE handle all of this out of the box.
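To give a sense of what just the batching piece involves, here is a minimal sketch that groups pending requests into batches under both a size cap and an approximate token budget. It is an illustration only, not how SIE or any particular server implements batching; the function name and parameters are hypothetical, and the whitespace token count stands in for a real tokenizer.

```python
from typing import Iterator, List


def batch_by_token_budget(
    texts: List[str], max_batch_size: int, max_batch_tokens: int
) -> Iterator[List[str]]:
    """Yield batches that respect a size cap and an approximate token cap."""
    batch: List[str] = []
    batch_tokens = 0
    for text in texts:
        # Crude whitespace count; a real server would use the model's tokenizer.
        n_tokens = len(text.split())
        would_overflow = bool(batch) and (
            len(batch) >= max_batch_size
            or batch_tokens + n_tokens > max_batch_tokens
        )
        if would_overflow:
            yield batch
            batch, batch_tokens = [], 0
        # A single text larger than the budget still gets a batch of its own.
        batch.append(text)
        batch_tokens += n_tokens
    if batch:
        yield batch


pending = [
    "short query",
    "a somewhat longer document about GPUs",
    "hi",
    "another short one",
]
batches = list(batch_by_token_budget(pending, max_batch_size=2, max_batch_tokens=8))
for b in batches:
    print(len(b), b)
```

A production server would also need to bound how long a request waits for its batch to fill, which is where most of the latency versus throughput tuning happens.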


How does SIE simplify self-hosted inference?

SIE (Superlinked Inference Engine) is an open-source inference server designed specifically for search and document processing workloads. It deploys into your own AWS or GCP account and handles:

  • GPU cluster management via Terraform + Helm
  • Support for 85+ SOTA embedding, reranking, and extraction models
  • LoRA hot-loading (swap adapters without restarting the server)
  • Automatic batching for GPU efficiency
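  • A simple SDK for encoding and reranking

# Deploy to AWS (shell)
terraform apply
helm install sie oci://ghcr.io/superlinked/charts/sie-cluster

# Install the Python SDK (shell)
pip install sie-sdk

# Use from Python
from sie_sdk import SIEClient
from sie_sdk.types import Item

# Placeholder input; replace with your own documents
documents = ["first document text", "second document text"]

client = SIEClient("https://your-sie-endpoint")
results = client.encode("BAAI/bge-m3", [Item(text=d) for d in documents])
vectors = [r["dense"] for r in results]  # one dense vector per document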

Your data stays in your AWS or GCP account. SIE is Apache 2.0 licensed and SOC 2 Type 2 certified.


What workloads benefit most from self-hosted inference?

Self-hosted inference is particularly valuable for:

  • High-volume embedding pipelines: re-indexing large document corpora frequently
  • Real-time semantic search: low-latency encoding at query time
  • RAG applications: both indexing and retrieval steps at scale
  • Regulated data: legal, medical, financial documents that can’t leave your environment
  • Custom fine-tuned models: running LoRA adapters trained on your domain


Frequently asked questions

Do I need a dedicated ML team to run self-hosted inference?
Not with SIE. Deployment is handled via standard DevOps tooling (Terraform, Helm). If you can deploy a Kubernetes application, you can deploy SIE.

What GPUs does SIE support?
SIE supports A100-40GB, A100-80GB, L4, and L4-spot instances on AWS and GCP. Spot instances further reduce cost.

Is self-hosted inference more reliable than managed APIs?
You control availability, so reliability depends on your infrastructure. SIE’s cluster mode supports horizontal scaling and failover. The trade-off is that you own the ops, but you’re not subject to third-party outages or rate limits.

