
How Does SIE Compare to Infinity?

SIE (Superlinked Inference Engine) and Infinity are both open-source servers for self-hosting text embedding and reranking models. Infinity is a lightweight, fast single-model server with a focus on OpenAI-compatible API endpoints. SIE is a broader inference platform with multi-model support, LoRA hot-loading, GPU cluster management via Terraform and Helm, and first-class support for document processing workloads.


Quick comparison

Feature | SIE | Infinity
Model types | Embeddings, rerankers, OCR, extraction | Embeddings, rerankers, CLIP
Multi-model per deployment | ✓ (shared GPU cluster) | Limited (one model per instance typical)
LoRA hot-loading | ✓ |
GPU cluster (Terraform + Helm) | ✓ | Manual
AWS / GCP Terraform modules | ✓ |
SDK | ✓ (sie-sdk) | OpenAI-compatible REST
OpenAI-compatible API | ✓ | ✓ (primary design goal)
Dynamic batching | ✓ | ✓
INT8 / quantisation | |
Licence | Apache 2.0 | MIT
Backed by | Superlinked | Michael Feil (open source)

What is Infinity?

Infinity is a high-throughput embedding inference server created by Michael Feil. Its primary design goals are:

  • OpenAI API compatibility: drop-in replacement for OpenAI’s embedding endpoint, making it easy to swap without changing client code
  • Speed: aggressive batching, CUDA optimisations, and Flash Attention for high throughput
  • Simplicity: minimal configuration, designed to be started with a single Docker command

docker run --gpus all -p 7997:7997 michaelf34/infinity:latest \
  v2 --model-id BAAI/bge-m3 --port 7997

It’s a strong choice for teams that need a quick self-hosted replacement for the OpenAI embeddings API.


When should you use Infinity?

Infinity is a good fit when:

  • You want an OpenAI API drop-in: your existing code uses openai.embeddings.create() and you want to swap to self-hosted without changing client code
  • You need a single model served simply and quickly
  • Your team prefers minimal configuration over infrastructure tooling
  • You’re deploying on existing infrastructure and don’t need Terraform/Helm automation
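
The drop-in swap in the first bullet usually comes down to changing the base URL the client points at. A minimal sketch, assuming the endpoint is selected through environment variables: the variable names EMBEDDINGS_BASE_URL and EMBEDDINGS_API_KEY and the helper resolve_embeddings_endpoint are hypothetical, and port 7997 is taken from the Docker command above.

```python
import os
from typing import Mapping, NamedTuple, Optional

# Default base URL used by the hosted OpenAI API.
OPENAI_DEFAULT_BASE_URL = "https://api.openai.com/v1"


class Endpoint(NamedTuple):
    base_url: str
    api_key: str


def resolve_embeddings_endpoint(env: Optional[Mapping[str, str]] = None) -> Endpoint:
    """Pick the embeddings endpoint from the environment (hypothetical variable names)."""
    env = os.environ if env is None else env
    base_url = env.get("EMBEDDINGS_BASE_URL", OPENAI_DEFAULT_BASE_URL).rstrip("/")
    # Assumption: a placeholder key is enough for a self-hosted server,
    # as in this article's own api_key="dummy" example.
    api_key = env.get("EMBEDDINGS_API_KEY", "dummy")
    return Endpoint(base_url, api_key)


# Pointing at a local Infinity instance changes configuration, not call sites.
local = resolve_embeddings_endpoint({"EMBEDDINGS_BASE_URL": "http://localhost:7997/"})
print(local.base_url)
```

The resolved values can then be passed to OpenAI(base_url=..., api_key=...), so the embeddings.create() calls themselves stay untouched.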

When should you use SIE?

SIE is the better choice when:

  • You need multiple models in one deployment (embedding + reranker + OCR)
  • You want LoRA adapter hot-loading: swap domain-specific adapters per-request without server restart
  • You’re deploying on AWS or GCP and want managed Terraform modules for the full cluster
  • You need document processing capabilities (OCR, extraction) alongside embeddings
  • You want a production-grade SDK rather than raw HTTP calls
  • You need SOC2 Type 2 certified infrastructure
  • You want built-in monitoring and GPU utilisation metrics

Performance comparison

Both servers implement dynamic batching and CUDA-optimised inference. In single-model, single-GPU benchmarks, Infinity and SIE achieve comparable throughput, because both are bottlenecked by the GPU rather than the server layer.
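
Dynamic batching means the server groups requests that arrive close together into one GPU forward pass instead of running them one at a time. The sketch below illustrates the general idea only and is not either server's actual scheduler; the function name and the max_batch_size and max_wait_ms parameters are hypothetical.

```python
from typing import List, Tuple

Request = Tuple[float, str]  # (arrival time in ms, text)


def dynamic_batches(
    requests: List[Request], max_batch_size: int, max_wait_ms: float
) -> List[List[str]]:
    """Group requests into batches.

    A batch closes when it is full, or when the next request arrives more
    than max_wait_ms after the batch was opened.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    opened_at = 0.0
    for arrival_ms, text in sorted(requests):
        batch_full = len(current) >= max_batch_size
        waited_too_long = arrival_ms - opened_at > max_wait_ms
        if current and (batch_full or waited_too_long):
            batches.append(current)
            current = []
        if not current:
            opened_at = arrival_ms  # this request opens a new batch
        current.append(text)
    if current:
        batches.append(current)
    return batches


arrivals = [(0.0, "a"), (1.0, "b"), (2.0, "c"), (30.0, "d"), (31.0, "e")]
print(dynamic_batches(arrivals, max_batch_size=2, max_wait_ms=5.0))
```

Larger batches raise GPU utilisation at the cost of a small wait per request, which is why the batch size and wait window are the two knobs in this sketch.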

The performance difference emerges at scale:

  • Multi-model workloads: SIE’s shared GPU memory pool is more efficient than running separate Infinity instances per model
  • Cluster scale: SIE’s auto-scaling handles traffic spikes; Infinity requires manual scaling
  • Concurrent mixed workloads: encoding + reranking in the same pipeline benefits from SIE’s coordinated batching

See the SIE vs TEI vs OpenAI benchmark for detailed throughput and cost data.


Migration path: Infinity → SIE

If you’re using Infinity and want to move to SIE, the transition is straightforward. SIE exposes an OpenAI-compatible endpoint, so client code changes are minimal:

# Before (Infinity or OpenAI)
from openai import OpenAI

texts = ["first document", "second document"]  # example inputs

client = OpenAI(base_url="http://localhost:7997", api_key="dummy")
response = client.embeddings.create(model="BAAI/bge-m3", input=texts)
vectors = [e.embedding for e in response.data]

# After (SIE SDK: more features, same data)
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")
results = client.encode("BAAI/bge-m3", [Item(text=t) for t in texts])
vectors = [r["dense"] for r in results]

Alternatively, keep the same OpenAI client code and only update base_url to point at SIE's OpenAI-compatible REST endpoint.
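
For clients that do not use the OpenAI SDK, the same swap can be made at the HTTP level. A minimal sketch using only the standard library, assuming the server follows the OpenAI embeddings request and response shape (a JSON body with model and input, and a data list whose items carry index and embedding). The helper names are hypothetical; the /v1 prefix is taken from the /v1/embeddings path mentioned in the FAQ, and port 8080 from the SDK example above.

```python
import json
import urllib.request
from typing import List


def build_embeddings_request(
    base_url: str, model: str, texts: List[str]
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style POST to <base_url>/embeddings."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer dummy",  # placeholder key, as in the example above
        },
        method="POST",
    )


def vectors_from_response(payload: dict) -> List[List[float]]:
    """Return the embeddings ordered by their index field."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


request = build_embeddings_request(
    "http://localhost:8080/v1", "BAAI/bge-m3", ["hello", "world"]
)
print(request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     vectors = vectors_from_response(json.load(response))
```

Only the base URL differs between targets in this sketch; the request body and the response parsing stay the same.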


SIE vs Infinity vs TEI summary

Use case | Recommended
Quick OpenAI drop-in, single model | Infinity
Single model, HuggingFace ecosystem | TEI
Production, multi-model, AWS/GCP | SIE
LoRA domain adaptation | SIE
Document processing + embeddings | SIE
Minimal devops, just need it working | Infinity or TEI

Frequently asked questions

Is Infinity actively maintained? Yes. Infinity is actively developed and has a growing community. It’s a legitimate production choice for single-model embedding serving.

Does SIE support the OpenAI embeddings API format? Yes. SIE exposes an OpenAI-compatible /v1/embeddings endpoint, so you can use it as a drop-in replacement without changing OpenAI client code.

Can I run SIE and Infinity in the same pipeline? In theory yes, but in practice you’d choose one. Both solve the same problem: self-hosted GPU inference for embedding models.

