
What is a LoRA Adapter?

A LoRA (Low-Rank Adaptation) adapter is a lightweight set of trainable weight matrices added to specific layers of a pre-trained neural network. During fine-tuning, only the LoRA weights are updated; the base model weights remain frozen. This reduces the number of trainable parameters by 100-1000x compared to full fine-tuning, making domain adaptation practical without large compute budgets.


Why does LoRA matter for inference?

LoRA solves a key problem in deploying embedding models: general-purpose models trained on broad data underperform on specialised domains (legal, medical, financial, code). Full fine-tuning is expensive. It requires updating hundreds of millions of parameters and storing a complete copy of the model for each domain.

LoRA adapters are small (typically 10-100MB vs 1-4GB for a full model) and can be swapped at runtime. This means a single base model can serve multiple domains by loading the appropriate adapter, without restarting the inference server.

SIE supports LoRA hot-loading: swap adapters between requests with zero downtime.


How does LoRA work?

A standard neural network weight matrix W has dimensions d × k. Full fine-tuning updates every element of W, which means d × k trainable parameters.

LoRA instead decomposes the weight update into two low-rank matrices:

W' = W + ΔW = W + BA

Where:

  • B has dimensions d × r
  • A has dimensions r × k
  • r is the rank (typically 4-64, much smaller than d or k)

During fine-tuning, only A and B are trained. The original W is frozen.
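The decomposition above can be sketched in a few lines of plain Python. This is a toy illustration, not a production implementation: the sizes are made up, matrices are lists of rows so the snippet runs without dependencies, and zero-initialising B is the common convention from the original LoRA paper rather than something stated here. With B at zero the adapted layer starts out identical to the base layer.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def lora_forward(W, A, B, x):
    """Compute (W + BA) x as W x + B (A x), without forming the d x k update."""
    return add(matmul(W, x), matmul(B, matmul(A, x)))

random.seed(0)
d, k, r = 6, 5, 2  # toy sizes; real layers have d and k in the hundreds or thousands

W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen base weights, d x k
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, r x k
B = [[0.0] * r for _ in range(d)]                                  # trainable, d x r, starts at zero
x = [[random.gauss(0, 1)] for _ in range(k)]                       # one input column vector, k x 1

base_out = matmul(W, x)
adapted_out = lora_forward(W, A, B, x)

# Before any training step, B = 0 so the adapter contributes nothing.
print(adapted_out == base_out)
```

Computing B(Ax) instead of (BA)x keeps the extra work proportional to r × (d + k) rather than d × k.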

Parameters saved = d×k − (d×r + r×k) = d×k − r×(d+k)

For a weight matrix of 768×768 with rank r=16: full fine-tuning = 589,824 parameters; LoRA = 24,576 parameters, a 24× reduction.
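The arithmetic above is easy to reproduce for other shapes. The helper below is a small sketch; the function name `lora_param_counts` is made up for this example.

```python
def lora_param_counts(d: int, k: int, r: int) -> dict:
    """Compare trainable parameters for full fine-tuning and LoRA on one d x k matrix."""
    full = d * k        # every element of W is trainable
    lora = r * (d + k)  # B is d x r, A is r x k
    return {
        "full": full,
        "lora": lora,
        "saved": full - lora,
        "reduction": full / lora,
    }

counts = lora_param_counts(d=768, k=768, r=16)
print(counts)
# {'full': 589824, 'lora': 24576, 'saved': 565248, 'reduction': 24.0}
```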


Which layers get LoRA adapters?

LoRA is typically applied to the attention weight matrices in transformer layers:

  • Query projection (Wq)
  • Key projection (Wk)
  • Value projection (Wv)
  • Output projection (Wo)

LoRA can optionally be applied to the feed-forward layers as well. Adapting more layers adds parameters and expressivity, at the cost of a larger adapter.
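To get a feel for the size trade-off, the sketch below counts adapter parameters for two choices of target layers. The encoder dimensions (hidden size 1024, feed-forward size 4096, 24 layers) and the names `ffn_up` and `ffn_down` are assumptions picked for illustration, not figures from this article, and the size estimate assumes 4 bytes per parameter.

```python
from typing import Dict, List, Tuple

def adapter_param_count(
    shapes: Dict[str, Tuple[int, int]],
    targets: List[str],
    r: int,
    num_layers: int,
) -> int:
    """Total LoRA parameters when each targeted matrix gets its own (B, A) pair."""
    per_layer = sum(r * (shapes[name][0] + shapes[name][1]) for name in targets)
    return per_layer * num_layers

# Illustrative encoder dimensions (assumed values, not taken from a specific model card).
HIDDEN, FFN, LAYERS = 1024, 4096, 24
SHAPES = {
    "Wq": (HIDDEN, HIDDEN),
    "Wk": (HIDDEN, HIDDEN),
    "Wv": (HIDDEN, HIDDEN),
    "Wo": (HIDDEN, HIDDEN),
    "ffn_up": (FFN, HIDDEN),
    "ffn_down": (HIDDEN, FFN),
}

attention_only = adapter_param_count(SHAPES, ["Wq", "Wk", "Wv", "Wo"], r=16, num_layers=LAYERS)
with_ffn = adapter_param_count(SHAPES, list(SHAPES), r=16, num_layers=LAYERS)

for label, n in [("attention only", attention_only), ("attention + feed-forward", with_ffn)]:
    print(f"{label}: {n:,} parameters, about {n * 4 / 1e6:.1f} MB at 4 bytes each")
```

Under these assumptions, adding the feed-forward matrices roughly doubles the adapter while it stays a small fraction of the base model.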


How do you use LoRA with SIE?

SIE supports LoRA hot-loading, so you can apply a domain-specific adapter at inference time:

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Placeholder input texts; replace with your own documents
documents = [
    "The lessee shall indemnify the lessor against all claims.",
    "The patient presented with acute chest pain.",
]
items = [Item(text=d) for d in documents]

# General-purpose encoding
general_vectors = [r["dense"] for r in client.encode("BAAI/bge-m3", items)]

# Legal domain encoding with LoRA adapter
legal_vectors = [
    r["dense"]
    for r in client.encode(
        "BAAI/bge-m3",
        items,
        options={"lora_id": "org/bge-m3-legal-lora"},
    )
]

# Medical domain encoding with a different adapter
medical_vectors = [
    r["dense"]
    for r in client.encode(
        "BAAI/bge-m3",
        items,
        options={"lora_id": "org/bge-m3-medical-lora"},
    )
]

Multiple adapters can be loaded simultaneously and selected per-request. The base model weights are shared; only the small adapter matrices differ.
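The idea of shared base weights with per-request adapters can be illustrated with a toy single-layer model. This is a hypothetical sketch, not SIE's implementation: the class `ToyAdapterServer`, its methods, and the tiny matrices are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]

def matmul(X: Matrix, Y: Matrix) -> Matrix:
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

@dataclass
class ToyAdapterServer:
    """Hypothetical illustration only; this is not how SIE is implemented."""
    W: Matrix  # shared base weights, stored once
    adapters: Dict[str, Tuple[Matrix, Matrix]] = field(default_factory=dict)  # id -> (B, A)

    def load_adapter(self, adapter_id: str, B: Matrix, A: Matrix) -> None:
        # Registering an adapter never touches the base weights.
        self.adapters[adapter_id] = (B, A)

    def forward(self, x: Matrix, adapter_id: Optional[str] = None) -> Matrix:
        out = matmul(self.W, x)
        if adapter_id is not None:
            B, A = self.adapters[adapter_id]
            delta = matmul(B, matmul(A, x))  # low-rank correction for this request
            out = [[o + dv for o, dv in zip(ro, rd)] for ro, rd in zip(out, delta)]
        return out

server = ToyAdapterServer(W=[[1.0, 0.0], [0.0, 1.0]])
server.load_adapter("legal", B=[[1.0], [0.0]], A=[[0.0, 2.0]])
server.load_adapter("medical", B=[[0.0], [1.0]], A=[[3.0, 0.0]])

x = [[1.0], [1.0]]
base = server.forward(x)                           # [[1.0], [1.0]]
legal = server.forward(x, adapter_id="legal")      # [[3.0], [1.0]]
medical = server.forward(x, adapter_id="medical")  # [[1.0], [4.0]]
```

Each request pays for one pass through the shared W plus a small low-rank correction, which is why several adapters can stay resident at once.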


LoRA vs full fine-tuning vs prompt tuning

| | Full fine-tuning | LoRA | Prompt tuning |
| --- | --- | --- | --- |
| Parameters updated | All (100%) | ~0.1-1% | <0.01% |
| Storage per domain | Full model copy | Small adapter | Tiny prompt |
| Quality | Highest | Near-full | Lower |
| Training cost | High | Low | Lowest |
| Inference cost | Normal | Normal + tiny overhead | Normal |
| Hot-swap at runtime | | ✓ (SIE) | |

For most domain adaptation use cases, LoRA provides the best accuracy-cost trade-off.


Rank selection: how do you choose r?

The rank r controls the adapter’s capacity:

| Rank | Parameters | When to use |
| --- | --- | --- |
| 4-8 | Minimal | Simple style/tone adaptation |
| 16 | Low | Standard domain adaptation |
| 32 | Medium | Complex domain shift |
| 64+ | High | Approaching full fine-tune quality |

Start with r=16 for most domain adaptation tasks. If validation metrics plateau below your target, increase the rank.
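One way to turn this advice into a procedure is a small sweep that keeps raising the rank only while the validation score still improves meaningfully. The helper below is a sketch under assumptions: `choose_rank`, the `min_gain` threshold, and the candidate ranks are illustrative, and `evaluate` is a caller-supplied function that trains an adapter at a given rank and returns a higher-is-better validation metric.

```python
from typing import Callable, Sequence

def choose_rank(
    evaluate: Callable[[int], float],
    ranks: Sequence[int] = (16, 32, 64),
    min_gain: float = 0.005,
) -> int:
    """Return the last rank that improved the validation score by at least min_gain.

    evaluate(r) is assumed to train an adapter at rank r and return a
    higher-is-better validation metric; it is supplied by the caller.
    """
    best_rank = ranks[0]
    best_score = evaluate(best_rank)
    for r in ranks[1:]:
        score = evaluate(r)
        if score - best_score < min_gain:
            break  # the larger rank no longer pays for its extra parameters
        best_rank, best_score = r, score
    return best_rank
```

Stopping at the first rank that fails to help keeps the sweep cheap, at the risk of missing a gain that only appears at a much larger rank.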


Training a LoRA adapter for your domain

You need (query, positive document) pairs from your domain, the same training signal used for embedding model training:

from peft import LoraConfig, get_peft_model
from transformers import AutoModel

# Load base model
base_model = AutoModel.from_pretrained("BAAI/bge-m3")

# Apply LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query", "key", "value"],
    lora_dropout=0.1,
)
peft_model = get_peft_model(base_model, lora_config)

# Train on domain-specific (query, positive) pairs
# ... training loop ...

# Save adapter only (~50MB vs ~2GB for full model)
peft_model.save_pretrained("legal-lora-adapter/")

The adapter can then be loaded into SIE for hot-swap deployment.
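The training loop in the example above is left open. One common way to learn from (query, positive document) pairs is a contrastive loss with in-batch negatives, where every other document in the batch acts as a negative for a given query. The sketch below shows that loss in plain Python on precomputed vectors. It is an assumption about a typical setup, not the objective prescribed here, and the function name and temperature value are illustrative. In a real loop the vectors would come from the PEFT-wrapped model and the loss would be computed with an autograd framework.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def in_batch_contrastive_loss(queries, positives, temperature=0.05):
    """Mean cross-entropy where query i should score highest against positive i."""
    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, pj) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(q)

# Toy batch: the loss is near zero when each query matches its own positive,
# and large when the positives are swapped.
aligned = in_batch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = in_batch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```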


Frequently asked questions

Does a LoRA adapter change model inference speed?
Negligibly. The adapter matrices are small and the extra computation is minimal. SIE’s batching absorbs this overhead.

Can I combine LoRA with quantisation?
Yes. QLoRA (Quantised LoRA) quantises the frozen base model to 4-bit precision and trains LoRA adapters on top in higher precision (typically 16-bit). This is a common approach for fine-tuning large models on consumer hardware.

How much domain-specific training data do I need?
LoRA adapters can be effective with as few as a few hundred (query, document) pairs. More data helps, but the low parameter count means LoRA is significantly less data-hungry than full fine-tuning.

