---
title: LanceDB
description: Use SIE as the embedding, reranking, and extraction engine for LanceDB.
canonical_url: https://superlinked.com/docs/integrations/lancedb
last_updated: 2026-05-20
---

The `sie-lancedb` package provides LanceDB-native embedding functions, a reranker for hybrid search, and an entity extractor for table enrichment (Python). Embeddings are computed automatically on `table.add()` and `table.search()`, so no manual encoding is needed.

**How it works:** You use SIE as an embedding function with LanceDB's schema helpers. LanceDB handles the rest - calling SIE on insert and query, and persisting the embedding config in table metadata.

## Installation

#### Python

```bash
pip install sie-lancedb
```

This installs `sie-sdk`, `lancedb` (v0.17+), `pylance`, and `pyarrow` as dependencies.

#### TypeScript

```bash
pnpm add @superlinked/sie-lancedb @lancedb/lancedb
```

## Start the Server

Source: [packages/sie_server/src/sie_server/cli.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/cli.py)

```bash
# SIE server (CPU image)
docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default

# Or with GPU
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default
```

## Embedding Function

Source: [integrations/sie_lancedb/src/sie_lancedb/embeddings.py](https://github.com/superlinked/sie/blob/main/integrations/sie_lancedb/src/sie_lancedb/embeddings.py)

`SIEEmbeddingFunction` is registered as `"sie"` in LanceDB's embedding function registry. Define your schema once, and embeddings are computed automatically on insert and search.

```python
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector
import sie_lancedb  # registers "sie" and "sie-multivector"

sie = get_registry().get("sie").create(
    model="BAAI/bge-m3",
    base_url="http://localhost:8080",
)

class Documents(LanceModel):
    text: str = sie.SourceField()
    vector: Vector(sie.ndims()) = sie.VectorField()

db = lancedb.connect("~/.lancedb")
table = db.create_table("docs", schema=Documents, mode="overwrite")

# Embeddings computed automatically
table.add([
    {"text": "Machine learning is a subset of AI."},
    {"text": "Neural networks use multiple layers."},
    {"text": "Python is popular for ML development."},
])

# Query embedding computed automatically
results = table.search("What is deep learning?").limit(3).to_list()
for r in results:
    print(r["text"])
```

Any model SIE supports works - just change the `model` parameter:

```python
sie = get_registry().get("sie").create(model="NovaSearch/stella_en_400M_v5")
sie = get_registry().get("sie").create(model="nomic-ai/nomic-embed-text-v2-moe")
```

See the [Model Catalog](/models#task=encode) for all 85+ supported models.

### Configuration Options

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `base_url` | `str` | `http://localhost:8080` | SIE server URL |
| `model` | `str` | `BAAI/bge-m3` | Model to use for embeddings ([catalog](/models#task=encode)) |
| `instruction` | `str` | `None` | Instruction prefix for instruction-tuned models (e.g., E5) |
| `output_dtype` | `str` | `None` | Output data type (`float32`, `float16`, `int8`, `binary`) |
| `gpu` | `str` | `None` | Target GPU type for routing |
| `options` | `dict` | `None` | Model-specific options |
| `timeout_s` | `float` | `180.0` | Request timeout in seconds |

## Hybrid Search with Reranker

Source: [integrations/sie_lancedb/src/sie_lancedb/rerankers.py](https://github.com/superlinked/sie/blob/main/integrations/sie_lancedb/src/sie_lancedb/rerankers.py)

`SIEReranker` plugs into LanceDB's hybrid search pipeline. It uses SIE's cross-encoder `score()` to rerank combined vector + full-text search results.

```python
from sie_lancedb import SIEReranker

# Create FTS index for hybrid search
table.create_fts_index("text", replace=True)

# Hybrid search with SIE reranking
results = (
    table.search("What is deep learning?", query_type="hybrid")
    .rerank(SIEReranker(model="jinaai/jina-reranker-v2-base-multilingual"))
    .limit(5)
    .to_list()
)

for r in results:
    print(f"{r['_relevance_score']:.3f}  {r['text']}")
```

The reranker also works with pure vector or pure FTS search via `.rerank()`.
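Conceptually, reranking hybrid results means merging the vector and full-text candidate lists, dropping duplicates, rescoring each candidate against the query, and sorting by the new score. The sketch below illustrates only that flow. It is not how LanceDB or `SIEReranker` is implemented: `toy_score` and `rerank` are hypothetical names, and the word-overlap scorer is a stand-in for a real cross-encoder.

```python
def toy_score(query, text):
    """Toy relevance: fraction of query words found in the text (NOT a cross-encoder)."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)


def rerank(query, vector_hits, fts_hits, limit=5):
    """Merge two candidate lists, deduplicate, rescore, and keep the top `limit`."""
    # dict.fromkeys drops duplicates while keeping first-seen order.
    candidates = list(dict.fromkeys(vector_hits + fts_hits))
    scored = [(toy_score(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


# Hypothetical candidates, as if returned by a vector search and an FTS search.
vector_hits = ["neural networks use multiple layers", "python is popular for ml"]
fts_hits = ["deep learning uses neural networks", "neural networks use multiple layers"]

ranked = rerank("deep learning networks", vector_hits, fts_hits, limit=2)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

In the real pipeline the scoring step is delegated to the reranking model on the SIE server, so result quality depends on the model rather than on word overlap.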

## Entity Extraction

Source: [integrations/sie_lancedb/src/sie_lancedb/extractors.py](https://github.com/superlinked/sie/blob/main/integrations/sie_lancedb/src/sie_lancedb/extractors.py)

`SIEExtractor` adds entity extraction to LanceDB's data enrichment workflows. Extract entities from a text column and merge the results back as a structured Arrow column - enabling filtered search on extracted entities.

```python
from sie_lancedb import SIEExtractor

extractor = SIEExtractor(
    base_url="http://localhost:8080",
    model="urchade/gliner_multi-v2.1",
)

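# Enrich the table: reads text, extracts entities, merges back
# Note: this call passes id_column="id", so the table needs an `id` column;
# the `Documents` schema shown earlier defines only `text` and `vector`.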
extractor.enrich_table(
    table,
    source_column="text",
    target_column="entities",
    labels=["person", "technology", "organization"],
    id_column="id",
)
```

The `entities` column stores structured Arrow data (`list<struct<text, label, score, start, end, bbox>>`) so you can filter on extracted entities in queries.

For manual control, use `extract()` directly:

```python
entities = extractor.extract(
    ["Tim Cook leads Apple Inc.", "Elon Musk founded SpaceX."],
    labels=["person", "organization"],
)
# [[{"text": "Tim Cook", "label": "person", "score": 0.98, ...}, ...], ...]
```
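Because `extract()` returns one list of entity dicts per input text, ordinary Python is enough for post-processing. The sketch below assumes the output shape shown in the comment above; the sample scores and offsets are made up for illustration, and `entities_with_label` is a hypothetical helper rather than part of `sie-lancedb`.

```python
# Sample data in the shape shown above (values are illustrative, not real model output).
extracted = [
    [
        {"text": "Tim Cook", "label": "person", "score": 0.98, "start": 0, "end": 8},
        {"text": "Apple Inc.", "label": "organization", "score": 0.95, "start": 15, "end": 25},
    ],
    [
        {"text": "Elon Musk", "label": "person", "score": 0.97, "start": 0, "end": 9},
        {"text": "SpaceX", "label": "organization", "score": 0.41, "start": 18, "end": 24},
    ],
]


def entities_with_label(doc_entities, label, min_score=0.5):
    """Return entity texts for one document that match a label and clear a score threshold."""
    return [
        entity["text"]
        for entity in doc_entities
        if entity["label"] == label and entity["score"] >= min_score
    ]


# One result list per input document; low-confidence hits are dropped.
organizations = [entities_with_label(doc, "organization") for doc in extracted]
print(organizations)
```

A score threshold like `min_score` is a common way to trade recall for precision; a suitable value depends on the extraction model and the labels used.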

## Multi-Vector (ColBERT)

Source: [integrations/sie_lancedb/src/sie_lancedb/embeddings.py](https://github.com/superlinked/sie/blob/main/integrations/sie_lancedb/src/sie_lancedb/embeddings.py)

`SIEMultiVectorEmbeddingFunction` (registered as `"sie-multivector"`) works with LanceDB's native `MultiVector` type and MaxSim scoring for ColBERT and ColPali models.

```python
from lancedb.pydantic import MultiVector

sie_colbert = get_registry().get("sie-multivector").create(
    model="jinaai/jina-colbert-v2",
    base_url="http://localhost:8080",
)

class ColBERTDocs(LanceModel):
    text: str = sie_colbert.SourceField()
    vector: MultiVector(sie_colbert.ndims()) = sie_colbert.VectorField()

table = db.create_table("colbert_docs", schema=ColBERTDocs, mode="overwrite")
table.add([{"text": "Machine learning is a subset of AI."}])

# MaxSim search - query and document multi-vectors are compared token-by-token
results = table.search("What is ML?").limit(5).to_list()
```
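For intuition, MaxSim takes each query token vector, finds its best match among the document's token vectors, and sums those best matches. The following is a minimal pure-Python sketch of that scoring rule, not LanceDB's implementation. It assumes dot-product similarity, and the two-dimensional vectors are toy values chosen for readability.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs, doc_vecs):
    """Sum, over query tokens, of the highest similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy multi-vectors: one row per token (real models emit far higher dimensions).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, 1.0], [0.0, 0.5]]

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a, score_b)
```

Here `doc_a` scores higher because it contains a good match for both query tokens, while `doc_b` matches only the second one.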

## What's Next

- [Encode Text](/docs/encode/) - embedding API details and output types
- [Score / Rerank](/docs/score/) - cross-encoder reranking
- [Extract](/docs/extract/) - entity extraction API
- [Model Catalog](/models#task=encode) - all supported models
- [Integrations](/docs/integrations/) - all supported frameworks and vector stores
