Capability notes
[BGE-M3](/models/bge-m3) (BAAI, 568M params, 8192 token context, MIT license) is the canonical open-weight embedding model in 2026. It produces three output formats simultaneously from one forward pass: dense embeddings (1024-dim), multi-vector embeddings (ColBERT-style), and sparse lexical embeddings (BM25-equivalent) — a single model serves dense, hybrid, and sparse search.
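All three formats come out of a single `encode()` call with the FlagEmbedding reference library — a minimal sketch (output key names follow its documented API; verify against your installed version):
```python
from FlagEmbedding import BGEM3FlagModel  # pip install FlagEmbedding

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
out = model.encode(
    ["How do I rebuild an HNSW index?"],
    return_dense=True,         # 1024-dim dense vectors
    return_sparse=True,        # lexical token weights (BM25-style)
    return_colbert_vecs=True,  # per-token multi-vectors (ColBERT-style)
)
dense = out["dense_vecs"]        # shape (1, 1024)
sparse = out["lexical_weights"]  # list of {token_id: weight} dicts
multi = out["colbert_vecs"]      # list of (num_tokens, 1024) arrays
```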
MTEB Retrieval (English): BGE-M3 = 65.8, OpenAI text-embedding-3-large = 69.1, Cohere embed-v3.0 = 68.5. BGE-M3 trails the proprietary APIs by roughly 3 points on English but matches or exceeds them on multilingual — MTEB multilingual average: BGE-M3 = 62.4, OpenAI text-embedding-3-large = 58.7 (English-optimized, weaker multilingual), Cohere = 63.1. For non-English corpora, BGE-M3 is the pick: it effectively matches the best API while remaining self-hostable.
Dimension tradeoffs: 1024-dim offers best recall-at-100. For lower-dimensional indexes (384-dim, 768-dim), Matryoshka Representation Learning (MRL) or PCA preserves 97–99% of retrieval at 384-dim — 2.7× storage/speed improvement for 1–3% recall drop. Rule: 1024-dim if recall matters; MRL-reduced 384-dim if index size or query throughput is the bottleneck.
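A sketch of the truncate-and-renormalize reduction — only meaningful for MRL-trained models; for models without Matryoshka training, fit PCA on a corpus sample instead:
```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int = 384) -> np.ndarray:
    """Keep the first `dim` components, then re-normalize so cosine
    similarity still behaves. Valid only for models trained with a
    Matryoshka objective."""
    reduced = vec[:dim]
    return reduced / np.linalg.norm(reduced)

full = np.random.randn(1024).astype(np.float32)  # stand-in for a real 1024-dim embedding
small = truncate_embedding(full, 384)            # ~2.7x smaller index entry
```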
[Nomic-embed-text-v1.5](/models/nomic-embed-text-v1.5) (137M params) is the efficiency alternative — 30–50% lower MTEB retrieval scores, but 4× faster throughput and 3× smaller. The right choice for RAG corpora under 10,000 documents, where the generating LLM compensates for retrieval quality.
For code embeddings, voyage-code-3 (proprietary) leads. [CodeGemma 7B](/models/codegemma-7b) repurposed as embedder produces reasonable code similarity embeddings (MTEB code retrieval ~55 vs voyage-code-3 at ~68) — but dedicated code embedders remain API-dominated.
The embedding landscape is consolidating around BGE-M3 for retrieval + BGE Reranker V2 M3 for reranking — the canonical two-model stack for open-weight RAG.
If you just want to try this
Lowest-friction path to a working setup.
Install [Ollama](/tools/ollama) and pull nomic-embed-text — the simplest path to local embeddings:
```bash
ollama pull nomic-embed-text   # downloads the model (requires the Ollama service, which the installer starts)
ollama serve                   # only needed if the service is not already running
```
The embedding API runs on port 11434. From any HTTP client:
```bash
curl http://localhost:11434/api/embeddings -d '{"model": "nomic-embed-text", "prompt": "Your document text."}'
```
Returns a 768-dim float array. Zero config, multi-platform, runs on CPU or GPU automatically.
For a Python RAG pipeline:
```bash
pip install ollama chromadb
```
```python
import ollama, chromadb

client = chromadb.PersistentClient(path="./my_db")     # persistent local vector store
collection = client.get_or_create_collection("docs")

# Embed with Ollama, store the text alongside its vector, then query by embedding
emb = ollama.embed(model="nomic-embed-text", input="Your text")["embeddings"][0]
collection.add(documents=["Your text"], embeddings=[emb], ids=["doc1"])
results = collection.query(query_embeddings=[emb], n_results=5)
```
For [BGE-M3](/models/bge-m3) quality, use [Text Embeddings Inference (TEI)](/tools/text-embeddings-inference) — one Docker command:
```bash
docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-m3
```
Then POST to `http://localhost:8080/embed` with `{"inputs": "Your text."}`.
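The same endpoint from Python, for use inside an indexing script (assumes the container above is running locally):
```python
import requests

resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": ["First document.", "Second document."]},  # a string or a list of strings
    timeout=30,
)
vectors = resp.json()  # list of 1024-dim float lists for BGE-M3
```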
Hardware: nomic-embed-text on CPU at 50–100 docs/sec. BGE-M3 on CPU at 20–40 docs/sec. Any 4 GB+ GPU accelerates to 500–2,000+ docs/sec. A $300 laptop CPU runs production-quality embeddings.
For production deployment
Operator-grade recommendation.
Production embedding serving uses [Text Embeddings Inference (TEI)](/tools/text-embeddings-inference) with [BGE-M3](/models/bge-m3). TEI is a purpose-built embedding server (Rust/Python) optimized for throughput — batching, tokenization, pooling, and normalization handled internally.
**Throughput.** TEI + BGE-M3 on [RTX 4090](/hardware/rtx-4090): 2,500–3,500 docs/sec at batch=32, 1024-dim. On [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb): 1,500–2,200 docs/sec. On [L40S](/hardware/nvidia-l40s) (48 GB): 3,500–5,000 docs/sec at batch=64. On CPU (Ryzen 9): 20–40 docs/sec — sufficient for indexing but not interactive retrieval at scale.
**Vector database integration.** [Qdrant](/tools/qdrant): highest performance for filtered vector search. HNSW index, payload indexing, scalar quantization for 4–8× storage reduction. 10,000–50,000 queries/sec on a single server. Use for production search with complex metadata. [pgvector](/tools/pgvector): embeddings alongside relational data in Postgres — no separate vector DB needed. 500–5,000 queries/sec with IVFFlat. Use when the data already lives in Postgres. [Weaviate](/tools/weaviate): native multi-tenancy, GraphQL API, hybrid search (dense + sparse + BM25). Use when you need those capabilities out of the box.
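A minimal Qdrant sketch (the production-search option above) combining scalar quantization with a payload filter — the collection name and `lang` field are illustrative; API per the qdrant-client Python package:
```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="docs",
    vectors_config=models.VectorParams(size=1024, distance=models.Distance.COSINE),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(type=models.ScalarType.INT8, always_ram=True)
    ),
)

query_vector = [0.0] * 1024  # replace with a real query embedding from TEI
hits = client.search(
    collection_name="docs",
    query_vector=query_vector,
    query_filter=models.Filter(
        must=[models.FieldCondition(key="lang", match=models.MatchValue(value="fr"))]
    ),
    limit=100,
)
```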
**Indexing pipeline.** 1M documents (500 tokens avg): embed via TEI → store in vector DB → build HNSW index (M=16, efConstruction=200). Indexing time: ~5 minutes at 3,500 docs/sec on [RTX 4090](/hardware/rtx-4090). Index memory: 1M × 1024-dim × 4 bytes = 4.1 GB + HNSW overhead = ~8 GB total — fits in 16 GB+ RAM.
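The loop itself is just batch-embed then upsert — a sketch reusing the `docs` collection above (pass `hnsw_config=models.HnswConfigDiff(m=16, ef_construct=200)` at creation time to match the parameters named here):
```python
import requests
from qdrant_client import QdrantClient, models

TEI_URL = "http://localhost:8080/embed"  # TEI container serving BGE-M3
client = QdrantClient(url="http://localhost:6333")

def index_documents(texts: list[str], batch_size: int = 32) -> None:
    """Embed in batches via TEI, then upsert vectors plus payload into Qdrant."""
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        vectors = requests.post(TEI_URL, json={"inputs": batch}, timeout=60).json()
        client.upsert(
            collection_name="docs",
            points=[
                models.PointStruct(id=start + i, vector=vec, payload={"text": doc})
                for i, (doc, vec) in enumerate(zip(batch, vectors))
            ],
        )
```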
**Query latency budget.** (1) Embed query via TEI: 5–20ms GPU, 50–200ms CPU. (2) HNSW vector search top-100: 1–10ms. (3) Metadata filter: 1–5ms. (4) Optional reranking of the top candidates via [BGE Reranker V2 M3](/models/bge-reranker-v2-m3): 10–30ms per reranked batch. Total: 30–100ms GPU, 100–500ms CPU. Target GPU for interactive use (<200ms); CPU is acceptable for batch.
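A sketch that instruments the full path — embed, search, rerank — assuming the `docs` collection from the sketches above with a `text` payload field; `FlagReranker` is the BAAI reference wrapper, swap in your own reranker client as needed:
```python
import time
import requests
from qdrant_client import QdrantClient
from FlagEmbedding import FlagReranker

client = QdrantClient(url="http://localhost:6333")
reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

def timed_query(query: str, rerank_top: int = 20):
    t0 = time.perf_counter()
    q_vec = requests.post("http://localhost:8080/embed", json={"inputs": query}, timeout=10).json()[0]
    t1 = time.perf_counter()
    hits = client.search(collection_name="docs", query_vector=q_vec, limit=100)
    t2 = time.perf_counter()
    pairs = [[query, hit.payload["text"]] for hit in hits[:rerank_top]]
    scores = reranker.compute_score(pairs)
    t3 = time.perf_counter()
    print(f"embed {1000*(t1-t0):.0f}ms | search {1000*(t2-t1):.0f}ms | rerank {1000*(t3-t2):.0f}ms")
    return sorted(zip(scores, hits[:rerank_top]), key=lambda pair: pair[0], reverse=True)
```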
**API vs self-host.** OpenAI text-embedding-3-large: $0.13/1M tokens ($0.02/1M for the -3-small tier). 1M documents/day at 500 tokens each is 500M tokens/day: ~$65/day on the large model. Self-hosting BGE-M3 on a shared [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) (~$80/month): ~$2.67/day. Self-hosting is roughly 25× cheaper at any non-trivial scale. The only reason to use embedding APIs: serverless deployment with <10M tokens/month.
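The arithmetic, for adjusting to current list prices and your own GPU cost:
```python
docs_per_day = 1_000_000
tokens_per_doc = 500
api_price_per_million = 0.13   # text-embedding-3-large; 0.02 for -3-small
gpu_cost_per_month = 80        # shared RTX 4060 Ti 16GB

api_per_day = docs_per_day * tokens_per_doc / 1e6 * api_price_per_million  # $65.00
self_host_per_day = gpu_cost_per_month / 30                                # ~$2.67
print(f"API ${api_per_day:.2f}/day vs self-host ${self_host_per_day:.2f}/day "
      f"({api_per_day / self_host_per_day:.0f}x)")                         # ~24x
```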
What breaks
Failure modes operators see in the wild.
**Embedding drift on model updates.** Symptom: after updating BGE-M3 checkpoint, existing vector indexes produce different results — some documents drop out of top-k, recall degrades. Cause: model checkpoint change shifts semantic space even if architecture is identical. Vectors from v1 and v2 are not directly comparable. Mitigation: pin model versions (checkpoint hash), never update without re-indexing entire corpus. During migration, deploy new model alongside old, run both in parallel, backfill while serving from old index.
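One way to pin the checkpoint in code — a sketch using sentence-transformers' `revision` argument (the hash shown is a placeholder for the commit you actually indexed with):
```python
from sentence_transformers import SentenceTransformer

MODEL_REVISION = "replace-with-commit-hash"  # placeholder: the HF commit hash used at index time

# Loading by revision guarantees a silent checkpoint update can't shift
# the embedding space underneath an existing index.
model = SentenceTransformer("BAAI/bge-m3", revision=MODEL_REVISION)

# Store the pin with the index so queries can refuse mismatched vectors.
index_metadata = {"embedding_model": "BAAI/bge-m3", "revision": MODEL_REVISION, "dim": 1024}
```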
**Chunk boundary artifacts.** Symptom: search matches irrelevant documents because embedding captures unrelated sections concatenated at chunk boundaries. Cause: documents split at fixed character counts without semantic boundary awareness. Mitigation: chunk at natural boundaries — paragraph breaks, section headers, sentence boundaries — not arbitrary positions. Overlap chunks by 10–20%. Use BGE-M3's 8192-token context for larger chunks with fewer boundaries.
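A boundary-aware chunker sketch — split on paragraph breaks, pack paragraphs up to a size budget, and carry the last paragraph into the next chunk for overlap (the thresholds are illustrative starting points):
```python
def chunk_by_paragraph(text: str, max_chars: int = 4000, overlap_paras: int = 1) -> list[str]:
    """Split on blank lines and pack whole paragraphs into chunks of up to
    max_chars, repeating the trailing `overlap_paras` paragraphs in the next chunk."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        if current and sum(len(p) for p in current) + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap_paras:] if overlap_paras else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```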
**Language mismatch.** Symptom: multilingual search fails — a French query retrieves English documents matching the topic but not the language. Cause: BGE-M3's multilingual space clusters related documents regardless of language, but often with lower precision than a monolingual model. Mitigation: add a language metadata filter if language matching matters. For mixed-language corpora, detect document language at index time, store it as payload, and pre-filter on it at query time.
**Dimension collapse on poorly preprocessed text.** Symptom: dissimilar documents cluster together (cosine similarity >0.95 for different documents). Cause: excessive boilerplate, markup, or noise dominates attention, producing near-identical embeddings. Mitigation: strip HTML/XML/markup, remove navigation boilerplate, extract main content. Test: embed 100 random documents, compute pairwise similarity — verify mean <0.5 with std >0.1.
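The test from the mitigation as a sketch — pass roughly 100 document embeddings as an (N, 1024) array:
```python
import numpy as np

def similarity_sanity_check(embeddings: np.ndarray) -> tuple[float, float]:
    """Mean and std of pairwise cosine similarity. Healthy corpora land
    around mean < 0.5 with std > 0.1; near-identical vectors signal
    boilerplate or markup dominating the text."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    upper = sims[np.triu_indices(len(sims), k=1)]  # off-diagonal pairs only
    return float(upper.mean()), float(upper.std())
```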
**Pooling strategy mismatch.** Symptom: BGE-M3 produces worse results with wrong pooling strategy. Cause: trained with CLS token pooling — mean pooling produces different embedding distribution, degrading accuracy by 5–15%. Mitigation: always use documented pooling. TEI handles this automatically. With sentence-transformers, verify `pooling_mode="cls"` for BGE-M3.
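A quick check of which pooling sentence-transformers actually loaded for the checkpoint:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
print(model)  # the Pooling module should report CLS-token pooling, not mean pooling
```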
Hardware guidance
Embeddings have the lowest hardware barrier. BGE-M3 (568M params) runs on CPU at production-viable throughput.
**CPU-only ($0).** BGE-M3: 20–40 docs/sec on a modern desktop, 10–20 on older CPUs — enough to index roughly 1M documents in an overnight run (~9 hours at 30 docs/sec). Nomic-embed-text: 50–100 docs/sec. Query-time retrieval at <10 queries/sec: 50–200ms per query — acceptable for non-interactive batch work.
**Entry GPU ($300–600).** [RTX 3060 12GB](/hardware/rtx-3060-12gb): 1,000–1,500 docs/sec. [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb): 1,500–2,200 docs/sec. [Intel Arc B580](/hardware/intel-arc-b580) via SYCL: 600–900 docs/sec. [Apple M4 Pro](/hardware/apple-m4-pro) Neural Engine: 300–500 docs/sec via CoreML.
**SMB tier ($1,500–2,500).** [RTX 4090](/hardware/rtx-4090): 2,500–3,500 docs/sec — throughput leader. [RTX 5080](/hardware/rtx-5080): 2,000–3,000 docs/sec. [Apple M3 Ultra](/hardware/apple-m3-ultra): 800–1,200 docs/sec via MLX.
**Enterprise ($8,000+).** [L40S](/hardware/nvidia-l40s) at 48 GB: 3,500–5,000 docs/sec at batch=64 — embedding throughput champion. Justified for 100+ queries/sec with low-latency requirements. At that volume, the orders-of-magnitude throughput advantage over CPU eliminates the need for load-balanced CPU clusters.
**Memory scaling.** BGE-M3 FP16 = ~1.1 GB VRAM. Fits in any GPU since 2016. Extra VRAM enables larger batches — 48 GB L40S batch=64 at negligible latency increase, delivering 2–3× aggregate throughput vs 12 GB GPU at batch=16. Embeddings are throughput-over-everything: more bandwidth → higher throughput; more VRAM → larger batches → higher throughput.
**Infrastructure sizing.** 1M docs/day indexing: single [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) at 1,500 docs/sec = 11 minutes. 100 queries/sec retrieval: single [RTX 4090](/hardware/rtx-4090) handles comfortably. 1,000 queries/sec: 2–3× [RTX 4090](/hardware/rtx-4090) behind load balancer, or single [L40S](/hardware/nvidia-l40s). Most organizations over-provision — a single consumer GPU handles a 100-person team's entire embedding needs.
Runtime guidance
**Text Embeddings Inference (TEI) vs Ollama vs sentence-transformers.**
[Text Embeddings Inference (TEI)](/tools/text-embeddings-inference) by Hugging Face is the production embedding server. Rust + Python, optimized exclusively for embedding throughput — no chat, no generation, no KV cache. Handles tokenization, pooling, normalization internally. Docker one-command deploy, REST API at `/embed` and `/embed_sparse`. Supports FlashAttention, dynamic batching (queues requests for 5–20ms, processes them as a batch), and Matryoshka dimension selection via query param. Production default for [BGE-M3](/models/bge-m3) and [BGE Reranker V2 M3](/models/bge-reranker-v2-m3). Tradeoff: Docker required, GPU via `--gpus` flag, model support limited to Hugging Face Transformer models (no GGUF).
[Ollama](/tools/ollama) embeddings: simplest path. `ollama pull nomic-embed-text` → `ollama serve` → `/api/embeddings`. Handles GPU detection automatically. Tradeoff: 30–50% lower throughput than TEI — Ollama optimized for generation, processes one request at a time, no dynamic batching. Right for development and single-user RAG (<1,000 docs/day). Not a production embedding server.
sentence-transformers: `pip install sentence-transformers` for programmatic Python pipelines. `model = SentenceTransformer('BAAI/bge-m3')` → `model.encode(documents)`. Right for custom preprocessing, custom pooling, and batch processing inside larger Python workflows. Tradeoff: you manage inference optimization yourself (batch size, GPU utilization, multi-GPU). With manual batching it reaches 80–90% of TEI throughput, at the cost of Python overhead and developer effort.
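A sketch of that manual-batching path (`documents` stands in for your own list of chunk strings):
```python
from sentence_transformers import SentenceTransformer

documents = ["chunk one ...", "chunk two ..."]  # your corpus chunks

model = SentenceTransformer("BAAI/bge-m3", device="cuda")  # or "cpu"
embeddings = model.encode(
    documents,
    batch_size=64,              # tune to VRAM; throughput is mostly batch-bound
    normalize_embeddings=True,  # unit vectors so dot product == cosine similarity
    show_progress_bar=True,
)
```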
**Vector database decision tree.** [Qdrant](/tools/qdrant): production pick. HNSW index with payload indexing, quantization for storage reduction, gRPC + REST API. Consistent sub-5ms query at any index size. [pgvector](/tools/pgvector): Postgres-native — zero additional infrastructure, vector + relational in one query. Degrades with index size — 50–200ms at 10M vectors vs Qdrant's 1–5ms. [Weaviate](/tools/weaviate): hybrid search (dense + sparse + BM25) out of the box, multi-tenant, GraphQL API.
**Production stack.** TEI + BGE-M3 for embedding serving + Qdrant for vector search + [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) for precision — the canonical open-weight retrieval stack.
Hardware buying guidance for Text Embeddings
RAG workflows mix embedding throughput, long-context inference, and reasonable VRAM headroom. The guides below cover the buyer decision honestly.