Capability notes
[BGE-M3](/models/bge-m3) (BAAI, 568M params, 8192 token context, MIT license) is the canonical open-weight embedding model in 2026. It produces three output formats simultaneously from one forward pass: dense embeddings (1024-dim), multi-vector embeddings (ColBERT-style), and sparse lexical embeddings (BM25-equivalent) — a single model serves dense, hybrid, and sparse search.
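All three formats come out of a single `encode()` call with the FlagEmbedding reference library — a minimal sketch (output key names follow its documented API; verify against your installed version):
```python
from FlagEmbedding import BGEM3FlagModel  # pip install FlagEmbedding

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
out = model.encode(
    ["How do I rebuild an HNSW index?"],
    return_dense=True,         # 1024-dim dense vectors
    return_sparse=True,        # lexical token weights (BM25-style)
    return_colbert_vecs=True,  # per-token multi-vectors (ColBERT-style)
)
dense = out["dense_vecs"]        # shape (1, 1024)
sparse = out["lexical_weights"]  # list of {token_id: weight} dicts
multi = out["colbert_vecs"]      # list of (num_tokens, 1024) arrays
```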
MTEB Retrieval (English): BGE-M3 = 65.8, OpenAI text-embedding-3-large = 69.1, Cohere embed-v3.0 = 68.5. BGE-M3 trails the proprietary APIs by roughly 3 points on English but matches or exceeds them on multilingual — MTEB multilingual average: BGE-M3 = 62.4, OpenAI text-embedding-3-large = 58.7 (English-optimized, weaker multilingual), Cohere = 63.1. For non-English corpora, BGE-M3 is the pick: it effectively matches the best API while remaining self-hostable.
Dimension tradeoffs: 1024-dim offers best recall-at-100. For lower-dimensional indexes (384-dim, 768-dim), Matryoshka Representation Learning (MRL) or PCA preserves 97–99% of retrieval at 384-dim — 2.7× storage/speed improvement for 1–3% recall drop. Rule: 1024-dim if recall matters; MRL-reduced 384-dim if index size or query throughput is the bottleneck.
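A sketch of the truncate-and-renormalize reduction — only meaningful for MRL-trained models; for models without Matryoshka training, fit PCA on a corpus sample instead:
```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int = 384) -> np.ndarray:
    """Keep the first `dim` components, then re-normalize so cosine
    similarity still behaves. Valid only for models trained with a
    Matryoshka objective."""
    reduced = vec[:dim]
    return reduced / np.linalg.norm(reduced)

full = np.random.randn(1024).astype(np.float32)  # stand-in for a real 1024-dim embedding
small = truncate_embedding(full, 384)            # ~2.7x smaller index entry
```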
[Nomic-embed-text-v1.5](/models/nomic-embed-text-v1.5) (137M params) is the efficiency alternative — 30–50% lower MTEB retrieval scores, but 4× faster throughput and 3× smaller. The right choice for RAG corpora under 10,000 documents, where the generating LLM compensates for retrieval quality.
For code embeddings, voyage-code-3 (proprietary) leads. [CodeGemma 7B](/models/codegemma-7b) repurposed as embedder produces reasonable code similarity embeddings (MTEB code retrieval ~55 vs voyage-code-3 at ~68) — but dedicated code embedders remain API-dominated.
The embedding landscape is consolidating around BGE-M3 for retrieval + BGE Reranker V2 M3 for reranking — the canonical two-model stack for open-weight RAG.
If you just want to try this
Lowest-friction path to a working setup.
Install [Ollama](/tools/ollama) and pull nomic-embed-text — the simplest path to local embeddings:
```bash
ollama pull nomic-embed-text   # downloads the model (requires the Ollama service, which the installer starts)
ollama serve                   # only needed if the service is not already running
```
The embedding API runs on port 11434. From any HTTP client:
```bash
curl http://localhost:11434/api/embeddings -d '{"model": "nomic-embed-text", "prompt": "Your document text."}'
```
Returns a 768-dim float array. Zero config, multi-platform, runs on CPU or GPU automatically.
For a Python RAG pipeline:
```bash
pip install ollama chromadb
```
```python
import ollama, chromadb

client = chromadb.PersistentClient(path="./my_db")     # persistent local vector store
collection = client.get_or_create_collection("docs")

# Embed with Ollama, store the text alongside its vector, then query by embedding
emb = ollama.embed(model="nomic-embed-text", input="Your text")["embeddings"][0]
collection.add(documents=["Your text"], embeddings=[emb], ids=["doc1"])
results = collection.query(query_embeddings=[emb], n_results=5)
```
For [BGE-M3](/models/bge-m3) quality, use [Text Embeddings Inference (TEI)](/tools/text-embeddings-inference) — one Docker command:
```bash
docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-m3
```
Then POST to `http://localhost:8080/embed` with `{"inputs": "Your text."}`.
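The same endpoint from Python, for use inside an indexing script (assumes the container above is running locally):
```python
import requests

resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": ["First document.", "Second document."]},  # a string or a list of strings
    timeout=30,
)
vectors = resp.json()  # list of 1024-dim float lists for BGE-M3
```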
Hardware: nomic-embed-text on CPU at 50–100 docs/sec. BGE-M3 on CPU at 20–40 docs/sec. Any 4 GB+ GPU accelerates to 500–2,000+ docs/sec. A $300 laptop CPU runs production-quality embeddings.
For production deployment
Operator-grade recommendation.
Production embedding serving uses [Text Embeddings Inference (TEI)](/tools/text-embeddings-inference) with [BGE-M3](/models/bge-m3). TEI is a purpose-built embedding server (Rust/Python) optimized for throughput — batching, tokenization, pooling, and normalization handled internally.
**Throughput.** TEI + BGE-M3 on [RTX 4090](/hardware/rtx-4090): 2,500–3,500 docs/sec at batch=32, 1024-dim. On [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb): 1,500–2,200 docs/sec. On [L40S](/hardware/nvidia-l40s) (48 GB): 3,500–5,000 docs/sec at batch=64. On CPU (Ryzen 9): 20–40 docs/sec — sufficient for indexing but not interactive retrieval at scale.
**Vector database integration.** [Qdrant](/tools/qdrant): highest performance for filtered vector search. HNSW index, payload indexing, scalar quantization for 4–8× storage reduction. 10,000–50,000 queries/sec on a single server. Use for production search with complex metadata. [pgvector](/tools/pgvector): embeddings alongside relational data in Postgres — no separate vector DB needed. 500–5,000 queries/sec with IVFFlat. Use when the data already lives in Postgres. [Weaviate](/tools/weaviate): native multi-tenancy, GraphQL API, hybrid search (dense + sparse + BM25). Use when you need those capabilities out of the box.
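A minimal Qdrant sketch (the production-search option above) combining scalar quantization with a payload filter — the collection name and `lang` field are illustrative; API per the qdrant-client Python package:
```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="docs",
    vectors_config=models.VectorParams(size=1024, distance=models.Distance.COSINE),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(type=models.ScalarType.INT8, always_ram=True)
    ),
)

query_vector = [0.0] * 1024  # replace with a real query embedding from TEI
hits = client.search(
    collection_name="docs",
    query_vector=query_vector,
    query_filter=models.Filter(
        must=[models.FieldCondition(key="lang", match=models.MatchValue(value="fr"))]
    ),
    limit=100,
)
```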
**Indexing pipeline.** 1M documents (500 tokens avg): embed via TEI → store in vector DB → build HNSW index (M=16, efConstruction=200). Indexing time: ~5 minutes at 3,500 docs/sec on [RTX 4090](/hardware/rtx-4090). Index memory: 1M × 1024-dim × 4 bytes = 4.1 GB + HNSW overhead = ~8 GB total — fits in 16 GB+ RAM.
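The loop itself is just batch-embed then upsert — a sketch reusing the `docs` collection above (pass `hnsw_config=models.HnswConfigDiff(m=16, ef_construct=200)` at creation time to match the parameters named here):
```python
import requests
from qdrant_client import QdrantClient, models

TEI_URL = "http://localhost:8080/embed"  # TEI container serving BGE-M3
client = QdrantClient(url="http://localhost:6333")

def index_documents(texts: list[str], batch_size: int = 32) -> None:
    """Embed in batches via TEI, then upsert vectors plus payload into Qdrant."""
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        vectors = requests.post(TEI_URL, json={"inputs": batch}, timeout=60).json()
        client.upsert(
            collection_name="docs",
            points=[
                models.PointStruct(id=start + i, vector=vec, payload={"text": doc})
                for i, (doc, vec) in enumerate(zip(batch, vectors))
            ],
        )
```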
**Query latency budget.** (1) Embed query via TEI: 5–20ms GPU, 50–200ms CPU. (2) HNSW vector search top-100: 1–10ms. (3) Metadata filter: 1–5ms. (4) Optional reranking of the top candidates via [BGE Reranker V2 M3](/models/bge-reranker-v2-m3): 10–30ms per reranked batch. Total: 30–100ms GPU, 100–500ms CPU. Target GPU for interactive use (<200ms); CPU is acceptable for batch.
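A sketch that instruments the full path — embed, search, rerank — assuming the `docs` collection from the sketches above with a `text` payload field; `FlagReranker` is the BAAI reference wrapper, swap in your own reranker client as needed:
```python
import time
import requests
from qdrant_client import QdrantClient
from FlagEmbedding import FlagReranker

client = QdrantClient(url="http://localhost:6333")
reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

def timed_query(query: str, rerank_top: int = 20):
    t0 = time.perf_counter()
    q_vec = requests.post("http://localhost:8080/embed", json={"inputs": query}, timeout=10).json()[0]
    t1 = time.perf_counter()
    hits = client.search(collection_name="docs", query_vector=q_vec, limit=100)
    t2 = time.perf_counter()
    pairs = [[query, hit.payload["text"]] for hit in hits[:rerank_top]]
    scores = reranker.compute_score(pairs)
    t3 = time.perf_counter()
    print(f"embed {1000*(t1-t0):.0f}ms | search {1000*(t2-t1):.0f}ms | rerank {1000*(t3-t2):.0f}ms")
    return sorted(zip(scores, hits[:rerank_top]), key=lambda pair: pair[0], reverse=True)
```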
**API vs self-host.** OpenAI text-embedding-3-large: $0.13/1M tokens ($0.02/1M for the -3-small tier). 1M documents/day at 500 tokens each is 500M tokens/day: ~$65/day on the large model. Self-hosting BGE-M3 on a shared [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) (~$80/month): ~$2.67/day. Self-hosting is roughly 25× cheaper at any non-trivial scale. The only reason to use embedding APIs: serverless deployment with <10M tokens/month.
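The arithmetic, for adjusting to current list prices and your own GPU cost:
```python
docs_per_day = 1_000_000
tokens_per_doc = 500
api_price_per_million = 0.13   # text-embedding-3-large; 0.02 for -3-small
gpu_cost_per_month = 80        # shared RTX 4060 Ti 16GB

api_per_day = docs_per_day * tokens_per_doc / 1e6 * api_price_per_million  # $65.00
self_host_per_day = gpu_cost_per_month / 30                                # ~$2.67
print(f"API ${api_per_day:.2f}/day vs self-host ${self_host_per_day:.2f}/day "
      f"({api_per_day / self_host_per_day:.0f}x)")                         # ~24x
```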
What breaks
Failure modes operators see in the wild.
**Embedding drift on model updates.** Symptom: after updating BGE-M3 checkpoint, existing vector indexes produce different results — some documents drop out of top-k, recall degrades. Cause: model checkpoint change shifts semantic space even if architecture is identical. Vectors from v1 and v2 are not directly comparable. Mitigation: pin model versions (checkpoint hash), never update without re-indexing entire corpus. During migration, deploy new model alongside old, run both in parallel, backfill while serving from old index.
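One way to pin the checkpoint in code — a sketch using sentence-transformers' `revision` argument (the hash shown is a placeholder for the commit you actually indexed with):
```python
from sentence_transformers import SentenceTransformer

MODEL_REVISION = "replace-with-commit-hash"  # placeholder: the HF commit hash used at index time

# Loading by revision guarantees a silent checkpoint update can't shift
# the embedding space underneath an existing index.
model = SentenceTransformer("BAAI/bge-m3", revision=MODEL_REVISION)

# Store the pin with the index so queries can refuse mismatched vectors.
index_metadata = {"embedding_model": "BAAI/bge-m3", "revision": MODEL_REVISION, "dim": 1024}
```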
**Chunk boundary artifacts.** Symptom: search matches irrelevant documents because embedding captures unrelated sections concatenated at chunk boundaries. Cause: documents split at fixed character counts without semantic boundary awareness. Mitigation: chunk at natural boundaries — paragraph breaks, section headers, sentence boundaries — not arbitrary positions. Overlap chunks by 10–20%. Use BGE-M3's 8192-token context for larger chunks with fewer boundaries.
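A boundary-aware chunker sketch — split on paragraph breaks, pack paragraphs up to a size budget, and carry the last paragraph into the next chunk for overlap (the thresholds are illustrative starting points):
```python
def chunk_by_paragraph(text: str, max_chars: int = 4000, overlap_paras: int = 1) -> list[str]:
    """Split on blank lines and pack whole paragraphs into chunks of up to
    max_chars, repeating the trailing `overlap_paras` paragraphs in the next chunk."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        if current and sum(len(p) for p in current) + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap_paras:] if overlap_paras else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```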
**Language mismatch.** Symptom: multilingual search fails — a French query retrieves English documents matching the topic but not the language. Cause: BGE-M3's multilingual space clusters related documents regardless of language, but often with lower precision than a monolingual model. Mitigation: add a language metadata filter if language matching matters. For mixed-language corpora, detect document language at index time, store it as payload, and pre-filter on it at query time.
**Dimension collapse on poorly preprocessed text.** Symptom: dissimilar documents cluster together (cosine similarity >0.95 for different documents). Cause: excessive boilerplate, markup, or noise dominates attention, producing near-identical embeddings. Mitigation: strip HTML/XML/markup, remove navigation boilerplate, extract main content. Test: embed 100 random documents, compute pairwise similarity — verify mean <0.5 with std >0.1.
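The test from the mitigation as a sketch — pass roughly 100 document embeddings as an (N, 1024) array:
```python
import numpy as np

def similarity_sanity_check(embeddings: np.ndarray) -> tuple[float, float]:
    """Mean and std of pairwise cosine similarity. Healthy corpora land
    around mean < 0.5 with std > 0.1; near-identical vectors signal
    boilerplate or markup dominating the text."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    upper = sims[np.triu_indices(len(sims), k=1)]  # off-diagonal pairs only
    return float(upper.mean()), float(upper.std())
```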
**Pooling strategy mismatch.** Symptom: BGE-M3 produces worse results with wrong pooling strategy. Cause: trained with CLS token pooling — mean pooling produces different embedding distribution, degrading accuracy by 5–15%. Mitigation: always use documented pooling. TEI handles this automatically. With sentence-transformers, verify `pooling_mode="cls"` for BGE-M3.
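A quick check of which pooling sentence-transformers actually loaded for the checkpoint:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
print(model)  # the Pooling module should report CLS-token pooling, not mean pooling
```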
Hardware guidance
Embeddings have the lowest hardware barrier. BGE-M3 (568M params) runs on CPU at production-viable throughput.
**CPU-only ($0).** BGE-M3: 20–40 docs/sec on a modern desktop, 10–20 on older CPUs — enough to index roughly 1M documents in an overnight run (~9 hours at 30 docs/sec). Nomic-embed-text: 50–100 docs/sec. Query-time retrieval at <10 queries/sec: 50–200ms per query — acceptable for non-interactive batch work.
**Entry GPU ($300–600).** [RTX 3060 12GB](/hardware/rtx-3060-12gb): 1,000–1,500 docs/sec. [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb): 1,500–2,200 docs/sec. [Intel Arc B580](/hardware/intel-arc-b580) via SYCL: 600–900 docs/sec. [Apple M4 Pro](/hardware/apple-m4-pro) Neural Engine: 300–500 docs/sec via CoreML.
**SMB tier ($1,500–2,500).** [RTX 4090](/hardware/rtx-4090): 2,500–3,500 docs/sec — throughput leader. [RTX 5080](/hardware/rtx-5080): 2,000–3,000 docs/sec. [Apple M3 Ultra](/hardware/apple-m3-ultra): 800–1,200 docs/sec via MLX.
**Enterprise ($8,000+).** [L40S](/hardware/nvidia-l40s) at 48 GB: 3,500–5,000 docs/sec at batch=64 — embedding throughput champion. Justified for 100+ queries/sec with low-latency requirements. At that volume, the orders-of-magnitude throughput advantage over CPU eliminates the need for load-balanced CPU clusters.
**Memory scaling.** BGE-M3 FP16 = ~1.1 GB VRAM. Fits in any GPU since 2016. Extra VRAM enables larger batches — 48 GB L40S batch=64 at negligible latency increase, delivering 2–3× aggregate throughput vs 12 GB GPU at batch=16. Embeddings are throughput-over-everything: more bandwidth → higher throughput; more VRAM → larger batches → higher throughput.
**Infrastructure sizing.** 1M docs/day indexing: single [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) at 1,500 docs/sec = 11 minutes. 100 queries/sec retrieval: single [RTX 4090](/hardware/rtx-4090) handles comfortably. 1,000 queries/sec: 2–3× [RTX 4090](/hardware/rtx-4090) behind load balancer, or single [L40S](/hardware/nvidia-l40s). Most organizations over-provision — a single consumer GPU handles a 100-person team's entire embedding needs.
Runtime guidance
**Text Embeddings Inference (TEI) vs Ollama vs sentence-transformers.**
[Text Embeddings Inference (TEI)](/tools/text-embeddings-inference) by Hugging Face is the production embedding server. Rust + Python, optimized exclusively for embedding throughput — no chat, no generation, no KV cache. Handles tokenization, pooling, normalization internally. Docker one-command deploy, REST API at `/embed` and `/embed_sparse`. Supports FlashAttention, dynamic batching (queues requests for 5–20ms, processes them as a batch), and Matryoshka dimension selection via query param. Production default for [BGE-M3](/models/bge-m3) and [BGE Reranker V2 M3](/models/bge-reranker-v2-m3). Tradeoff: Docker required, GPU via `--gpus` flag, model support limited to Hugging Face Transformer models (no GGUF).
[Ollama](/tools/ollama) embeddings: simplest path. `ollama pull nomic-embed-text` → `ollama serve` → `/api/embeddings`. Handles GPU detection automatically. Tradeoff: 30–50% lower throughput than TEI — Ollama optimized for generation, processes one request at a time, no dynamic batching. Right for development and single-user RAG (<1,000 docs/day). Not a production embedding server.
sentence-transformers: `pip install sentence-transformers` for programmatic Python pipelines. `model = SentenceTransformer('BAAI/bge-m3')` → `model.encode(documents)`. Right for custom preprocessing, custom pooling, and batch processing inside larger Python workflows. Tradeoff: you manage inference optimization yourself (batch size, GPU utilization, multi-GPU). With manual batching it reaches 80–90% of TEI throughput, at the cost of Python overhead and developer effort.
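A sketch of that manual-batching path (`documents` stands in for your own list of chunk strings):
```python
from sentence_transformers import SentenceTransformer

documents = ["chunk one ...", "chunk two ..."]  # your corpus chunks

model = SentenceTransformer("BAAI/bge-m3", device="cuda")  # or "cpu"
embeddings = model.encode(
    documents,
    batch_size=64,              # tune to VRAM; throughput is mostly batch-bound
    normalize_embeddings=True,  # unit vectors so dot product == cosine similarity
    show_progress_bar=True,
)
```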
**Vector database decision tree.** [Qdrant](/tools/qdrant): production pick. HNSW index with payload indexing, quantization for storage reduction, gRPC + REST API. Consistent sub-5ms query at any index size. [pgvector](/tools/pgvector): Postgres-native — zero additional infrastructure, vector + relational in one query. Degrades with index size — 50–200ms at 10M vectors vs Qdrant's 1–5ms. [Weaviate](/tools/weaviate): hybrid search (dense + sparse + BM25) out of the box, multi-tenant, GraphQL API.
**Production stack.** TEI + BGE-M3 for embedding serving + Qdrant for vector search + [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) for precision — the canonical open-weight retrieval stack.
Hardware buying guidance for Text Embeddings
RAG workflows mix embedding throughput, long-context inference, and reasonable VRAM headroom. The guides below cover the buyer decision honestly.