First-stage retrieval over document corpora — dense, sparse (BM25), or hybrid. Foundation for all RAG pipelines.
ollama pull nomic-embed-text (~274 MB, fast embeddings)
pip install chromadb ollama (ChromaDB is a lightweight embedded vector DB; the ollama package is the Python client used below)
import chromadb
import ollama
# Index documents (documents is your corpus of pre-chunked text strings)
client = chromadb.Client()
collection = client.create_collection("docs")
for i, doc in enumerate(documents):
    emb = ollama.embed(model="nomic-embed-text", input=doc)["embeddings"][0]
    collection.add(documents=[doc], embeddings=[emb], ids=[f"doc_{i}"])
# Retrieve: dense vector search (a hybrid dense + BM25 variant is sketched after this block)
query = "How do I set up local speech-to-text?"
query_emb = ollama.embed(model="nomic-embed-text", input=query)["embeddings"][0]
results = collection.query(query_embeddings=[query_emb], n_results=20)
# Rerank with LLM
rerank_prompt = f"Query: {query}\n\nDocuments:\n"
for i, doc in enumerate(results["documents"][0]):
rerank_prompt += f"{i+1}. {doc[:200]}\n"
rerank_prompt += "\nRank the top 5 most relevant documents by number."
ranked_docs = ollama.generate(model="llama3.1:8b", prompt=rerank_prompt)["response"]
print(ranked_docs)
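For the sparse leg mentioned at the top of this page, one minimal way to bolt BM25 onto the dense results above is reciprocal rank fusion. This is a sketch, not the page's reference implementation: it assumes the rank_bm25 package (pip install rank-bm25), naive whitespace tokenization, and the commonly used RRF constant k=60.

from rank_bm25 import BM25Okapi

# Sparse index: BM25 over whitespace-tokenized documents
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)
bm25_scores = bm25.get_scores(query.lower().split())
bm25_ranked = sorted(range(len(documents)), key=lambda i: bm25_scores[i], reverse=True)

# Dense ranking: recover document indices from the ChromaDB result ids ("doc_<i>")
dense_ranked = [int(doc_id.split("_")[1]) for doc_id in results["ids"][0]]

# Reciprocal rank fusion: a document scores higher the earlier it appears in either ranking
def reciprocal_rank_fusion(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, idx in enumerate(ranking):
            scores[idx] = scores.get(idx, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

hybrid_top = reciprocal_rank_fusion([dense_ranked, bm25_ranked])[:5]
print([documents[i][:80] for i in hybrid_top])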
Retrieval is embarrassingly lightweight. Nomic Embed Text + ChromaDB runs on CPU at 500-1,000 documents/second for indexing, sub-100ms for retrieval on 100K documents. Any $300 laptop handles retrieval for up to 1M documents. For the full pipeline (retrieve + rerank + generate answer), add a GTX 1060 6 GB (~$60) for the LLM answer generation at 40-60 tok/s. Total: ~$360. Retrieval is the most hardware-accessible RAG component — the vector DB is the bottleneck (I/O), not the GPU.
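To complete the retrieve + rerank + generate pipeline referenced above, a minimal answer-generation sketch: it reuses the query and results from the code block earlier, stuffs the top retrieved chunks into the prompt, and assumes llama3.1:8b as the generator; the prompt wording is illustrative, not a tuned template.

# Generate an answer grounded in the retrieved chunks
context = "\n\n".join(results["documents"][0][:5])
answer_prompt = (
    f"Context:\n{context}\n\n"
    f"Question: {query}\n"
    "Answer using only the context above."
)
answer = ollama.generate(model="llama3.1:8b", prompt=answer_prompt)["response"]
print(answer)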
A used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb) handles enterprise retrieval: BGE-M3 embeddings at 1,000+ docs/second, Qdrant as the production vector DB, BGE Reranker V2 M3 for precision, Llama 3.1 8B for answer generation. For a production RAG system serving 100+ concurrent users over 10M+ documents, the retrieval layer (embedding search) runs on CPU while the reranker and generator share one GPU. Total build: ~$900-1,200. The retrieval bottleneck is almost always the vector database (index build time, query latency at scale), not embedding model speed.
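A rough sketch of swapping the LLM-prompt reranker above for the cross-encoder reranker named in this tier, assuming the sentence-transformers package can load BAAI/bge-reranker-v2-m3 as a CrossEncoder; treat the model ID and library choice as one option among several (FlagEmbedding is another), not this page's prescribed stack.

from sentence_transformers import CrossEncoder

# Score each (query, document) pair with a cross-encoder; higher score means more relevant
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)
pairs = [(query, doc) for doc in results["documents"][0]]
scores = reranker.predict(pairs)

# Keep the 5 highest-scoring chunks for answer generation
top5 = sorted(zip(scores, results["documents"][0]), key=lambda x: x[0], reverse=True)[:5]
for score, doc in top5:
    print(f"{score:.3f}  {doc[:80]}")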
The mistake: Embedding every single word in a document individually (word-level embeddings), or conversely, embedding entire 50-page PDFs as one giant embedding. Why it fails: Word-level embeddings → you retrieve fragments that are meaningless without surrounding context ("the revenue was" isn't helpful without the number). Document-level embeddings → you retrieve a 50-page PDF when the answer is on page 43 — the LLM has to read 49 irrelevant pages to find it. The fix: Chunk documents into semantic units. Rule of thumb: 256-512 tokens per chunk with 10-20% overlap between chunks. This gives each chunk enough context to be meaningful, while keeping it focused enough that retrieval is precise. For structured documents (legal contracts, research papers), chunk by section (each section = one chunk). The chunk size is the single most important retrieval parameter — test 3-5 different sizes on your specific documents.
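A minimal sketch of that fixed-size chunking rule of thumb, using whitespace-split words as a rough stand-in for tokens; chunk_size=384 and overlap=48 are illustrative picks inside the 256-512 token / 10-20% overlap ranges, and raw_documents is a hypothetical list of full document texts.

def chunk_text(text, chunk_size=384, overlap=48):
    # Whitespace words approximate tokens; a real tokenizer will count somewhat differently
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# Index chunks rather than whole documents, so retrieval returns focused passages
for i, doc in enumerate(raw_documents):
    for j, chunk in enumerate(chunk_text(doc)):
        emb = ollama.embed(model="nomic-embed-text", input=chunk)["embeddings"][0]
        collection.add(documents=[chunk], embeddings=[emb], ids=[f"doc_{i}_chunk_{j}"])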
Browse all tools for runtimes that fit this workload.
Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
The errors most operators hit when running retrieval (dense + hybrid) locally. Each links to a diagnose+fix walkthrough.
Verify your specific hardware can handle retrieval (dense + hybrid) before committing money.
RAG workflows mix embedding throughput, long-context inference, and reasonable VRAM headroom. The guides below cover the buyer decision honestly.