First-stage retrieval over document corpora — dense, sparse (BM25), or hybrid. Foundation for all RAG pipelines.
ollama pull nomic-embed-text (~274 MB, fast embeddings)
pip install chromadb ollama (ChromaDB is a lightweight embedded vector DB; the ollama package is the Python client used below)
import chromadb
import ollama
# Index documents (documents is your corpus of pre-chunked text strings)
client = chromadb.Client()
collection = client.create_collection("docs")
for i, doc in enumerate(documents):
    emb = ollama.embed(model="nomic-embed-text", input=doc)["embeddings"][0]
    collection.add(documents=[doc], embeddings=[emb], ids=[f"doc_{i}"])
# Retrieve: dense vector search (a hybrid dense + BM25 variant is sketched after this block)
query = "How do I set up local speech-to-text?"
query_emb = ollama.embed(model="nomic-embed-text", input=query)["embeddings"][0]
results = collection.query(query_embeddings=[query_emb], n_results=20)
# Rerank with LLM
rerank_prompt = f"Query: {query}\n\nDocuments:\n"
for i, doc in enumerate(results["documents"][0]):
rerank_prompt += f"{i+1}. {doc[:200]}\n"
rerank_prompt += "\nRank the top 5 most relevant documents by number."
ranked_docs = ollama.generate(model="llama3.1:8b", prompt=rerank_prompt)["response"]
print(ranked_docs)
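For the sparse leg mentioned at the top of this page, one minimal way to bolt BM25 onto the dense results above is reciprocal rank fusion. This is a sketch, not the page's reference implementation: it assumes the rank_bm25 package (pip install rank-bm25), naive whitespace tokenization, and the commonly used RRF constant k=60.

from rank_bm25 import BM25Okapi

# Sparse index: BM25 over whitespace-tokenized documents
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)
bm25_scores = bm25.get_scores(query.lower().split())
bm25_ranked = sorted(range(len(documents)), key=lambda i: bm25_scores[i], reverse=True)

# Dense ranking: recover document indices from the ChromaDB result ids ("doc_<i>")
dense_ranked = [int(doc_id.split("_")[1]) for doc_id in results["ids"][0]]

# Reciprocal rank fusion: a document scores higher the earlier it appears in either ranking
def reciprocal_rank_fusion(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, idx in enumerate(ranking):
            scores[idx] = scores.get(idx, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

hybrid_top = reciprocal_rank_fusion([dense_ranked, bm25_ranked])[:5]
print([documents[i][:80] for i in hybrid_top])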
Retrieval is embarrassingly lightweight. Nomic Embed Text + ChromaDB runs on CPU at 500-1,000 documents/second for indexing, sub-100ms for retrieval on 100K documents. Any $300 laptop handles retrieval for up to 1M documents. For the full pipeline (retrieve + rerank + generate answer), add a GTX 1060 6 GB (~$60) for the LLM answer generation at 40-60 tok/s. Total: ~$360. Retrieval is the most hardware-accessible RAG component — the vector DB is the bottleneck (I/O), not the GPU.
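To complete the retrieve + rerank + generate pipeline referenced above, a minimal answer-generation sketch: it reuses the query and results from the code block earlier, stuffs the top retrieved chunks into the prompt, and assumes llama3.1:8b as the generator; the prompt wording is illustrative, not a tuned template.

# Generate an answer grounded in the retrieved chunks
context = "\n\n".join(results["documents"][0][:5])
answer_prompt = (
    f"Context:\n{context}\n\n"
    f"Question: {query}\n"
    "Answer using only the context above."
)
answer = ollama.generate(model="llama3.1:8b", prompt=answer_prompt)["response"]
print(answer)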
A used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb) handles enterprise retrieval: BGE-M3 embeddings at 1,000+ docs/second, Qdrant as the production vector DB, BGE Reranker V2 M3 for precision, Llama 3.1 8B for answer generation. For a production RAG system serving 100+ concurrent users over 10M+ documents, the retrieval layer (embedding search) runs on CPU while the reranker and generator share one GPU. Total build: ~$900-1,200. The retrieval bottleneck is almost always the vector database (index build time, query latency at scale), not embedding model speed.
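A rough sketch of swapping the LLM-prompt reranker above for the cross-encoder reranker named in this tier, assuming the sentence-transformers package can load BAAI/bge-reranker-v2-m3 as a CrossEncoder; treat the model ID and library choice as one option among several (FlagEmbedding is another), not this page's prescribed stack.

from sentence_transformers import CrossEncoder

# Score each (query, document) pair with a cross-encoder; higher score means more relevant
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)
pairs = [(query, doc) for doc in results["documents"][0]]
scores = reranker.predict(pairs)

# Keep the 5 highest-scoring chunks for answer generation
top5 = sorted(zip(scores, results["documents"][0]), key=lambda x: x[0], reverse=True)[:5]
for score, doc in top5:
    print(f"{score:.3f}  {doc[:80]}")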
The mistake: Embedding every single word in a document individually (word-level embeddings), or conversely, embedding entire 50-page PDFs as one giant embedding. Why it fails: Word-level embeddings → you retrieve fragments that are meaningless without surrounding context ("the revenue was" isn't helpful without the number). Document-level embeddings → you retrieve a 50-page PDF when the answer is on page 43 — the LLM has to read 49 irrelevant pages to find it. The fix: Chunk documents into semantic units. Rule of thumb: 256-512 tokens per chunk with 10-20% overlap between chunks. This gives each chunk enough context to be meaningful, while keeping it focused enough that retrieval is precise. For structured documents (legal contracts, research papers), chunk by section (each section = one chunk). The chunk size is the single most important retrieval parameter — test 3-5 different sizes on your specific documents.
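A minimal sketch of that fixed-size chunking rule of thumb, using whitespace-split words as a rough stand-in for tokens; chunk_size=384 and overlap=48 are illustrative picks inside the 256-512 token / 10-20% overlap ranges, and raw_documents is a hypothetical list of full document texts.

def chunk_text(text, chunk_size=384, overlap=48):
    # Whitespace words approximate tokens; a real tokenizer will count somewhat differently
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# Index chunks rather than whole documents, so retrieval returns focused passages
for i, doc in enumerate(raw_documents):
    for j, chunk in enumerate(chunk_text(doc)):
        emb = ollama.embed(model="nomic-embed-text", input=chunk)["embeddings"][0]
        collection.add(documents=[chunk], embeddings=[emb], ids=[f"doc_{i}_chunk_{j}"])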
Browse all tools for runtimes that fit this workload.
Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
The errors most operators hit when running retrieval (dense + hybrid) locally. Each links to a diagnose+fix walkthrough.
Verify your specific hardware can handle retrieval (dense + hybrid) before committing money.
RAG workflows mix embedding throughput, long-context inference, and reasonable VRAM headroom. The guides below cover the buyer decision honestly.