RAG & Search · dense retrieval · hybrid search

Retrieval (Dense + Hybrid)

First-stage retrieval over document corpora — dense, sparse (BM25), or hybrid. Foundation for all RAG pipelines.

Setup walkthrough

  1. Install Ollama → ollama pull nomic-embed-text (~274 MB — fast embeddings).
  2. pip install chromadb ollama (lightweight embedded vector DB plus the Ollama Python client).
  3. Full retrieval pipeline:
import chromadb, ollama

# Index documents (documents is a list of text chunks; see the chunking advice below)
client = chromadb.Client()
collection = client.create_collection("docs")
for i, doc in enumerate(documents):
    emb = ollama.embed(model="nomic-embed-text", input=doc)["embeddings"][0]
    collection.add(documents=[doc], embeddings=[emb], ids=[f"doc_{i}"])

# Retrieve (dense; a hybrid BM25 fusion sketch follows the walkthrough)
query = "How do I set up local speech-to-text?"
query_emb = ollama.embed(model="nomic-embed-text", input=query)["embeddings"][0]
results = collection.query(query_embeddings=[query_emb], n_results=20)

# Rerank with LLM
rerank_prompt = f"Query: {query}\n\nDocuments:\n"
for i, doc in enumerate(results["documents"][0]):
    rerank_prompt += f"{i+1}. {doc[:200]}\n"
rerank_prompt += "\nRank the top 5 most relevant documents by number."
ranked_docs = ollama.generate(model="llama3.1:8b", prompt=rerank_prompt)["response"]
print(ranked_docs)
  4. First retrieval in <100ms for 10K documents. End-to-end (embed + retrieve + rerank + generate answer) in 2-5 seconds.
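
The walkthrough above covers the dense leg only. A minimal sketch of the sparse leg plus reciprocal rank fusion, assuming the rank_bm25 package (pip install rank-bm25) and reusing documents, query, and results from the code above; the fusion constant k=60 is the common RRF default, not a tuned value:

from rank_bm25 import BM25Okapi

# Sparse leg: BM25 over whitespace-tokenized chunks
tokenized = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized)
bm25_scores = bm25.get_scores(query.lower().split())
sparse_ranked = sorted(range(len(documents)), key=lambda i: bm25_scores[i], reverse=True)[:20]

# Dense leg: reuse the ids returned by the Chroma query above
dense_ranked = [int(doc_id.split("_")[1]) for doc_id in results["ids"][0]]

# Reciprocal rank fusion: score(d) = sum over both lists of 1 / (k + rank)
k = 60
fused = {}
for ranked in (sparse_ranked, dense_ranked):
    for rank, idx in enumerate(ranked, start=1):
        fused[idx] = fused.get(idx, 0.0) + 1.0 / (k + rank)
hybrid_top = sorted(fused, key=fused.get, reverse=True)[:5]
print([documents[i][:80] for i in hybrid_top])

The fused list can feed the same LLM rerank prompt as above if you still want that pass.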

The cheap setup

Retrieval is embarrassingly lightweight. Nomic Embed Text + ChromaDB runs on CPU at 500-1,000 documents/second for indexing, sub-100ms for retrieval on 100K documents. Any $300 laptop handles retrieval for up to 1M documents. For the full pipeline (retrieve + rerank + generate answer), add a GTX 1060 6 GB (~$60) for the LLM answer generation at 40-60 tok/s. Total: ~$360. Retrieval is the most hardware-accessible RAG component — the vector DB is the bottleneck (I/O), not the GPU.
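
A sketch of the persistent, CPU-only variant of the same stack, assuming chromadb's PersistentClient; the ./chroma_db path and the batch size of 64 are illustrative, not requirements:

import chromadb, ollama

# Persist the index to disk so it survives restarts
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("docs")

# Batch the embed calls instead of one request per document;
# ollama.embed accepts a list and returns one embedding per item
batch = 64
for start in range(0, len(documents), batch):
    batch_docs = documents[start:start + batch]
    embs = ollama.embed(model="nomic-embed-text", input=batch_docs)["embeddings"]
    collection.add(
        documents=batch_docs,
        embeddings=embs,
        ids=[f"doc_{start + j}" for j in range(len(batch_docs))],
    )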

The serious setup

Used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb) handles enterprise retrieval: BGE-M3 embeddings at 1,000+ docs/second, Qdrant for production vector DB, BGE Reranker V2 M3 for precision, Llama 3.1 8B for answer generation. For a production RAG system serving 100+ concurrent users with 10M+ documents: the retrieval layer (embedding search) runs on CPU, the reranker and generator share one GPU. Total build: ~$900-1,200. The retrieval bottleneck is almost always the vector database (index build time, query latency at scale), not the embedding model speed.
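
A compressed sketch of that stack, assuming the FlagEmbedding and qdrant-client Python packages and a Qdrant server on localhost:6333; the URL, collection name, and 50-candidate cutoff are placeholders, and answer generation stays the same ollama.generate call as in the walkthrough:

from FlagEmbedding import BGEM3FlagModel, FlagReranker
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

embedder = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)            # 1024-dim dense vectors
reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)
client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

# Index: dense vectors into Qdrant, raw text kept in the payload
dense = embedder.encode(documents)["dense_vecs"]
client.upsert(
    collection_name="docs",
    points=[PointStruct(id=i, vector=dense[i].tolist(), payload={"text": documents[i]})
            for i in range(len(documents))],
)

# Retrieve 50 candidates from the vector DB, rerank down to 5 on the GPU
query = "How do I set up local speech-to-text?"
q_vec = embedder.encode([query])["dense_vecs"][0].tolist()
hits = client.search(collection_name="docs", query_vector=q_vec, limit=50)
scores = reranker.compute_score([[query, h.payload["text"]] for h in hits])
top5 = [h.payload["text"] for _, h in sorted(zip(scores, hits), key=lambda p: p[0], reverse=True)[:5]]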

Common beginner mistake

  • The mistake: Embedding every single word in a document individually (word-level embeddings), or conversely, embedding entire 50-page PDFs as one giant embedding.
  • Why it fails: Word-level embeddings → you retrieve fragments that are meaningless without surrounding context ("the revenue was" isn't helpful without the number). Document-level embeddings → you retrieve a 50-page PDF when the answer is on page 43 — the LLM has to read 49 irrelevant pages to find it.
  • The fix: Chunk documents into semantic units. Rule of thumb: 256-512 tokens per chunk with 10-20% overlap between chunks. This gives each chunk enough context to be meaningful, while keeping it focused enough that retrieval is precise. For structured documents (legal contracts, research papers), chunk by section (each section = one chunk). The chunk size is the single most important retrieval parameter — test 3-5 different sizes on your specific documents.
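
A minimal chunker sketch along those lines, using whitespace tokens as a rough stand-in for real token counts (the embedding model's tokenizer will count somewhat differently); the 384-token size, 15% overlap, and the raw_pages variable are illustrative:

def chunk(text, size=384, overlap=0.15):
    """Split text into overlapping chunks of roughly `size` whitespace tokens."""
    words = text.split()
    step = max(1, int(size * (1 - overlap)))  # advance ~85% of a chunk each time
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + size]
        if piece:
            chunks.append(" ".join(piece))
        if start + size >= len(words):
            break
    return chunks

# raw_pages is a hypothetical list of extracted page or section strings
documents = [c for page in raw_pages for c in chunk(page)]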

Recommended setup for retrieval (dense + hybrid)

Recommended hardware
Best GPU for local AI →
All workloads ranked across VRAM tiers.
Recommended runtimes

Browse all tools for runtimes that fit this workload.

Budget build
AI PC under $1,000 →
Best GPU for this task
Best GPU for local AI →

Reality check

Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.

Common mistakes

  • Buying for spec-sheet VRAM without modeling KV cache + activation overhead (a back-of-envelope sketch follows this list)
  • Underestimating quantization quality loss below Q4
  • Skipping flash-attention support (real perf gap on long context)
  • Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)
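
For the first item, a back-of-envelope sketch using approximate Llama 3.1 8B numbers (Q4_K_M weights, fp16 KV cache, GQA with 8 KV heads, 32 layers); every constant is an assumption to swap for your own model and context length:

# Rough VRAM fit for the answer-generation model; all numbers are approximate
params_b = 8.0            # Llama 3.1 8B
bytes_per_weight = 0.6    # ~4.8 bits/weight at Q4_K_M
ctx = 8192                # context length you actually run
layers, kv_heads, head_dim, kv_bytes = 32, 8, 128, 2  # fp16 K and V

weights_gb = params_b * bytes_per_weight                            # ~4.8 GB
kv_gb = 2 * layers * kv_heads * head_dim * ctx * kv_bytes / 1e9     # ~1.1 GB
print(f"~{weights_gb + kv_gb + 0.5:.1f} GB with ~0.5 GB runtime overhead")  # ~6.4 GB: fits a 12 GB card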

What breaks first

The errors most operators hit when running retrieval (dense + hybrid) locally. Each links to a diagnose+fix walkthrough.

  • CUDA out of memory →
  • Model keeps crashing →
  • Ollama running slow →
  • llama.cpp too slow →

Before you buy

Verify your specific hardware can handle retrieval (dense + hybrid) before committing money.

  • Will it run on my hardware? →
  • Custom compatibility check →
  • GPU recommender (4 questions) →
Hardware buying guidance for Retrieval (Dense + Hybrid)

RAG workflows mix embedding throughput, long-context inference, and reasonable VRAM headroom. The guides below cover the buyer decision honestly.

  • best GPU for RAG
  • AI PC for small business

Related tasks

  • Text Embeddings
  • Semantic Search