Search by meaning rather than keyword match. Powered by embedding models + vector databases.
Run ollama pull nomic-embed-text (~274 MB) for embeddings and pip install chromadb for the vector database.

import ollama, chromadb
client = chromadb.Client()
collection = client.create_collection("search")
# Index your documents
texts = ["How to install Docker on Ubuntu", "Python async programming guide", "Best local AI models 2026"]
for i, text in enumerate(texts):
    emb = ollama.embed(model="nomic-embed-text", input=text)["embeddings"][0]
    collection.add(documents=[text], embeddings=[emb], ids=[f"doc_{i}"])
# Search semantically
query = "containerization setup linux"
query_emb = ollama.embed(model="nomic-embed-text", input=query)["embeddings"][0]
results = collection.query(query_embeddings=[query_emb], n_results=3)
print(results["documents"]) # Finds "Docker on Ubuntu" even though "Docker" isn't in the query
Semantic search is trivially cheap. Nomic Embed Text + ChromaDB runs on any $300 laptop — indexes 100K documents in 5 minutes, searches in <50ms. For a personal knowledge base, company wiki, or documentation search: $300 is all you need. For the full answer-generation pipeline (search → retrieve → LLM generates answer), add a used GTX 1060 6 GB ($60) for the LLM. Total: ~$360. Semantic search is the highest-ROI local AI task — 10 lines of Python transforms "grep" into "Google for your own documents."
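A minimal sketch of that generation step, continuing from the query above and assuming a small instruct model such as llama3.2 has been pulled locally (the model name is an assumption; any local chat model works):

# Feed the retrieved documents to a local LLM as context
context = "\n".join(results["documents"][0])
answer = ollama.chat(
    model="llama3.2",  # assumption: any small instruct model pulled locally
    messages=[{"role": "user", "content": f"Answer using only this context:\n{context}\n\nQuestion: {query}"}],
)
print(answer["message"]["content"])
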
Any RTX GPU is overkill for semantic search alone. The embedding + vector search runs on CPU. For enterprise semantic search (10M+ documents, 100+ concurrent users, permissions-aware) use: BGE-M3 embeddings (GPU-accelerated for indexing speed, CPU for search), Qdrant (distributed vector DB with filtering), BGE Reranker V2 M3 (GPU for precision), Llama 3.1 8B (GPU for answer synthesis). Compute: RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb) + Epyc/Xeon CPU server for Qdrant. Total: ~$1,500-2,500. The CPU/DB costs dominate at enterprise scale — the GPU is the cheapest component.
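The permissions-aware part is usually just a metadata filter applied at query time in the vector DB. A rough sketch against Qdrant, assuming documents were indexed with an allowed_groups payload field and a collection named wiki (both names are illustrative; exact method names vary slightly between qdrant-client versions):

from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")
hits = client.search(
    collection_name="wiki",            # illustrative collection name
    query_vector=query_emb,            # embedding of the user query (e.g. from BGE-M3)
    query_filter=Filter(must=[
        FieldCondition(key="allowed_groups", match=MatchValue(value="engineering")),  # assumed payload field
    ]),
    limit=10,
)
for hit in hits:
    print(hit.score, hit.payload)
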
The mistake: replacing all keyword search with semantic search, then wondering why searching for "error code E5001" returns documents about "server errors" instead of the exact error code. Why it fails: semantic search optimizes for meaning, not exactness. "E5001" is an arbitrary identifier, not a concept, so the embedding model matches it to documents about "errors" broadly rather than to the specific code. For exact IDs, error codes, version numbers, and proper nouns, keyword search is superior. The fix: use hybrid search, BM25 (keyword) plus embeddings (semantic). If the query contains codes, IDs, or version numbers, weight BM25 higher; if it is a natural-language question, weight embeddings higher. Or always retrieve the top 50 from both and merge with reciprocal rank fusion (sketched below). Semantic search is not a replacement for keyword search; it's a complement. Hybrid always beats either alone.
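A minimal sketch of the merge step, reciprocal rank fusion over two ranked lists of document IDs (the BM25 and embedding retrievers themselves are assumed; k=60 is the conventional constant):

def reciprocal_rank_fusion(bm25_ids, vector_ids, k=60):
    """Merge two ranked ID lists; documents ranked high in either list rise to the top."""
    scores = {}
    for ranked in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Example: "doc_7" ranks first for BM25, the semantic list pulls in related docs
merged = reciprocal_rank_fusion(["doc_7", "doc_2", "doc_9"], ["doc_2", "doc_4", "doc_7"])
print(merged)
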
Browse all tools for runtimes that fit this workload.
Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
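One worked example of the bandwidth point: at batch size 1, each generated token has to read essentially the whole model from VRAM, so decode speed is capped near memory bandwidth divided by model size. Rough numbers (assumed: an 8B model at ~4.9 GB in Q4 quantization, RTX 3060 at ~360 GB/s):

# Back-of-envelope decode ceiling: bandwidth / bytes read per token
model_gb = 4.9        # approx. size of an 8B model at Q4_K_M
bandwidth_gbs = 360   # approx. RTX 3060 12 GB memory bandwidth
print(f"~{bandwidth_gbs / model_gb:.0f} tok/s decode ceiling")  # ~73 tok/s before any overhead
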
The errors most operators hit when running semantic search locally. Each links to a diagnose+fix walkthrough.
Verify your specific hardware can handle semantic search before committing money.
RAG workflows mix embedding throughput, long-context inference, and reasonable VRAM headroom. The guides below cover the buyer decision honestly.