Search by meaning rather than keyword match. Powered by embedding models + vector databases.
Run ollama pull nomic-embed-text (~274 MB) for embeddings and pip install chromadb for the vector database.

import ollama, chromadb
client = chromadb.Client()
collection = client.create_collection("search")
# Index your documents
texts = ["How to install Docker on Ubuntu", "Python async programming guide", "Best local AI models 2026"]
for i, text in enumerate(texts):
    emb = ollama.embed(model="nomic-embed-text", input=text)["embeddings"][0]
    collection.add(documents=[text], embeddings=[emb], ids=[f"doc_{i}"])
# Search semantically
query = "containerization setup linux"
query_emb = ollama.embed(model="nomic-embed-text", input=query)["embeddings"][0]
results = collection.query(query_embeddings=[query_emb], n_results=3)
print(results["documents"]) # Finds "Docker on Ubuntu" even though "Docker" isn't in the query
Semantic search is trivially cheap. Nomic Embed Text + ChromaDB runs on any $300 laptop — indexes 100K documents in 5 minutes, searches in <50ms. For a personal knowledge base, company wiki, or documentation search: $300 is all you need. For the full answer-generation pipeline (search → retrieve → LLM generates answer), add a used GTX 1060 6 GB ($60) for the LLM. Total: ~$360. Semantic search is the highest-ROI local AI task — 10 lines of Python transforms "grep" into "Google for your own documents."
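A minimal sketch of that generation step, continuing from the query above and assuming a small instruct model such as llama3.2 has been pulled locally (the model name is an assumption; any local chat model works):

# Feed the retrieved documents to a local LLM as context
context = "\n".join(results["documents"][0])
answer = ollama.chat(
    model="llama3.2",  # assumption: any small instruct model pulled locally
    messages=[{"role": "user", "content": f"Answer using only this context:\n{context}\n\nQuestion: {query}"}],
)
print(answer["message"]["content"])
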
Any RTX GPU is overkill for semantic search alone. The embedding + vector search runs on CPU. For enterprise semantic search (10M+ documents, 100+ concurrent users, permissions-aware) use: BGE-M3 embeddings (GPU-accelerated for indexing speed, CPU for search), Qdrant (distributed vector DB with filtering), BGE Reranker V2 M3 (GPU for precision), Llama 3.1 8B (GPU for answer synthesis). Compute: RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb) + Epyc/Xeon CPU server for Qdrant. Total: ~$1,500-2,500. The CPU/DB costs dominate at enterprise scale — the GPU is the cheapest component.
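The permissions-aware part is usually just a metadata filter applied at query time in the vector DB. A rough sketch against Qdrant, assuming documents were indexed with an allowed_groups payload field and a collection named wiki (both names are illustrative; exact method names vary slightly between qdrant-client versions):

from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")
hits = client.search(
    collection_name="wiki",            # illustrative collection name
    query_vector=query_emb,            # embedding of the user query (e.g. from BGE-M3)
    query_filter=Filter(must=[
        FieldCondition(key="allowed_groups", match=MatchValue(value="engineering")),  # assumed payload field
    ]),
    limit=10,
)
for hit in hits:
    print(hit.score, hit.payload)
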
The mistake: replacing all keyword search with semantic search, then wondering why searching for "error code E5001" returns documents about "server errors" instead of the exact error code. Why it fails: semantic search optimizes for meaning, not exactness. "E5001" is an arbitrary identifier, not a concept, so the embedding model matches it to documents about "errors" broadly rather than to the specific code. For exact IDs, error codes, version numbers, and proper nouns, keyword search is superior. The fix: use hybrid search, BM25 (keyword) plus embeddings (semantic). If the query contains codes, IDs, or version numbers, weight BM25 higher; if it is a natural-language question, weight embeddings higher. Or always retrieve the top 50 from both and merge with reciprocal rank fusion (sketched below). Semantic search is not a replacement for keyword search; it's a complement. Hybrid always beats either alone.
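A minimal sketch of the merge step, reciprocal rank fusion over two ranked lists of document IDs (the BM25 and embedding retrievers themselves are assumed; k=60 is the conventional constant):

def reciprocal_rank_fusion(bm25_ids, vector_ids, k=60):
    """Merge two ranked ID lists; documents ranked high in either list rise to the top."""
    scores = {}
    for ranked in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Example: "doc_7" ranks first for BM25, the semantic list pulls in related docs
merged = reciprocal_rank_fusion(["doc_7", "doc_2", "doc_9"], ["doc_2", "doc_4", "doc_7"])
print(merged)
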
Browse all tools for runtimes that fit this workload.
Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
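One worked example of the bandwidth point: at batch size 1, each generated token has to read essentially the whole model from VRAM, so decode speed is capped near memory bandwidth divided by model size. Rough numbers (assumed: an 8B model at ~4.9 GB in Q4 quantization, RTX 3060 at ~360 GB/s):

# Back-of-envelope decode ceiling: bandwidth / bytes read per token
model_gb = 4.9        # approx. size of an 8B model at Q4_K_M
bandwidth_gbs = 360   # approx. RTX 3060 12 GB memory bandwidth
print(f"~{bandwidth_gbs / model_gb:.0f} tok/s decode ceiling")  # ~73 tok/s before any overhead
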
The errors most operators hit when running semantic search locally. Each links to a diagnose+fix walkthrough.
Verify your specific hardware can handle semantic search before committing money.
RAG workflows mix embedding throughput, long-context inference, and reasonable VRAM headroom. The guides below cover the buyer decision honestly.