Retrieval-Augmented Generation (RAG)
RAG is the pattern of retrieving relevant documents from a knowledge base and including them in the LLM's prompt so the model can ground its answer in those documents. The simplest pipeline: embed your doc chunks, embed the user's query, find the top-k nearest chunks by cosine similarity, and prepend them to the prompt.
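A minimal sketch of that pipeline, assuming a placeholder embed() function that returns a unit-normalized vector (any embedding model can slot in here; the function name and prompt template are illustrative, not from any particular library):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: call your embedding model here (BGE, E5, OpenAI, ...).
    Assumed to return a 1-D unit-normalized float vector."""
    raise NotImplementedError

def build_index(chunks: list[str]) -> np.ndarray:
    # Embed every chunk once, up front; result has shape (n_chunks, dim).
    return np.stack([embed(c) for c in chunks])

def retrieve(query: str, chunks: list[str], index: np.ndarray, k: int = 5) -> list[str]:
    q = embed(query)
    # With unit-normalized vectors, cosine similarity reduces to a dot product.
    scores = index @ q
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

def build_prompt(query: str, retrieved: list[str]) -> str:
    # Prepend the retrieved chunks so the model can ground its answer in them.
    context = "\n\n".join(retrieved)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
```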
RAG addresses three problems: knowledge cutoff (new info the model wasn't trained on), private data (docs you can't retrain on), and hallucination (model making things up). It does NOT make the model more capable — a 7B model with RAG is still a 7B model on the reasoning side.
Key design decisions: chunk size (256-1024 tokens typically), embedding model (BGE, E5, or text-embedding-3 from OpenAI), vector store (Postgres pgvector, Qdrant, Weaviate, ChromaDB), retrieval quantity (top-3 to top-20). For local-only RAG, llama.cpp + a small embedding model + SQLite-based vector store gets you started in under 100 lines of code.
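As a sketch of that local-only route, here is an SQLite-backed store with brute-force cosine search. It assumes the sentence-transformers package and the BAAI/bge-small-en-v1.5 embedding model (one possible "small embedding model"); the generation step with llama.cpp is left out:

```python
import sqlite3
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # small local embedding model
db = sqlite3.connect("rag.db")
db.execute("CREATE TABLE IF NOT EXISTS chunks (id INTEGER PRIMARY KEY, text TEXT, emb BLOB)")

def add_chunks(chunks: list[str]) -> None:
    # normalize_embeddings=True lets us use a plain dot product as cosine similarity later.
    embs = model.encode(chunks, normalize_embeddings=True)
    db.executemany(
        "INSERT INTO chunks (text, emb) VALUES (?, ?)",
        [(c, e.astype(np.float32).tobytes()) for c, e in zip(chunks, embs)],
    )
    db.commit()

def retrieve(query: str, k: int = 5) -> list[str]:
    q = model.encode([query], normalize_embeddings=True)[0]
    rows = db.execute("SELECT text, emb FROM chunks").fetchall()
    # Brute-force scan is fine for a small local corpus; a real vector store indexes this.
    scored = [(float(np.frombuffer(e, dtype=np.float32) @ q), t) for t, e in rows]
    scored.sort(reverse=True)
    return [t for _, t in scored[:k]]
```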