Persistent KV cache vs RAG — which one should I use for 'chat with my docs'?

Reviewed May 15, 2026 · 2 min read
kv-cache · rag · vllm · prefix-caching · context-length

The answer

One paragraph. No hedging beyond what the data actually warrants.

Use persistent KV cache when your docs fit in the model's context. Use RAG when they don't.

Both solve "the model needs context it wasn't trained on." They solve it differently:

Persistent KV cache (prefix caching): The model processes your docs ONCE, the attention key+value tensors get cached in GPU memory, and every subsequent question re-uses that prefill. vLLM, llama.cpp, and SGLang all support this. The latency math:

  • First request (cold): prefill cost scales linearly with input length (e.g., 32K context = 5-10s prefill on an RTX 4090)
  • Every subsequent request (warm): ~50-100ms prefill — the cache hit
  • Memory cost: KV cache size = 2 × num_layers × num_kv_heads × head_dim × tokens × bytes_per_value

For Llama 3.1 8B at 32K context: ~3-4 GB of cache. Fits comfortably on a 12-16GB card.
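The cache-size formula above is easy to sanity-check in a few lines. This sketch plugs in Llama 3.1 8B's published shape (32 layers, 8 KV heads via grouped-query attention, head_dim 128) at fp16 (2 bytes per value):

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_value: int = 2) -> int:
    # 2× for keys AND values, per layer, per KV head, per token.
    return 2 * num_layers * num_kv_heads * head_dim * tokens * bytes_per_value

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 cache.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, tokens=32_000)
print(f"{size / 2**30:.2f} GiB")  # → 3.91 GiB
```

That lands at the ~3-4 GB figure quoted above; a quantized (fp8) cache would halve it.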

RAG (retrieval-augmented generation): You build a vector index of your docs. Every question triggers: embed query → retrieve top-K chunks → stuff into prompt → generate. The latency math:

  • Embedding step: ~50-150ms (local embedder) or ~200-400ms (cloud API)
  • Vector search: ~5-20ms on a 100K-chunk index
  • Generation: full prefill of (query + retrieved chunks) — typically 4-8K tokens = 1-2s on a 4090
  • Total per question: ~1.5-2.5s end-to-end
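The embed → retrieve → stuff-prompt loop can be sketched end to end. This toy uses a bag-of-words "embedder" with cosine similarity as a stand-in for a real embedding model and vector index; the structure, not the embedding quality, is the point:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Brute-force top-K; a real index (FAISS, pgvector, ...) replaces this sort.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = ["the cache stores key value tensors",
          "retrieval picks relevant chunks",
          "llamas are domesticated camelids"]
top = retrieve("which chunks does retrieval pick", chunks)
prompt = "Context:\n" + "\n".join(top) + "\n\nQuestion: ..."
```

Every step in the latency list above maps to one call here: `embed` (50-400ms with a real model), the sort (the vector search), and the final `prompt` string (the full prefill the model pays for on every question).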

The decision rule:

| Your corpus | Pick |
| --- | --- |
| Single document (< 200K tokens) | Persistent KV cache. Faster, simpler, no retrieval drift. |
| 5-10 docs you re-read constantly | Persistent KV cache, swap cached prefixes between them. |
| Large corpus (1000+ docs) | RAG. KV cache doesn't fit. |
| Mixed: hot 5 + cold archive | Hybrid — KV cache for hot, RAG for cold. |
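The table compresses into a rough rule-of-thumb function. The thresholds (200K-token context limit, 10-doc hot set) are the article's heuristics, not hard limits:

```python
def pick_strategy(num_docs: int, total_tokens: int,
                  ctx_limit: int = 200_000) -> str:
    # Heuristic mirror of the decision table; tune ctx_limit to your model.
    if total_tokens < ctx_limit:
        if num_docs == 1:
            return "kv-cache"
        if num_docs <= 10:
            return "kv-cache (swap cached prefixes)"
    return "rag"
```

A mixed hot/cold corpus doesn't reduce to one return value — that's the hybrid case below.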

Why the approach from the r/Rag thread ("we replaced RAG with persistent KV cache") works:

  • Your application's "context" is a fixed set of code files / docs / specs that DON'T change per-query.
  • Embedding + retrieval adds latency without much quality gain when your corpus is small enough to keep warm.
  • KV cache hits beat retrieval round-trips for the latency-sensitive use cases (interactive chat, IDE-integrated agents).

Why RAG still wins for most teams:

  • Your corpus is bigger than what fits in VRAM as a cached prefix (cache size scales with both tokens and layers).
  • You need to add new documents continuously (KV cache invalidates when prefix changes).
  • You need source-level citations (RAG gives you chunk-level attribution; KV cache doesn't).
  • You're serving multi-tenant queries where each user has their own document set.

The honest middle ground: prefix-cache the system prompt + always-needed context, then RAG the corpus-specific retrievals on top. Both work in vLLM 0.20+ simultaneously.
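The hybrid needs no special wiring in vLLM: with automatic prefix caching enabled, any prompts sharing a prefix (system prompt + hot docs) reuse that KV cache, while the per-query RAG chunks vary at the end of the prompt. A minimal serve invocation (model name here is just an example):

```shell
# Enable automatic prefix caching; the first request prefills the shared
# prefix, subsequent requests with the same prefix hit the cache.
vllm serve meta-llama/Llama-3.1-8B-Instruct --enable-prefix-caching
```

The one layout rule that makes this work: put the stable content (system prompt, hot docs) first and the retrieved chunks last, since a prefix cache hit ends at the first byte that differs.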

Where we got the numbers

Prefix caching support: vLLM 0.6+ release notes (--enable-prefix-caching). KV cache math: GPT-4 architecture paper + community implementations. RAG latency numbers from AnythingLLM + Khoj benchmarks 2026.

Found this via a forum search? Bookmark the URL — we update these pages as new data lands. Have a question that should live here? Open a GitHub issue.