Prefix Caching
Prefix caching stores the KV cache from previous requests so a new request that shares a prefix (system prompt, few-shot examples, conversation history) skips the prefill cost for those tokens. vLLM, SGLang, and TGI all support it; llama.cpp added basic support in mid-2024.
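The mechanism can be sketched in a few lines. This is a toy model with hypothetical names: real engines cache KV state per fixed-size attention block and manage eviction, but the lookup-longest-prefix-then-prefill-the-rest flow is the same idea.

```python
def prefill(tokens):
    """Stand-in for the expensive prefill pass: one fake 'KV entry' per token."""
    return [f"kv({t})" for t in tokens]

class PrefixCache:
    """Toy prefix cache keyed by whole token prefixes (hypothetical design)."""

    def __init__(self):
        self.store = {}  # tuple of prefix tokens -> KV cache for that prefix

    def run(self, tokens):
        # Find the longest cached prefix of this request.
        best = 0
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self.store:
                best = n
                break
        kv = list(self.store[tuple(tokens[:best])]) if best else []
        # Only the uncached suffix pays prefill cost.
        kv += prefill(tokens[best:])
        # Cache every prefix so future requests can match partway.
        for i in range(best + 1, len(tokens) + 1):
            self.store[tuple(tokens[:i])] = kv[:i]
        return kv, len(tokens) - best  # KV cache, tokens actually prefilled

cache = PrefixCache()
system = ["sys1", "sys2", "sys3"]          # shared system prompt
_, cost1 = cache.run(system + ["hello"])   # cold: prefills all 4 tokens
_, cost2 = cache.run(system + ["again"])   # warm: system prefix is reused
```

On the second request only the one new token is prefilled; the three system-prompt tokens come from the cache.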
For chat with a long system prompt, prefix caching can cut time-to-first-token (TTFT) by 80% or more on every turn after the first. For RAG pipelines built on a shared few-shot template, the template's prefill is paid once per server lifetime instead of once per request.
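The arithmetic behind that claim is straightforward. The token counts below are illustrative, not measured:

```python
# Hypothetical chat turn: a 2,000-token system prompt plus few-shot
# examples, and 100 new tokens per user turn. With prefix caching,
# every turn after the first prefills only the new tokens.
prefix_tokens = 2000
turn_tokens = 100

cold_prefill = prefix_tokens + turn_tokens  # first turn: 2100 tokens
warm_prefill = turn_tokens                  # later turns: 100 tokens
saving = 1 - warm_prefill / cold_prefill    # fraction of prefill avoided
print(f"prefill work reduced by {saving:.0%}")  # prints "prefill work reduced by 95%"
```

Since prefill dominates TTFT for long prompts, the TTFT reduction tracks this prefill reduction closely.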
Cache hits require an exact prefix match: change a single token in the system prompt and every token after it misses. Some implementations hash fixed-size token blocks so that requests which diverge later can still reuse the shared leading blocks.
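Block-level matching can be sketched with chained hashes. This is an assumed design for illustration (block size, function names, and the chaining scheme are hypothetical), but it shows why a mid-prompt edit only invalidates blocks from the edit onward, while an edit in the first block misses everything:

```python
import hashlib

BLOCK = 4  # tokens per block (real engines use larger blocks, e.g. 16)

def block_keys(tokens):
    """Chained hash per full block: each key covers all tokens up to and
    including that block, so it matches only if the whole prefix is identical."""
    keys, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        for t in tokens[i:i + BLOCK]:
            h.update(t.encode())
        keys.append(h.copy().hexdigest())
    return keys

def shared_blocks(a, b):
    """Number of leading KV blocks request b could reuse from request a."""
    n = 0
    for x, y in zip(block_keys(a), block_keys(b)):
        if x != y:
            break
        n += 1
    return n

sys_prompt = ["s"] * 8                     # two full blocks of system prompt
req1 = sys_prompt + ["q1", "q2"]
req2 = sys_prompt + ["different"]          # same prefix, different question
edited = ["X"] + sys_prompt[1:] + ["q1"]   # one token changed at the front

shared_blocks(req1, req2)    # both system-prompt blocks reused
shared_blocks(req1, edited)  # first-block change misses everything
```

Chaining the hash (rather than hashing each block independently) is what enforces the prefix property: a block's key depends on everything before it.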
Reviewed by Fredoline Eruo.