Prefix Caching
Prefix caching stores the KV cache from previous requests so a new request that shares a prefix (system prompt, few-shot examples, conversation history) skips the prefill cost for those tokens. vLLM, SGLang, and TGI all support it; llama.cpp added basic support in mid-2024.
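The mechanism can be sketched in a few lines. This is a toy model with hypothetical names: real engines cache KV state per fixed-size attention block and manage eviction, but the lookup-longest-prefix-then-prefill-the-rest flow is the same idea.

```python
def prefill(tokens):
    """Stand-in for the expensive prefill pass: one fake 'KV entry' per token."""
    return [f"kv({t})" for t in tokens]

class PrefixCache:
    """Toy prefix cache keyed by whole token prefixes (hypothetical design)."""

    def __init__(self):
        self.store = {}  # tuple of prefix tokens -> KV cache for that prefix

    def run(self, tokens):
        # Find the longest cached prefix of this request.
        best = 0
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self.store:
                best = n
                break
        kv = list(self.store[tuple(tokens[:best])]) if best else []
        # Only the uncached suffix pays prefill cost.
        kv += prefill(tokens[best:])
        # Cache every prefix so future requests can match partway.
        for i in range(best + 1, len(tokens) + 1):
            self.store[tuple(tokens[:i])] = kv[:i]
        return kv, len(tokens) - best  # KV cache, tokens actually prefilled

cache = PrefixCache()
system = ["sys1", "sys2", "sys3"]          # shared system prompt
_, cost1 = cache.run(system + ["hello"])   # cold: prefills all 4 tokens
_, cost2 = cache.run(system + ["again"])   # warm: system prefix is reused
```

On the second request only the one new token is prefilled; the three system-prompt tokens come from the cache.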
For chat with a long system prompt, prefix caching can cut time-to-first-token (TTFT) by 80% or more on every turn after the first. For RAG pipelines built on a shared few-shot template, the template's prefill is paid once per server lifetime instead of once per request.
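The arithmetic behind that claim is straightforward. The token counts below are illustrative, not measured:

```python
# Hypothetical chat turn: a 2,000-token system prompt plus few-shot
# examples, and 100 new tokens per user turn. With prefix caching,
# every turn after the first prefills only the new tokens.
prefix_tokens = 2000
turn_tokens = 100

cold_prefill = prefix_tokens + turn_tokens  # first turn: 2100 tokens
warm_prefill = turn_tokens                  # later turns: 100 tokens
saving = 1 - warm_prefill / cold_prefill    # fraction of prefill avoided
print(f"prefill work reduced by {saving:.0%}")  # prints "prefill work reduced by 95%"
```

Since prefill dominates TTFT for long prompts, the TTFT reduction tracks this prefill reduction closely.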
Cache hits require an exact prefix match: change a single token in the system prompt and every token after it misses. Some implementations hash fixed-size token blocks so that requests which diverge later can still reuse the shared leading blocks.
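Block-level matching can be sketched with chained hashes. This is an assumed design for illustration (block size, function names, and the chaining scheme are hypothetical), but it shows why a mid-prompt edit only invalidates blocks from the edit onward, while an edit in the first block misses everything:

```python
import hashlib

BLOCK = 4  # tokens per block (real engines use larger blocks, e.g. 16)

def block_keys(tokens):
    """Chained hash per full block: each key covers all tokens up to and
    including that block, so it matches only if the whole prefix is identical."""
    keys, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        for t in tokens[i:i + BLOCK]:
            h.update(t.encode())
        keys.append(h.copy().hexdigest())
    return keys

def shared_blocks(a, b):
    """Number of leading KV blocks request b could reuse from request a."""
    n = 0
    for x, y in zip(block_keys(a), block_keys(b)):
        if x != y:
            break
        n += 1
    return n

sys_prompt = ["s"] * 8                     # two full blocks of system prompt
req1 = sys_prompt + ["q1", "q2"]
req2 = sys_prompt + ["different"]          # same prefix, different question
edited = ["X"] + sys_prompt[1:] + ["q1"]   # one token changed at the front

shared_blocks(req1, req2)    # both system-prompt blocks reused
shared_blocks(req1, edited)  # first-block change misses everything
```

Chaining the hash (rather than hashing each block independently) is what enforces the prefix property: a block's key depends on everything before it.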
Reviewed by Fredoline Eruo.