vLLM 0.7.0+ ships meaningful prefix cache improvements. On multi-tenant workloads with shared system prompts (chatbots, customer-facing API gateways with templated preambles), expect a 12-18% time-to-first-token (TTFT) reduction vs 0.6.x at the same configuration. The cache hit rate is now reported at the metrics endpoint, making it observable for production tuning.
Default behavior changed: `--enable-prefix-caching` is now opt-in for some model families that showed instability with it in 0.6.x. Re-check your serving config if you upgraded.
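If you fall into that bucket, here is a minimal sketch of explicit enablement via the offline `LLM` entrypoint (the equivalent for `vllm serve` is the `--enable-prefix-caching` flag). The model name and preamble are placeholders, not recommendations:

```python
# Sketch: explicitly enable prefix caching after upgrading to 0.7.x,
# rather than relying on 0.6.x default behavior.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder; swap in whatever you serve
    enable_prefix_caching=True,  # opt in explicitly
)

# A shared system preamble: repeated prefixes are what the cache amortizes.
preamble = "You are a support assistant for ExampleCo. Answer concisely.\n\n"
prompts = [preamble + q for q in ("How do I reset my password?",
                                  "Where is my invoice?")]

outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
for out in outputs:
    print(out.outputs[0].text.strip())
```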
▼ OPERATOR ANGLE
**Upgrade path**: pin vLLM to 0.7.2+ for stable behavior. Test prefix cache hit rate via the /metrics endpoint (see the sketch after this list); target ≥60% for shared-prompt workloads to justify the memory overhead.
**Re-check your config**: `enable_prefix_caching=True` is the new explicit engine argument (shown in the sketch above). If you're upgrading from 0.6.x and assumed it was on by default, you may have silently lost the optimization.
**Skip if**: your prompts are highly dynamic (no shared preamble) — prefix cache memory overhead exceeds the benefit.
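To make the ≥60% target checkable, here is a rough sketch that scrapes the Prometheus text output at `/metrics`. The port and the metric name are assumptions: the exact name (`vllm:gpu_prefix_cache_hit_rate` below) varies across vLLM versions, so grep your own endpoint to confirm what your build exports.

```python
# Sketch: scrape the vLLM Prometheus endpoint and check prefix cache hit rate
# against the 60% target. Port and metric name are assumptions -- verify both
# against your deployment's actual /metrics output.
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"
METRIC_NAME = "vllm:gpu_prefix_cache_hit_rate"  # assumption: name varies by version
TARGET = 0.60  # threshold from the guidance above


def prefix_cache_hit_rate(url: str = METRICS_URL) -> float | None:
    body = urllib.request.urlopen(url, timeout=5).read().decode()
    for line in body.splitlines():
        # Prometheus text format: 'name{labels} value'; skip '# HELP'/'# TYPE' lines.
        if line.startswith(METRIC_NAME):
            return float(line.rsplit(" ", 1)[-1])
    return None


if __name__ == "__main__":
    rate = prefix_cache_hit_rate()
    if rate is None:
        print(f"{METRIC_NAME} not found; check the metric name for your vLLM version")
    elif rate < TARGET:
        print(f"hit rate {rate:.0%} below {TARGET:.0%}: caching may not pay for its memory overhead")
    else:
        print(f"hit rate {rate:.0%} meets the {TARGET:.0%} target")
```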
See the [vLLM operational review](/tools/vllm) for production tuning and the [serving guidance for text-generation tasks](/tasks/text-generation).