KV Cache
The KV cache stores the key and value tensors from previous attention computations so the model doesn't recompute them for every generated token. Without it, each new token would require recomputing attention over the entire sequence, an O(n²) cost per step; with it, each new token costs roughly O(n).
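To make the mechanism concrete, here is a minimal single-head sketch in NumPy. The dimensions, random weights, and names (attention_step, Wq/Wk/Wv) are illustrative only, not any real model's; the point is that each token's K and V are projected once and then reused from the cache at every later step.

```python
import numpy as np

def attention_step(q, k_cache, v_cache):
    """One decode step: the new token's query attends over all cached keys/values."""
    scores = k_cache @ q / np.sqrt(q.shape[-1])   # (seq_len,)
    weights = np.exp(scores - scores.max())       # numerically stable softmax
    weights /= weights.sum()
    return weights @ v_cache                      # (head_dim,)

# Toy decode loop: project each token's K and V exactly once, append them
# to the cache, and reuse them at every later step instead of recomputing.
head_dim, steps = 64, 10
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((head_dim, head_dim)) * 0.02 for _ in range(3))

k_cache = np.empty((0, head_dim))
v_cache = np.empty((0, head_dim))
for _ in range(steps):
    x = rng.standard_normal(head_dim)          # hidden state of the new token
    k_cache = np.vstack([k_cache, x @ Wk])     # cache grows by one row per token
    v_cache = np.vstack([v_cache, x @ Wv])
    out = attention_step(x @ Wq, k_cache, v_cache)
```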
The catch: KV cache memory scales linearly with context length. The formula is 2 × num_layers × num_kv_heads × head_dim × context_length × bytes_per_element, where the factor of 2 accounts for storing both keys and values. For Llama 3.3 70B (80 layers, 8 KV heads, head dimension 128) at FP16, every 1K tokens of context costs about 320 MB of VRAM.
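Plugging those numbers into the formula reproduces that figure. The helper name kv_cache_bytes is ours; the shape values are Llama 3.3 70B's published configuration:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_length,
                   bytes_per_element=2):
    """KV cache footprint in bytes; the leading 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * context_length * bytes_per_element

# Llama 3.3 70B at FP16 (2 bytes/element), 1K tokens of context:
print(kv_cache_bytes(80, 8, 128, 1024) / 2**20)  # 320.0 MiB
```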
Modern models use Grouped-Query Attention (GQA), where num_kv_heads is much smaller than num_attention_heads, dramatically reducing cache size. Llama 3.1 8B has 32 attention heads but only 8 KV heads: a 4× cache reduction over older multi-head attention (MHA) architectures, which cache K and V for every head. Quantizing the KV cache to FP8 or INT4 halves or quarters it again.
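Reusing the kv_cache_bytes helper above shows the effect for a Llama 3.1 8B-shaped model (32 layers, head dimension 128) at 8K tokens of context. The MHA row is a hypothetical variant of the same model that caches all 32 heads, included only for comparison:

```python
mha     = kv_cache_bytes(32, 32, 128, 8192)                      # MHA: 32 KV heads, FP16
gqa     = kv_cache_bytes(32, 8, 128, 8192)                       # GQA: 8 KV heads, FP16
gqa_fp8 = kv_cache_bytes(32, 8, 128, 8192, bytes_per_element=1)  # GQA + FP8 cache
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB, "
      f"GQA+FP8: {gqa_fp8 / 2**30:.2f} GiB")
# -> MHA: 4.0 GiB, GQA: 1.0 GiB, GQA+FP8: 0.50 GiB
```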