KV Cache Quantization
Also known as: kv-quant, kv cache quant
KV cache quantization reduces the memory footprint of the key-value (KV) cache by storing its entries in lower-precision formats (e.g., 8-bit or 4-bit integers) instead of full 16-bit or 32-bit floats. During text generation, each new token requires the model to attend to all previous tokens' keys and values; this cache grows linearly with sequence length and can consume multiple gigabytes of VRAM. Quantizing the cache shrinks its size by 2–4×, allowing longer context windows or larger batch sizes on the same hardware, at the cost of minor accuracy loss. Operators encounter this as a runtime option in inference engines like llama.cpp, vLLM, and MLX.
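To make the scaling concrete, here is a minimal back-of-the-envelope sketch of how cache size grows with context length and shrinks with element width. The function name and the model parameters are illustrative, not tied to any particular engine.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bits_per_elem):
    """Estimate KV cache size: keys + values for every layer and every cached token."""
    return n_layers * 2 * n_kv_heads * head_dim * context_len * (bits_per_elem / 8)

# Illustrative model: 32 layers, 32 KV heads of dimension 128 (4096-dim hidden state)
for bits in (16, 8, 4):
    gib = kv_cache_bytes(32, 32, 128, 32_768, bits) / 2**30
    print(f"{bits}-bit KV cache at 32K context: {gib:.1f} GiB")
# 16-bit -> 16.0 GiB, 8-bit -> 8.0 GiB, 4-bit -> 4.0 GiB (ignoring scale/zero-point overhead)
```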
Deeper dive
The KV cache stores the keys and values from the self-attention layers of a transformer model. For a model with 32 layers, 4096 hidden dimensions, and 32 attention heads, each token adds roughly 32 × 2 × 4096 × 2 bytes (FP16) = 0.5 MB to the cache. At a 32K context, that's ~16 GB. KV cache quantization reduces each element to 8 bits (FP8 or INT8) or 4 bits (INT4), cutting memory by 2× or 4×. The quantization can be per-tensor, per-channel, or per-group, with group size (e.g., 64 or 128) affecting accuracy and speed. Some implementations use dynamic quantization (scales computed on the fly) or static quantization (pre-computed scales). The trade-off: lower precision increases perplexity slightly but enables much longer contexts. For example, a 70B model with an FP16 KV cache at 128K context would need ~80 GB for the cache alone; with 4-bit quantization, the cache shrinks to ~20 GB, small enough to fit on a single 24 GB GPU (the model weights themselves still have to be quantized or offloaded).
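The sketch below illustrates the per-group dynamic scheme described above: each group of 128 consecutive cache values gets its own scale computed on the fly, and dequantization multiplies back. The function names and the use of NumPy are illustrative, not any engine's actual kernels.

```python
import numpy as np

GROUP_SIZE = 128  # common group sizes are 64 or 128

def quantize_int8_per_group(x: np.ndarray):
    """Dynamically quantize a flat cache tensor to INT8, one scale per group."""
    groups = x.reshape(-1, GROUP_SIZE).astype(np.float32)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 127.0  # symmetric per-group scale
    scales[scales == 0] = 1.0                                    # avoid divide-by-zero
    q = np.clip(np.round(groups / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize_int8_per_group(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover an approximation of the original values."""
    return (q.astype(np.float32) * scales).reshape(-1)

# Round-trip a fake slice of a key cache and measure the error introduced.
keys = np.random.randn(4096 * GROUP_SIZE).astype(np.float32)
q, s = quantize_int8_per_group(keys)
error = np.abs(dequantize_int8_per_group(q, s) - keys).mean()
print(f"INT8 storage: {q.nbytes / keys.nbytes:.0%} of FP32, mean abs error {error:.4f}")
```

Smaller groups track outliers more closely (better accuracy) at the cost of storing more scales and doing more bookkeeping per attention step.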
Practical example
On an RTX 4090 (24 GB VRAM), running Llama 3.1 70B at Q4_K_M (~40 GB) requires offloading to system RAM, but the KV cache still resides in VRAM. Without quantization, a 32K context cache for 70B takes ~12 GB (FP16), leaving only ~12 GB for model weights—forcing heavy offload. With 4-bit KV cache quantization, the cache shrinks to ~3 GB, freeing 9 GB for model weights, reducing offload and increasing tokens/sec from ~2 to ~8.
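A quick check of that budget, using the approximate figures from the example above (illustrative arithmetic, not measurements):

```python
# Rough VRAM split on a 24 GB card for the scenario above (all figures approximate).
total_vram_gb = 24
cache_fp16_gb = 12               # 32K-token FP16 KV cache for a 70B model
cache_q4_gb = cache_fp16_gb / 4  # 4-bit quantization -> ~3 GB

for label, cache in (("FP16 cache", cache_fp16_gb), ("4-bit cache", cache_q4_gb)):
    weights_budget = total_vram_gb - cache
    print(f"{label}: {cache:.0f} GB cache, {weights_budget:.0f} GB left for weight layers")
# FP16 cache leaves ~12 GB for weights; 4-bit cache leaves ~21 GB, so fewer layers are offloaded.
```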
Workflow example
In llama.cpp, enable KV cache quantization with --cache-type-k q8_0 and --cache-type-v q8_0 (8-bit) or --cache-type-k q4_0 and --cache-type-v q4_0 (4-bit); quantizing the V cache requires flash attention to be enabled (-fa). In vLLM, set --kv-cache-dtype fp8 (or an explicit variant such as fp8_e5m2). In MLX, recent versions of mlx_lm expose a --kv-bits option for mlx_lm.generate (e.g., --kv-bits 8). Operators monitor VRAM usage with nvidia-smi or ollama ps; enabling quantization should show a drop in cache memory. For example, ./llama-cli -m model.gguf -c 32768 -fa --cache-type-k q8_0 --cache-type-v q8_0 loads a 70B model with an 8-bit KV cache, cutting the cache's VRAM footprint by ~50% compared to FP16.
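As a programmatic counterpart to the CLI flags, the sketch below sets the same option through vLLM's offline LLM API; the model name is a placeholder and the argument spelling assumes a recent vLLM release.

```python
from vllm import LLM, SamplingParams

# kv_cache_dtype="fp8" stores keys and values in 8-bit floating point instead of FP16,
# roughly halving KV cache memory. Model name below is a placeholder.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=32768,
    kv_cache_dtype="fp8",
)
out = llm.generate(
    ["Explain KV cache quantization in one sentence."],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```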