KV Cache Quantization

Also known as: kv-quant, kv cache quant

KV cache quantization reduces the memory footprint of the key-value (KV) cache by storing its entries in lower-precision formats (e.g., 8-bit or 4-bit integers) instead of full 16-bit or 32-bit floats. During text generation, each new token requires the model to attend to all previous tokens' keys and values; this cache grows linearly with sequence length and can consume multiple gigabytes of VRAM. Quantizing the cache shrinks its size by 2–4×, allowing longer context windows or larger batch sizes on the same hardware, at the cost of minor accuracy loss. Operators encounter this as a runtime option in inference engines like llama.cpp, vLLM, and MLX.

Deeper dive

The KV cache stores the keys and values from the self-attention layers of a transformer model. For a model with 32 layers, a 4096-dimensional hidden size, and 32 attention heads (roughly a 7B-class model), each token adds about 32 layers × 2 (K and V) × 4096 values × 2 bytes (FP16) ≈ 0.5 MB to the cache. At a 32K context, that's ~16 GB. KV cache quantization reduces each element to 8 bits (FP8 or INT8) or 4 bits (INT4), cutting memory by 2× or 4×. The quantization can be per-tensor, per-channel, or per-group, with the group size (e.g., 64 or 128) trading accuracy against speed. Some implementations use dynamic quantization (scales calibrated on the fly), others static quantization (pre-computed scales). The trade-off: lower precision increases perplexity slightly but enables much longer contexts. For example, a 70B model with an FP16 KV cache at 128K context needs on the order of 80 GB for the cache alone (the exact figure depends on the model's KV head count and whether it uses grouped-query attention); with 4-bit quantization that drops to ~20 GB, small enough to fit within a single 24 GB GPU's budget alongside offloaded or split weights.
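
A quick way to sanity-check these figures is to compute the cache size directly from the model's shape. The Python sketch below is a back-of-envelope calculator under the same assumptions as the example above (no grouped-query attention, so the KV width equals the hidden size); the function and variable names are illustrative, not taken from any engine's API.

  # KV cache size = layers x 2 (K and V) x KV width x bytes per element x tokens.
  # kv_width is n_kv_heads x head_dim; it equals the hidden size when there is no GQA.
  def kv_cache_bytes(n_layers, kv_width, context_len, bits_per_element):
      per_token = n_layers * 2 * kv_width * (bits_per_element / 8)
      return per_token * context_len

  GiB = 1024 ** 3
  # 7B-class model (32 layers, 4096-wide KV) at a 32K context:
  print(kv_cache_bytes(32, 4096, 32_768, 16) / GiB)  # ~16 GiB with an FP16 cache
  print(kv_cache_bytes(32, 4096, 32_768, 4) / GiB)   # ~4 GiB with a 4-bit cache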

Practical example

On an RTX 4090 (24 GB VRAM), running Llama 3.1 70B at Q4_K_M (~40 GB of weights) requires offloading part of the model to system RAM, while the KV cache for the GPU-resident layers stays in VRAM. Without quantization, a 32K-context cache for the 70B takes ~12 GB (FP16), leaving only ~12 GB for model weights and forcing heavy offload. With 4-bit KV cache quantization, the cache shrinks to ~3 GB, freeing 9 GB for model weights, reducing offload and increasing throughput from ~2 to ~8 tokens/sec.
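
The offload math is easy to write out explicitly. The sketch below simply restates the budget from this example; the 24 GB capacity and the ~12 GB / ~3 GB cache figures are the scenario's numbers, not measurements.

  # VRAM left for weights = card capacity minus KV cache footprint (figures from the example above).
  vram_gb = 24
  for label, cache_gb in [("FP16 cache", 12), ("4-bit cache", 3)]:
      weights_budget = vram_gb - cache_gb
      print(f"{label}: {weights_budget} GB of VRAM left for model weights")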

Workflow example

In llama.cpp, enable KV cache quantization with --cache-type-k q8_0 and --cache-type-v q8_0 (8-bit), or q4_0 for 4-bit; quantizing the V cache also requires flash attention (the --flash-attn flag) on builds where it is not enabled by default. In vLLM, set --kv-cache-dtype fp8 (or an explicit variant such as fp8_e5m2). In MLX, recent mlx-lm releases expose the same control as a kv-bits setting (e.g., --kv-bits 8 on the mlx_lm.generate CLI). Operators monitor VRAM usage with nvidia-smi or ollama ps; enabling quantization should show a drop in cache memory. For example, ./llama-cli -m model.gguf -c 32768 --cache-type-k q8_0 --cache-type-v q8_0 loads a 70B model with an 8-bit KV cache, roughly halving the cache's VRAM footprint compared to FP16.
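
To confirm the drop from a script rather than by eyeballing nvidia-smi, a generic NVML readout works; this is a plain GPU-memory query, not tied to any inference engine, and it assumes the nvidia-ml-py (pynvml) package is installed.

  # Report used/total VRAM on GPU 0; run once with an FP16 cache and once with a
  # quantized cache while the model is loaded, and compare the two readings.
  import pynvml

  pynvml.nvmlInit()
  handle = pynvml.nvmlDeviceGetHandleByIndex(0)
  mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
  print(f"used {mem.used / 1024**3:.1f} GiB of {mem.total / 1024**3:.1f} GiB")
  pynvml.nvmlShutdown()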

Related terms

  • KV Cache
  • Context Window
  • VRAM (Video RAM)
  • Quantization

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides
  • Best GPU for local AI →
When it doesn't work
  • Quantization quality loss →
  • GGUF tokenizer mismatch →