Transformer & LLM components

Multi-Head Latent Attention (MLA)

Also known as: latent attention, DeepSeek attention

Multi-Head Latent Attention (MLA) is the attention mechanism used in DeepSeek V2/V3 that compresses the key-value (KV) cache into a lower-dimensional latent space. Instead of storing full key and value vectors for every token, MLA stores a single compressed latent vector per token and reconstructs the keys and values on the fly during generation. This cuts KV cache memory use by well over 90% relative to standard multi-head attention while maintaining model quality. For operators, that means larger context windows or a smaller VRAM footprint when running DeepSeek models locally.

Deeper dive

Standard multi-head attention stores separate key and value vectors for every attention head, so the KV cache grows linearly with sequence length and with the number of heads. MLA instead applies a down-projection that maps each token's hidden state into a shared latent space of much lower dimension. During inference, only this latent vector is stored per token; the full keys and values are reconstructed via up-projection matrices before attention scores are computed. This reduces the per-token, per-layer KV cache from 2 * n_heads * d_head values to roughly d_latent (512 in DeepSeek V2/V3, plus a small 64-dimensional decoupled RoPE key that is also cached). For DeepSeek V2 that means 576 cached values per token per layer instead of the 32,768 a standard cache would hold across its 128 heads of dimension 128. The reconstruction adds negligible overhead (a few extra matrix multiplies) but yields substantial memory savings, making 128K+ contexts practical on far less VRAM.
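
To make the caching pattern concrete, here is a minimal NumPy sketch with toy dimensions and random weights (illustrative only, not DeepSeek's real parameters): only the down-projected latent is appended to the cache at each decode step, and the keys/values for all cached tokens are rebuilt by the up-projections whenever attention is computed.

```python
import numpy as np

# Toy illustration of MLA caching during generation (single layer, random weights).
# Dimensions and weight names are illustrative, not DeepSeek's actual configuration.
d_model, d_latent, n_heads, d_head = 1024, 64, 8, 64

rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02            # compress to latent
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # rebuild keys
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # rebuild values

latent_cache = []  # one small vector per generated token

def decode_step(hidden_state):
    # Store only the compressed latent for this token.
    latent_cache.append(hidden_state @ W_down)
    # Reconstruct full keys/values for all cached tokens when attention runs.
    latents = np.stack(latent_cache)    # (seq, d_latent)
    keys = latents @ W_up_k             # (seq, n_heads * d_head)
    values = latents @ W_up_v           # (seq, n_heads * d_head)
    return keys, values

for _ in range(4):
    k, v = decode_step(rng.standard_normal(d_model))

print("cached per token:", latent_cache[0].shape)   # (64,) -> only the latent is stored
print("reconstructed K :", k.shape)                  # (4, 512) rebuilt on the fly
```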

Practical example

DeepSeek V2 has 128 attention heads with a head dimension of 128, so a standard KV cache would hold 2 * 128 * 128 = 32,768 values per token per layer. With MLA it caches the 512-dimensional latent plus a 64-dimensional decoupled RoPE key, i.e. 576 values per token per layer, roughly a 57× reduction. At FP16 across the model's 60 layers, a 128K context works out to roughly 500 GB of KV cache with standard attention versus about 9 GB with MLA. No single GPU could hold the standard cache; with MLA the cache alone fits inside a 24 GB card like the RTX 4090, though the full 236B-parameter model still needs far more memory for its weights, so single-card use in practice means the smaller DeepSeek-V2-Lite.
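
That arithmetic can be checked in a few lines. The figures are back-of-the-envelope estimates at FP16 using the published DeepSeek V2 shape (60 layers, 128 heads, head dim 128, latent 512, decoupled RoPE key 64); they ignore runtime overheads such as paging or quantized caches.

```python
# Back-of-the-envelope KV cache sizes for DeepSeek-V2-style dimensions at FP16.
layers, n_heads, d_head = 60, 128, 128
d_latent, d_rope = 512, 64
bytes_per_elem = 2            # FP16
ctx = 128 * 1024              # 128K tokens

standard = 2 * n_heads * d_head * layers * ctx * bytes_per_elem
mla = (d_latent + d_rope) * layers * ctx * bytes_per_elem

print(f"standard MHA cache: {standard / 1e9:,.0f} GB")   # ~515 GB
print(f"MLA cache:          {mla / 1e9:,.1f} GB")        # ~9.1 GB
print(f"reduction:          {standard / mla:.0f}x")      # ~57x
```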

Workflow example

When running DeepSeek V2 in llama.cpp or vLLM, MLA is handled automatically by the runtime; operators don't need to configure anything, since the model architecture defines the latent dimension. Monitoring VRAM usage with nvidia-smi or ollama ps will show much lower KV cache consumption than a non-MLA model of similar size. For example, loading the 16B DeepSeek-V2-Lite with a long context in Ollama keeps the KV cache in the low single-digit gigabytes, where a standard-attention model of the same shape would need tens of gigabytes, which is what makes long contexts workable on a single RTX 4090.
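
One way to eyeball the difference is to sample GPU memory before and after loading a model. Below is a minimal sketch using nvidia-smi's query interface (NVIDIA driver tools required; the model name in the prompt is only an example).

```python
import subprocess

def gpu_mem_used_mib() -> list[int]:
    """Return used VRAM in MiB for each visible NVIDIA GPU via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(line) for line in out.strip().splitlines()]

before = gpu_mem_used_mib()
input("Load the model (e.g. `ollama run deepseek-v2`), then press Enter...")
after = gpu_mem_used_mib()

for i, (b, a) in enumerate(zip(before, after)):
    print(f"GPU {i}: +{a - b} MiB used after loading")
```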

Related terms

KV Cache · Multi-Query Attention (MQA) · DeepSeek · Grouped-Query Attention (GQA)

Reviewed by Fredoline Eruo. See our editorial policy.