Multi-Head Latent Attention (MLA)
Also known as: latent attention, DeepSeek attention
Multi-Head Latent Attention (MLA) is an attention mechanism introduced in DeepSeek V2 and carried into V3 that compresses the key-value (KV) cache into a lower-dimensional latent space. Instead of storing full key and value vectors for each token, MLA stores a compressed latent vector and reconstructs the keys and values on the fly during generation. This drastically reduces KV cache memory usage (DeepSeek reports a reduction of over 90%) while maintaining model quality. For operators, this means larger context windows or smaller VRAM footprints when running DeepSeek models locally.
Deeper dive
Standard multi-head attention stores separate key and value vectors for every attention head, so the KV cache grows linearly with sequence length, number of heads, and number of layers. MLA introduces a down-projection matrix that maps each token's hidden state into a shared latent space of much lower dimension. During inference, only this latent vector (plus, in DeepSeek's design, a small shared key carrying the rotary position information) is stored per token per layer; the full per-head keys and values are reconstructed via up-projection matrices before computing attention scores. This shrinks the per-token, per-layer cache from 2 * n_heads * d_head values to roughly d_latent values. DeepSeek V2 uses a latent dimension of 512 (plus a 64-dimensional decoupled RoPE key), compared with 128 attention heads of dimension 128, i.e. 32,768 key/value entries per token per layer under standard attention. The reconstruction adds a few extra matrix multiplies of negligible cost but yields substantial memory savings, making 128K-token contexts far more practical. A minimal sketch of the mechanism follows.
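As an illustration of the mechanism described above, here is a minimal, self-contained sketch in NumPy. The dimensions, weight matrices, and function names are toy placeholders, not DeepSeek's real configuration or code, and the decoupled RoPE key and query compression used in the actual architecture are omitted for brevity.

```python
# Minimal sketch of the MLA KV path (illustrative, not DeepSeek's actual code).
# Per token we cache only a d_latent vector and rebuild per-head keys/values
# with up-projection matrices at attention time. All sizes here are toy values.

import numpy as np

n_heads, d_head, d_model, d_latent = 8, 64, 512, 128  # placeholder dimensions

rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02           # shared KV down-projection
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # key up-projection
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # value up-projection
W_q    = rng.standard_normal((d_model, n_heads * d_head)) * 0.02   # query projection

latent_cache = []  # the only per-token state kept: d_latent floats per token

def decode_step(h_t: np.ndarray) -> np.ndarray:
    """One decoding step given the hidden state h_t of the newest token."""
    latent_cache.append(h_t @ W_down)              # cache d_latent values, not 2*n_heads*d_head
    c = np.stack(latent_cache)                     # (seq, d_latent)

    # Reconstruct full keys/values from the latent cache (the extra matmuls).
    k = (c @ W_up_k).reshape(-1, n_heads, d_head)  # (seq, heads, d_head)
    v = (c @ W_up_v).reshape(-1, n_heads, d_head)
    q = (h_t @ W_q).reshape(n_heads, d_head)       # (heads, d_head)

    # Standard scaled dot-product attention, one row of scores per head.
    scores = np.einsum("hd,shd->hs", q, k) / np.sqrt(d_head)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    out = np.einsum("hs,shd->hd", probs, v)        # (heads, d_head)
    return out.reshape(-1)

for _ in range(4):                                 # decode a few toy tokens
    print(decode_step(rng.standard_normal(d_model)).shape)
```

The point of the sketch is the asymmetry: the cache holds only d_latent values per token, while the up-projections rebuild the full per-head keys and values only at attention time.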
Practical example
DeepSeek V2 has 128 attention heads with a head dimension of 128, so standard multi-head attention would cache 2 * 128 * 128 = 32,768 values per token per layer. With MLA, only the 512-dimensional latent vector plus a shared 64-dimensional RoPE key are cached: 576 values per token per layer, roughly a 57× reduction. Over 60 layers and a 128K context at FP16, that works out to around 8–9 GB for the MLA cache versus roughly 480 GB for full multi-head attention. The uncompressed cache alone would not fit on any single GPU; with MLA, long-context memory is dominated by the model weights rather than the cache. The short calculation below reproduces these figures.
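For readers who want to check the arithmetic, this back-of-envelope script uses DeepSeek V2's published configuration (60 layers, 128 heads of dimension 128, a 512-dimensional latent plus a shared 64-dimensional RoPE key) and assumes an FP16 cache; exact numbers will vary with the runtime's cache layout and precision.

```python
# Back-of-envelope KV-cache arithmetic for the figures above.
# Assumes DeepSeek V2's published config and FP16 (2-byte) cache entries.

layers, n_heads, d_head = 60, 128, 128
d_latent, d_rope = 512, 64
context, bytes_per_val = 128 * 1024, 2          # 128K tokens, FP16

mha_per_token = 2 * n_heads * d_head            # keys + values, per layer
mla_per_token = d_latent + d_rope               # latent + shared RoPE key, per layer

mha_gib = mha_per_token * layers * context * bytes_per_val / 1024**3
mla_gib = mla_per_token * layers * context * bytes_per_val / 1024**3

print(f"full MHA cache: {mha_gib:.0f} GiB")                        # ~480 GiB
print(f"MLA cache:      {mla_gib:.1f} GiB")                        # ~8.4 GiB
print(f"reduction:      {mha_per_token / mla_per_token:.0f}x")     # ~57x
```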
Workflow example
When running DeepSeek V2 or V3 in recent versions of llama.cpp or vLLM, MLA is handled by the runtime. Operators don't need to configure anything, because the model architecture defines the latent dimension. Monitoring VRAM usage with nvidia-smi or ollama ps will show significantly lower KV cache consumption than a non-MLA model of similar size at the same context length: a 128K context that would require hundreds of gigabytes of cache under standard attention needs only single-digit gigabytes with MLA. A simple way to measure the difference is sketched below.
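As a rough way to quantify this locally, the snippet below snapshots GPU memory before and after loading a model at a chosen context size; the delta includes weights, KV cache, and runtime overhead, so compare runs that differ only in context length. It relies only on standard nvidia-smi query flags; the GPU index and the manual load step are assumptions about your setup.

```python
# Snapshot used VRAM before and after loading a long-context model, so the
# KV-cache footprint can be compared across context sizes or runtimes.

import subprocess

def gpu_mem_used_mib(gpu_index: int = 0) -> int:
    """Return used VRAM in MiB for one GPU, as reported by nvidia-smi."""
    out = subprocess.check_output([
        "nvidia-smi",
        f"--id={gpu_index}",
        "--query-gpu=memory.used",
        "--format=csv,noheader,nounits",
    ], text=True)
    return int(out.strip())

before = gpu_mem_used_mib()
input("Load the model with your target context size, then press Enter... ")
after = gpu_mem_used_mib()
print(f"VRAM delta: {(after - before) / 1024:.1f} GiB (weights + KV cache + overhead)")
```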
Related terms
Reviewed by Fredoline Eruo. See our editorial policy.