Transformer & LLM components
Decode (Token Generation)
Decode is the second phase of LLM inference: generating one output token at a time, autoregressively. Each decode step performs matrix-vector multiplications that touch every model weight once, then samples the next token from the resulting output distribution and feeds it back in as input for the following step.
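A minimal sketch of that loop, assuming a hypothetical model_forward() that stands in for a real forward pass (here it just returns random logits so the snippet runs standalone); the names, vocabulary size, and sampling parameters are illustrative, not from any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 32000  # hypothetical vocabulary size

def model_forward(token_ids):
    # Stand-in for a real forward pass: in an actual LLM this is the step
    # that reads every weight tensor once and returns next-token logits.
    return rng.standard_normal(VOCAB_SIZE)

def decode(prompt_ids, max_new_tokens=16, temperature=0.8, eos_id=2):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model_forward(ids)                      # one full pass over the weights
        probs = np.exp((logits - logits.max()) / temperature)
        probs /= probs.sum()                             # softmax over the vocabulary
        next_id = int(rng.choice(VOCAB_SIZE, p=probs))   # sample the next token
        ids.append(next_id)                              # autoregressive: feed it back in
        if next_id == eos_id:
            break
    return ids

print(decode([1, 42, 7]))
```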
Decode is memory-bandwidth-bound, not compute-bound. The per-token latency floor is model_size / memory_bandwidth, so the throughput ceiling is memory_bandwidth / model_size. A 7B model in Q4 (5 GB) on an RTX 4090 (1008 GB/s) tops out at ~200 tok/s in theory, with real numbers typically 60–75% of that.
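A back-of-the-envelope check of those numbers (the 60–75% efficiency band is the range quoted above, not a measurement):

```python
# Decode ceiling for the example above: 7B model in Q4 on an RTX 4090.
model_bytes = 5e9       # ~5 GB of weights
bandwidth   = 1008e9    # 1008 GB/s memory bandwidth

latency_floor = model_bytes / bandwidth   # seconds per token (all weights read once)
ceiling       = bandwidth / model_bytes   # theoretical tokens per second

print(f"latency floor ~ {latency_floor * 1e3:.1f} ms/token")
print(f"ceiling ~ {ceiling:.0f} tok/s, "
      f"realistic ~ {0.60 * ceiling:.0f}-{0.75 * ceiling:.0f} tok/s")
```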
This is why decode tok/s scales with VRAM bandwidth (HBM > GDDR7 > GDDR6X > unified memory) far more than with FLOPS. Batch size and speculative decoding are the main levers to push decode past the bandwidth wall.
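A simplified model of the batching lever, assuming weight reads dominate and ignoring KV-cache traffic and the point where compute becomes the new limit (the function name and numbers are illustrative):

```python
def decode_throughput(model_bytes, bandwidth_bytes_s, batch_size=1):
    # Each decode step still streams all weights once, but yields one token
    # per sequence in the batch, so aggregate tok/s scales with batch size
    # until compute or KV-cache bandwidth takes over as the bottleneck.
    step_time = model_bytes / bandwidth_bytes_s
    return batch_size / step_time

for bs in (1, 4, 16):
    print(f"batch {bs:2d}: ~{decode_throughput(5e9, 1008e9, bs):.0f} tok/s aggregate")
```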