Transformer & LLM components
Decode (Token Generation)
Decode is the second phase of LLM inference: generating one output token at a time, autoregressively. Each decode step performs matrix-vector multiplications that touch every model weight once, then samples the next token from the resulting output distribution and feeds it back in as input for the following step.
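A minimal sketch of that loop, assuming a hypothetical model_forward() that stands in for a real forward pass (here it just returns random logits so the snippet runs standalone); the names, vocabulary size, and sampling parameters are illustrative, not from any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 32000  # hypothetical vocabulary size

def model_forward(token_ids):
    # Stand-in for a real forward pass: in an actual LLM this is the step
    # that reads every weight tensor once and returns next-token logits.
    return rng.standard_normal(VOCAB_SIZE)

def decode(prompt_ids, max_new_tokens=16, temperature=0.8, eos_id=2):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model_forward(ids)                      # one full pass over the weights
        probs = np.exp((logits - logits.max()) / temperature)
        probs /= probs.sum()                             # softmax over the vocabulary
        next_id = int(rng.choice(VOCAB_SIZE, p=probs))   # sample the next token
        ids.append(next_id)                              # autoregressive: feed it back in
        if next_id == eos_id:
            break
    return ids

print(decode([1, 42, 7]))
```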
Decode is memory-bandwidth-bound, not compute-bound. The per-token latency floor is model_size / memory_bandwidth, so the throughput ceiling is memory_bandwidth / model_size. A 7B model in Q4 (5 GB) on an RTX 4090 (1008 GB/s) tops out at ~200 tok/s in theory, with real numbers typically 60–75% of that.
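A back-of-the-envelope check of those numbers (the 60–75% efficiency band is the range quoted above, not a measurement):

```python
# Decode ceiling for the example above: 7B model in Q4 on an RTX 4090.
model_bytes = 5e9       # ~5 GB of weights
bandwidth   = 1008e9    # 1008 GB/s memory bandwidth

latency_floor = model_bytes / bandwidth   # seconds per token (all weights read once)
ceiling       = bandwidth / model_bytes   # theoretical tokens per second

print(f"latency floor ~ {latency_floor * 1e3:.1f} ms/token")
print(f"ceiling ~ {ceiling:.0f} tok/s, "
      f"realistic ~ {0.60 * ceiling:.0f}-{0.75 * ceiling:.0f} tok/s")
```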
This is why decode tok/s scales with VRAM bandwidth (HBM > GDDR7 > GDDR6X > unified memory) far more than with FLOPS. Batch size and speculative decoding are the main levers to push decode past the bandwidth wall.
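A simplified model of the batching lever, assuming weight reads dominate and ignoring KV-cache traffic and the point where compute becomes the new limit (the function name and numbers are illustrative):

```python
def decode_throughput(model_bytes, bandwidth_bytes_s, batch_size=1):
    # Each decode step still streams all weights once, but yields one token
    # per sequence in the batch, so aggregate tok/s scales with batch size
    # until compute or KV-cache bandwidth takes over as the bottleneck.
    step_time = model_bytes / bandwidth_bytes_s
    return batch_size / step_time

for bs in (1, 4, 16):
    print(f"batch {bs:2d}: ~{decode_throughput(5e9, 1008e9, bs):.0f} tok/s aggregate")
```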