Evaluation metrics
Throughput vs Latency
Throughput is aggregate tokens generated per second across all in-flight requests; latency is the wall-clock time for a single request (time to first token, TTFT, plus total decode time). Optimizing for one trades against the other.
Bigger batch sizes increase throughput but raise per-request latency, because each forward pass takes longer when more requests are packed in. Smaller batches do the opposite.
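The trade-off can be made concrete with a toy cost model. The numbers below are illustrative assumptions, not measurements from any real system: each decode step is modeled as a fixed overhead plus a per-request cost, so larger batches amortize the fixed cost (more aggregate tokens per second) while making every step, and therefore every request's next token, slower.

```python
def step_time_ms(batch_size, fixed_ms=20.0, per_request_ms=2.0):
    """Assumed time for one decode step over a batch (toy model).

    fixed_ms covers weight loads and kernel launch overhead shared by the
    whole batch; per_request_ms is the marginal cost of one more request.
    Both values are made up for illustration.
    """
    return fixed_ms + per_request_ms * batch_size

for batch in (1, 8, 32):
    t = step_time_ms(batch)
    throughput = batch * 1000.0 / t   # aggregate tok/s across the whole batch
    per_request = 1000.0 / t          # decode speed each individual request sees
    print(f"batch={batch:2d}  step={t:5.1f} ms  "
          f"throughput={throughput:6.1f} tok/s  per-request={per_request:5.1f} tok/s")
```

With these assumed costs, going from batch 1 to batch 32 raises aggregate throughput from about 45 to about 381 tok/s while the decode speed seen by any single request drops from about 45 to about 12 tok/s.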
Local single-user AI almost always cares about latency: TTFT under 500 ms and decode at "feels-instant" speeds (>15 tok/s for chat, >40 tok/s for code). Multi-user serving cares about throughput: aggregate tokens per second per dollar of hardware, i.e. cost per token served. Don't pick a config from a throughput benchmark when you're optimizing for latency, or vice versa.
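To check a setup against latency targets like these, you can time the token stream directly. This is a minimal sketch, assuming any streaming generation API that yields tokens one at a time; `fake_stream` is a hypothetical stand-in for such an API, used here only to exercise the measurement.

```python
import time

def measure(stream):
    """Return (TTFT in seconds, decode speed in tok/s) for a token stream.

    `stream` is any iterable yielding generated tokens as they arrive.
    TTFT is measured from when iteration starts; decode speed counts the
    tokens after the first, over the time since the first token.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        now = time.perf_counter()
        if first is None:
            first = now
        count += 1
    ttft = first - start
    decode_tok_s = (count - 1) / (now - first) if count > 1 and now > first else 0.0
    return ttft, decode_tok_s

def fake_stream(n=5, first_delay=0.05, gap=0.01):
    """Hypothetical generator simulating a model: slow first token, then steady."""
    time.sleep(first_delay)
    yield "tok"
    for _ in range(n - 1):
        time.sleep(gap)
        yield "tok"

ttft, tps = measure(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, decode: {tps:.0f} tok/s")
```

Against the targets above, you would check that `ttft` stays under 0.5 s and `tps` above the threshold for your workload.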
Reviewed by Fredoline Eruo.