Tokens per second
Tokens per second (tok/s) is the most-cited LLM throughput metric, and also the most misunderstood. Inference splits into two distinct phases: prefill (processing the input prompt, typically 100-1000+ tok/s on modern hardware) and decode (generating output tokens, typically 10-200 tok/s). When operators say "tok/s," they usually mean decode tok/s, the user-visible streaming speed.
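The prefill/decode split can be computed from three timestamps of a single run. A minimal sketch (the function name and the example numbers are illustrative, not from any real benchmark):

```python
def split_throughput(t_start, t_first_token, t_end, prompt_tokens, output_tokens):
    """Split one generation run into prefill and decode throughput.

    Timestamps are in seconds. Prefill spans t_start -> t_first_token (the
    whole prompt is processed before any output token appears); decode
    spans t_first_token -> t_end, while the remaining tokens stream.
    """
    prefill_s = t_first_token - t_start
    decode_s = t_end - t_first_token
    prefill_tok_s = prompt_tokens / prefill_s
    # the first output token is produced by prefill, so decode emits n-1 tokens
    decode_tok_s = (output_tokens - 1) / decode_s if output_tokens > 1 else 0.0
    return prefill_tok_s, decode_tok_s

# Illustrative run: 512-token prompt, 128 output tokens,
# first token at 0.8 s, finished at 7.15 s. The same run yields
# two very different tok/s numbers (prefill vs decode).
print(split_throughput(0.0, 0.8, 7.15, 512, 128))
```

Note that both numbers come from one run; which one a benchmark reports is exactly the ambiguity described above.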
What tok/s doesn't tell you: TTFT (time to first token), context-degradation behavior (how much does throughput drop at 32K context vs 1K?), concurrency scaling (does throughput hold at 8 concurrent users?), and thermal-throttle curves (does sustained-load tok/s match cold-boot tok/s?). A model rated at "60 tok/s on RTX 4090" can describe very different real-world experiences depending on prompt length, batch size, quant, runtime, and system load.
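One way to see how elastic a headline tok/s figure is: the same decode rate produces very different end-to-end throughput once TTFT is included. A hedged sketch (function name and numbers are illustrative):

```python
def end_to_end_tok_s(ttft_s, decode_tok_s, output_tokens):
    """Output throughput over the whole request, amortizing TTFT.

    Pure decode speed ignores time to first token; end-to-end speed
    includes it, so short outputs look far slower even when the decode
    rate is identical. Both numbers can honestly be called "tok/s".
    """
    decode_s = (output_tokens - 1) / decode_tok_s  # time spent streaming
    return output_tokens / (ttft_s + decode_s)

# Same 60 tok/s decode rate, same 2 s TTFT, different output lengths:
print(round(end_to_end_tok_s(2.0, 60.0, 64), 1))    # short answer: TTFT dominates
print(round(end_to_end_tok_s(2.0, 60.0, 1024), 1))  # long generation: approaches 60
```

For the 64-token answer the end-to-end figure lands near 21 tok/s; for the 1024-token generation it climbs toward the 60 tok/s decode rate.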
Operator discipline: when you read a tok/s benchmark, ask: (1) is it measured or estimated? (2) what's the prompt + output length? (3) what's the batch size + concurrency? (4) what runtime + quant + flash-attention version? Without these, the number is a vibe, not a measurement. RunLocalAI's benchmark queue tracks pending measurements with full provenance fields; published benchmarks carry confidence labels (high/medium/low/unverified).
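The four questions above amount to a provenance schema. A minimal sketch in Python; the field names and the labeling rule are illustrative assumptions, not RunLocalAI's actual schema:

```python
from dataclasses import dataclass

@dataclass
class TokPerSecBenchmark:
    decode_tok_s: float
    measured: bool        # (1) measured run, or estimated/extrapolated?
    prompt_tokens: int    # (2) prompt + output length
    output_tokens: int
    batch_size: int       # (3) batch size + concurrency
    concurrency: int
    runtime: str          # (4) e.g. "llama.cpp"; "" if unknown
    quant: str            # e.g. "Q4_K_M"; "" if unknown
    flash_attention: str  # e.g. "FA2"; "" if unknown

    def confidence(self) -> str:
        """Map provenance completeness to a confidence label."""
        if not self.measured:
            return "unverified"  # an estimate is never more than that
        missing = sum(
            1 for v in (self.runtime, self.quant, self.flash_attention) if not v
        )
        if missing == 0:
            return "high"
        return "medium" if missing == 1 else "low"

full = TokPerSecBenchmark(60.0, True, 512, 128, 1, 1, "llama.cpp", "Q4_K_M", "FA2")
print(full.confidence())
```

The point of the rule is that a number with incomplete provenance should be downgraded automatically rather than argued about per case.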
Reviewed by Fredoline Eruo. See our editorial policy.