Latency
Latency measures how fast you get a response. Two metrics matter for local LLMs:
Time to First Token (TTFT) — wall-clock time from request to first generated token. Dominated by the prefill phase, which is compute-bound, so TTFT grows roughly linearly with prompt length. On a 4090, a 1K-token prompt has ~50 ms TTFT; a 32K-token prompt takes 1-2 seconds.
Inter-Token Latency — time between consecutive tokens during generation; the reciprocal of single-stream tokens-per-second. Dominated by memory bandwidth in the decode phase, since generating each token requires streaming the model weights from memory.
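Both metrics fall out of timestamping a token stream. A minimal sketch, using a simulated generator in place of a real streaming LLM client (the generator, its delays, and its token names are all hypothetical):

```python
import time

def fake_token_stream(n_tokens=20, prefill_s=0.05, per_token_s=0.02):
    """Stand-in for a streaming LLM client; any API that yields
    tokens as they are produced would slot in here."""
    time.sleep(prefill_s)            # prefill: compute-bound prompt processing
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)  # decode: memory-bandwidth-bound
        yield f"tok{i}"

start = time.perf_counter()
ttft = None
stamps = []
for tok in fake_token_stream():
    now = time.perf_counter()
    if ttft is None:
        ttft = now - start           # time to first token
    stamps.append(now)

# inter-token latency: mean gap between consecutive tokens
gaps = [b - a for a, b in zip(stamps, stamps[1:])]
itl = sum(gaps) / len(gaps)
print(f"TTFT: {ttft * 1000:.0f} ms")
print(f"Inter-token latency: {itl * 1000:.1f} ms (~{1 / itl:.0f} tok/s)")
```

The same loop works against a real backend: only the first timestamp counts toward TTFT, and every later gap contributes to inter-token latency.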
Latency is distinct from throughput, which measures total tokens-per-second across batched or concurrent requests. A serving system optimized for throughput (e.g. vLLM with continuous batching) often has worse single-request latency than one optimized for latency (e.g. ExLlamaV2).
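The trade-off is easy to see in numbers. With hypothetical figures (illustrative only, not benchmarks), batching multiplies aggregate throughput while slowing each individual stream:

```python
# Hypothetical per-request decode speeds, not measured benchmarks.
single_tok_s = 60.0    # one request running alone
batch8_per_req = 25.0  # each of 8 concurrent requests

aggregate = 8 * batch8_per_req          # total tokens/s across the batch
slowdown = single_tok_s / batch8_per_req
print(f"Aggregate throughput: {aggregate:.0f} tok/s")   # 200 tok/s
print(f"Per-request slowdown: {slowdown:.1f}x")         # 2.4x
```

Aggregate throughput more than triples, yet every individual user waits over twice as long per token, which is why the right serving system depends on whether you have one interactive user or many concurrent ones.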
Related terms
Reviewed by Fredoline Eruo.