Evaluation metrics

Time to first token (TTFT)

TTFT (time to first token) is the latency between sending a prompt and receiving the first generated token. It is dominated by the prefill phase, in which the model processes the input prompt before generation begins. For a 1K-token prompt on an RTX 4090 running Llama 3.1 8B, TTFT is typically 50-150 ms; for a 32K-token prompt it can rise to 1-3 seconds.
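
To see TTFT directly, time the gap between sending a streamed request and the first content chunk. A minimal sketch against an OpenAI-compatible local server (the endpoint URL and model name below are placeholders; vLLM, SGLang, and llama.cpp's server all expose this API):

import time
from openai import OpenAI

# Placeholder endpoint; point this at your local server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Explain KV caching briefly."}],
    stream=True,
)

# The first non-empty content chunk marks the end of prefill.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
        break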

Why TTFT matters operationally: chat UX feels broken at TTFT > 1 second; agent loops with frequent short tool-call turns are dominated by TTFT, not decode tok/s. A runtime that wins on decode but loses on prefill (e.g. some llama.cpp configurations) feels unresponsive in agentic workloads even when the steady-state tok/s is competitive.
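
A back-of-envelope sketch of the agent-loop claim (all numbers here are illustrative, not measurements):

# Why agent loops are TTFT-bound: illustrative numbers only.
turns = 10            # tool-call round trips in one agent task
ttft_s = 0.8          # prefill latency per turn (long accumulated context)
out_tokens = 40       # short tool-call responses
decode_tps = 100      # steady-state decode speed, tokens/s

prefill_total = turns * ttft_s                   # 8.0 s
decode_total = turns * out_tokens / decode_tps   # 4.0 s
print(f"prefill {prefill_total:.1f} s vs decode {decode_total:.1f} s")

Under these assumptions, doubling decode speed saves 2 seconds of total latency, while halving TTFT saves 4; the prefill side dominates.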

Optimization levers:
  • PagedAttention (vLLM, SGLang) manages the KV cache in fixed-size blocks; on its own it is a memory-efficiency win, but it is also what makes reusing the KV cache of repeated prompts practical.
  • Speculative decoding is sometimes cited as a TTFT lever, but it is mostly decode-side; it compresses total response time for short outputs rather than time to the first token.
  • Prefix caching (RadixAttention in SGLang, automatic prefix caching in vLLM) is the single biggest TTFT win for agent workloads where the system prompt is stable across requests.
  • Flash Attention 2/3 reduces prefill compute meaningfully on long-context queries.
  • Quantization choice affects TTFT differently than decode: AWQ-INT4 has slightly slower prefill kernels than FP16 on some models because of dequantization overhead.
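
A sketch of the prefix-caching lever in vLLM's offline API (the model name is an example, and the enable_prefix_caching flag assumes a reasonably recent vLLM release):

from vllm import LLM, SamplingParams

# With prefix caching on, KV-cache blocks for a repeated prefix
# (e.g. a stable system prompt) are reused instead of re-prefilled.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          enable_prefix_caching=True)

system = "You are a tool-using agent. ..."  # long, stable system prompt
params = SamplingParams(max_tokens=32)

# The first request pays full prefill; later requests sharing the
# prefix skip it, so their TTFT drops sharply.
for user_turn in ["list files", "read config.yaml"]:
    out = llm.generate([system + user_turn], params)
    print(out[0].outputs[0].text)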

Related terms

  • KV Cache
  • Latency
  • Flash Attention
  • Throughput
  • Tokens per second

See also

  • tool: vllm
  • tool: sglang
  • tool: tensorrt-llm

Reviewed by Fredoline Eruo. See our editorial policy.
