Continuous Batching

Continuous batching (also called iteration-level scheduling) is a serving optimization in which the scheduler revisits the batch at every decoding iteration: a waiting request joins the active batch as soon as a running request finishes and frees its slot, instead of waiting for the whole batch to complete. The technique was introduced by Orca and is now standard in vLLM, TGI, and SGLang.

Compared to static batching, continuous batching can deliver roughly 2–10× higher throughput on real workloads where prompt and output lengths vary, because short requests no longer hold their slots idle until the longest request in the batch finishes. For local single-user setups the gain is small; the point is to keep a server busy under multi-user load.
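The effect of varying lengths can be illustrated with a toy count of decode iterations. This is a minimal sketch under simplifying assumptions, not a benchmark: every iteration is treated as costing the same, prefill is ignored, and the function names and request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Decode iterations when each batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Decode iterations when a freed slot is refilled at the next iteration."""
    waiting = deque(output_lens)
    active = []  # tokens still to generate for each running request
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots before this iteration.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        # One decode iteration: every active request emits one token,
        # and requests with nothing left to generate leave the batch.
        active = [r - 1 for r in active if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens) for eight requests.
lens = [2, 50, 3, 4, 60, 5, 2, 3]
print(static_batching_steps(lens, batch_size=4))      # 110
print(continuous_batching_steps(lens, batch_size=4))  # 62
```

With these made-up lengths, static batching needs 50 + 60 iterations because each batch waits for its longest request, while continuous batching admits the 60-token request as soon as the first short request finishes. The ratio depends entirely on how skewed the length distribution is.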

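Implementation requires a KV cache slot for each in-flight request and a scheduler that runs at every iteration. Support and defaults vary by serving stack and version: llama.cpp's server exposes continuous batching through its --cont-batching option, and Ollama, which builds on llama.cpp, handles concurrent requests according to its own parallelism settings. Check the documentation for the version you run.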
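The two requirements, per-request KV cache slots and per-iteration scheduling, can be sketched as a toy scheduler. This is an illustrative assumption about structure only: the Request and SlotScheduler names are hypothetical, the forward pass is replaced by a counter, and real engines typically manage KV memory at a finer granularity than one fixed slot per request.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    max_new_tokens: int
    generated: int = 0


class SlotScheduler:
    """Toy iteration-level scheduler with a fixed pool of KV cache slots."""

    def __init__(self, num_slots):
        self.free_slots = list(range(num_slots))
        self.running = {}  # slot index -> Request
        self.waiting = deque()

    def submit(self, request):
        self.waiting.append(request)

    def step(self):
        """Run one scheduling iteration; return ids of requests that finished."""
        # 1. Admit waiting requests while a KV slot is free.
        while self.waiting and self.free_slots:
            slot = self.free_slots.pop()
            self.running[slot] = self.waiting.popleft()
        # 2. Stand-in for one batched forward pass: one token per running request.
        finished = []
        for slot, req in list(self.running.items()):
            req.generated += 1
            if req.generated >= req.max_new_tokens:
                # 3. Release the slot so the next iteration can reuse it.
                del self.running[slot]
                self.free_slots.append(slot)
                finished.append(req.req_id)
        return finished


sched = SlotScheduler(num_slots=2)
for rid, n in [("a", 1), ("b", 3), ("c", 2)]:
    sched.submit(Request(rid, n))

history = []
while sched.waiting or sched.running:
    history.append(sched.step())
print(history)  # [['a'], [], ['b', 'c']]
```

Request "c" starts in the second iteration, reusing the slot that "a" released, while "b" is still generating. That reuse in the middle of a batch is what distinguishes this loop from a static batch.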
