Request Batching
Request batching packs multiple inference requests into a single forward pass so that each read of the model weights from VRAM serves many requests at once. Because decode is memory-bandwidth-bound, doubling the batch size roughly doubles aggregate tokens/s with little added per-request latency, until the batch grows large enough to saturate compute.
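The effect is visible even in a single linear layer. The sketch below is a minimal illustration (NumPy standing in for a GPU kernel; the shapes and batch size are assumptions): one decode step for 32 requests computed either as 32 separate matrix-vector products, which re-read the weight matrix every time, or as one batched matrix-matrix product, which reads it once.

```python
import numpy as np

d_model, batch = 4096, 32
W = np.random.randn(d_model, d_model).astype(np.float32)   # model weights
xs = np.random.randn(batch, d_model).astype(np.float32)    # one token per request

# Unbatched: each request runs its own pass over W,
# so the weights are read from memory `batch` times.
outs_seq = np.stack([x @ W for x in xs])

# Batched: a single matmul reads W once and reuses it for every
# row in the batch, amortizing the memory traffic across requests.
outs_batched = xs @ W

assert np.allclose(outs_seq, outs_batched, atol=1e-2)
```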
Static batching (Ollama, llama.cpp default) waits until a full batch has formed, then runs it to completion before admitting new requests. Continuous batching (vLLM, TGI) admits new requests and retires finished ones between decode iterations, mid-flight. Dynamic batching (TensorRT-LLM) adapts the batch size to the current load.
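As a rough illustration of the continuous case, the loop below is a hedged sketch in plain Python; every name is hypothetical rather than a vLLM or TGI API. Requests join and leave the batch between decode iterations, so a new arrival never waits for the whole batch to drain.

```python
from collections import deque

MAX_BATCH = 8
waiting = deque(range(100))   # hypothetical request IDs waiting to start
active = {}                   # request ID -> tokens generated so far

def decode_step(batch_ids):
    """Stand-in for one batched forward pass; returns one new token per request."""
    return {rid: f"tok{len(active[rid])}" for rid in batch_ids}

def finished(rid):
    """Stand-in stopping rule: stop after 5 tokens."""
    return len(active[rid]) >= 5

while waiting or active:
    # Admit new requests mid-flight, up to the batch limit.
    while waiting and len(active) < MAX_BATCH:
        active[waiting.popleft()] = []
    # One decode iteration over the current batch.
    for rid, tok in decode_step(list(active)).items():
        active[rid].append(tok)
    # Retire finished requests immediately so their slots free up.
    for rid in [r for r in active if finished(r)]:
        del active[rid]
```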
For single-user local AI, batching is invisible. For multi-user serving, it is the difference between serving 1 and 50 concurrent users on the same hardware.