Request Batching
Request batching packs multiple inference requests into a single forward pass so that each read of the model weights from VRAM serves many requests at once. Because decode is memory-bandwidth-bound, doubling the batch size roughly doubles aggregate tokens/s with little added per-request latency, until the batch grows large enough to saturate compute.
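The effect is visible even in a single linear layer. The sketch below is a minimal illustration (NumPy standing in for a GPU kernel; the shapes and batch size are assumptions): one decode step for 32 requests computed either as 32 separate matrix-vector products, which re-read the weight matrix every time, or as one batched matrix-matrix product, which reads it once.

```python
import numpy as np

d_model, batch = 4096, 32
W = np.random.randn(d_model, d_model).astype(np.float32)   # model weights
xs = np.random.randn(batch, d_model).astype(np.float32)    # one token per request

# Unbatched: each request runs its own pass over W,
# so the weights are read from memory `batch` times.
outs_seq = np.stack([x @ W for x in xs])

# Batched: a single matmul reads W once and reuses it for every
# row in the batch, amortizing the memory traffic across requests.
outs_batched = xs @ W

assert np.allclose(outs_seq, outs_batched, atol=1e-2)
```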
Static batching (Ollama, llama.cpp default) waits until a full batch has formed, then runs it to completion before admitting new requests. Continuous batching (vLLM, TGI) admits new requests and retires finished ones between decode iterations, mid-flight. Dynamic batching (TensorRT-LLM) adapts the batch size to the current load.
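As a rough illustration of the continuous case, the loop below is a hedged sketch in plain Python; every name is hypothetical rather than a vLLM or TGI API. Requests join and leave the batch between decode iterations, so a new arrival never waits for the whole batch to drain.

```python
from collections import deque

MAX_BATCH = 8
waiting = deque(range(100))   # hypothetical request IDs waiting to start
active = {}                   # request ID -> tokens generated so far

def decode_step(batch_ids):
    """Stand-in for one batched forward pass; returns one new token per request."""
    return {rid: f"tok{len(active[rid])}" for rid in batch_ids}

def finished(rid):
    """Stand-in stopping rule: stop after 5 tokens."""
    return len(active[rid]) >= 5

while waiting or active:
    # Admit new requests mid-flight, up to the batch limit.
    while waiting and len(active) < MAX_BATCH:
        active[waiting.popleft()] = []
    # One decode iteration over the current batch.
    for rid, tok in decode_step(list(active)).items():
        active[rid].append(tok)
    # Retire finished requests immediately so their slots free up.
    for rid in [r for r in active if finished(r)]:
        del active[rid]
```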
For single-user local AI, batching is invisible. For multi-user serving, it is the difference between serving 1 and 50 concurrent users on the same hardware.