vLLM: No available KV cache blocks
Cause
vLLM pre-allocates KV cache blocks at startup based on gpu_memory_utilization (default 0.9). Once running, requests with long prompts can exhaust the pre-allocated pool — vLLM doesn't dynamically grow it.
A common scenario: running a 14B model on 24 GB VRAM at 90% utilization leaves only enough KV cache for roughly 8K combined tokens across all concurrent requests. Two 4K-prompt requests fill it; the third has nowhere to go and errors.
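You can sanity-check your own setup with a back-of-envelope estimate of the per-token KV cache cost. The sketch below is not vLLM's exact accounting, and every number in it is an illustrative placeholder; substitute the values from your model's config.json and your actual weight footprint.

# Rough KV cache sizing; placeholder values, not vLLM's exact accounting
# (ignores block granularity and activation/profiling overhead).
num_layers     = 28    # config.json: num_hidden_layers
num_kv_heads   = 4     # config.json: num_key_value_heads (GQA); equals num_attention_heads for MHA
head_dim       = 128   # config.json: head_dim, or hidden_size / num_attention_heads
kv_dtype_bytes = 2     # FP16/BF16 KV cache

# One K vector and one V vector per layer, per token.
kv_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_dtype_bytes

vram_gb    = 24
gpu_util   = 0.9       # --gpu-memory-utilization
weights_gb = 5.5       # assumed 4-bit 7B model; use your real weight size

kv_budget_gb = vram_gb * gpu_util - weights_gb
print(f"~{kv_per_token / 1024:.0f} KB per token, "
      f"roughly {kv_budget_gb * 1024**3 / kv_per_token:,.0f} cacheable tokens")

Run it with your model's real numbers before picking a fix below.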
Solution
Lower the maximum model length so long requests can't claim as much of the cache:
vllm serve qwen2.5-7b --max-model-len 16384
Increase gpu_memory_utilization (more KV cache, less safety margin):
vllm serve qwen2.5-7b --gpu-memory-utilization 0.95
Risk: leaves no room for activation memory spikes; can OOM on bursty load.
Add swap_space for CPU offload of cache:
vllm serve qwen2.5-7b --swap-space 8   # 8 GiB of CPU RAM per GPU
Active blocks stay in VRAM; when the pool fills, vLLM can swap blocks out to system RAM and bring them back later instead of dropping them. You trade a small latency hit for more effective capacity.
Reduce max_num_seqs to limit concurrency:
vllm serve qwen2.5-7b --max-num-seqs 16
Use a smaller model if you genuinely need to serve many concurrent users. A 7B model at Q4 with 24 GB VRAM happily serves 32 concurrent 4K-context users; a 14B at FP16 won't.
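If you embed vLLM in Python instead of running vllm serve, the same knobs exist as constructor arguments on the offline LLM class. A minimal sketch, assuming the Qwen/Qwen2.5-7B-Instruct checkpoint and the example values from the commands above:

from vllm import LLM, SamplingParams

# Same knobs as the CLI flags, via the offline API.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # assumed HF id; use whatever you actually serve
    max_model_len=16384,               # --max-model-len
    gpu_memory_utilization=0.95,       # --gpu-memory-utilization
    swap_space=8,                      # --swap-space, GiB of CPU RAM per GPU
    max_num_seqs=16,                   # --max-num-seqs
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)

You normally wouldn't stack all four at once; start with the one that matches your failure mode and measure.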
Did this fix it?
If your case was different, email hello@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.