vLLM AsyncEngineDeadError after large batch / OOM
Cause
vLLM's async engine crashed in a background task and won't accept new requests. The most common cause is a CUDA OOM hit during batched scheduling — too many concurrent requests, prompts longer than the configured max_model_len, or KV cache exhaustion under bursty load.
Once the engine dies, every subsequent API call surfaces this error; the original CUDA OOM traceback only appears in the server logs.
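You can confirm a dead engine from the client side before digging into logs. The sketch below is an assumption-laden example, not part of this fix: it assumes the OpenAI-compatible server is running at the default http://localhost:8000 and uses its /health endpoint; adjust the URL for your deployment.

# Hedged sketch: probe a local vLLM OpenAI-compatible server before sending work.
import requests

def vllm_is_healthy(base_url: str = "http://localhost:8000") -> bool:
    # When the async engine has died, completion calls come back as errors;
    # /health answering anything other than 200 is the cheap early signal.
    try:
        return requests.get(f"{base_url}/health", timeout=5).status_code == 200
    except requests.RequestException:
        return False

if not vllm_is_healthy():
    raise RuntimeError("vLLM engine appears dead; check server logs and restart it.")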
Solution
1. Read the actual cause from server stderr — search for OutOfMemoryError or CUDA error above the AsyncEngineDeadError. The fix depends on which one fired.
2. Reduce concurrency:
vllm serve <model> \
--max-num-seqs 16 \
--max-num-batched-tokens 4096 \
--gpu-memory-utilization 0.85
3. Restart the server. vLLM does not auto-recover; you must kill and relaunch:
pkill -f "vllm serve"
vllm serve <model> ...
4. Add KV cache headroom by lowering --gpu-memory-utilization from the default 0.9 to 0.85. vLLM claims that fraction of GPU memory for weights and KV cache and leaves the rest free, so the lower value gives activation spikes and other overhead room to breathe and reduces the chance of tipping into OOM under bursty load. (The same settings are shown as Python engine arguments in the sketch after this list.)
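If you embed vLLM in a Python process instead of running vllm serve, the same knobs exist as engine arguments. A minimal sketch with the values from steps 2 and 4; the model name is only a placeholder.

# Hedged sketch: the same limits applied to an in-process vLLM engine.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",        # placeholder; substitute your own model
    max_num_seqs=16,                  # cap concurrent sequences per scheduling step
    max_num_batched_tokens=4096,      # cap tokens batched per step
    gpu_memory_utilization=0.85,      # leave headroom for activations and overhead
)

outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)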
Alternative solutions
Pin the engine to a single replica behind a queue (Redis, NATS) so bursts get spread over time instead of concurrently overloading the GPU.
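A minimal sketch of that pattern with Redis, under stated assumptions: Redis on localhost:6379, the vLLM server on localhost:8000, and a queue name (vllm:jobs) and payload shape that are illustrative choices, not a fixed convention.

# Hedged sketch: one worker drains a Redis list and forwards jobs to vLLM
# sequentially, so traffic bursts wait in Redis instead of piling onto the GPU.
import json
import redis
import requests

r = redis.Redis()  # assumes Redis on localhost:6379
VLLM_URL = "http://localhost:8000/v1/completions"

while True:
    _, raw = r.blpop("vllm:jobs")   # producers RPUSH JSON payloads onto this list
    job = json.loads(raw)

    resp = requests.post(
        VLLM_URL,
        json={
            "model": job["model"],
            "prompt": job["prompt"],
            "max_tokens": job.get("max_tokens", 256),
        },
        timeout=300,
    )
    resp.raise_for_status()
    text = resp.json()["choices"][0]["text"]

    r.rpush(f"vllm:results:{job['id']}", text)  # hand the result back per job

Run a few workers if a single in-flight request leaves the GPU idle; the point is that the worker count, not the number of clients, decides how many requests reach the engine at once.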
Did this fix it?
If your case was different, email support@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.