
vLLM worker crash — fix the KV cache and scheduler config

vLLM worker/scheduler crashes: KV cache fraction misconfiguration, max-model-len exceeding VRAM, worker timeouts, NCCL failures, and quant incompatibility. The exact fix order that production operators use.

vLLM · NVIDIA CUDA · Python
By Fredoline Eruo · Last verified 2026-05-08

Diagnostic order — most likely first

#1

KV cache fraction too high (gpu_memory_utilization overshoots available VRAM)

Diagnose

Crash on startup or first request. Logs show: `Worker failed to start due to memory allocation failure` or `CUDA out of memory during KV cache allocation`. The `--gpu-memory-utilization` default (0.90) is too aggressive for your model+context combo.

Fix

Lower `--gpu-memory-utilization` to 0.85 or 0.80. The model weights load first; the KV cache gets what's left. On a 24 GB card with a 40 GB model there is no 'left' — you need a larger card or a smaller model. Dropping `--max-model-len` also shrinks the KV cache reservation.
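
A minimal launch sketch with a more conservative memory budget (the model name and context length below are placeholders, not recommendations):

```bash
# Leave headroom instead of the default 0.90: weights load first, the KV cache takes the rest.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192
```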

#2

max-model-len setting exceeds available VRAM for model + KV cache

Diagnose

Crash specifically when a request hits the configured `max_model_len`. The scheduler reserved enough KV cache for long sequences, but there's not enough room after model weights. Logs show `OutOfMemoryError` during block allocation.

Fix

Lower `--max-model-len` to a value the card can actually serve. Rule of thumb: model weights + (max_model_len × KV cache bytes per token per layer × number of layers) must fit inside the VRAM budget you gave vLLM. Test with a short context first (`--max-model-len 2048`), then increase until you find the ceiling.
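
A back-of-the-envelope sizing sketch, assuming a 13B-class model with 40 layers, 40 KV heads, head dim 128, and an FP16 KV cache (swap in your model's actual config values):

```bash
# KV cache per token = 2 (K and V) x layers x kv_heads x head_dim x bytes per value.
LAYERS=40; KV_HEADS=40; HEAD_DIM=128; BYTES_FP16=2
PER_TOKEN=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16))   # bytes per token (~800 KiB here)
PER_SEQ_MIB=$((PER_TOKEN * 4096 / 1024 / 1024))                # one full 4096-token sequence
echo "KV cache: ${PER_TOKEN} bytes/token, ~${PER_SEQ_MIB} MiB per 4096-token sequence"
```

Multiply by the number of concurrent sequences you expect, add the weight footprint, and compare against `--gpu-memory-utilization` × total VRAM.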

#3

Worker timeout under load (scheduler kills slow workers)

Diagnose

Worker dies mid-generation during a burst of concurrent requests. Logs show `Worker crashed or timed out` or `AsyncEngineDead`. The worker wasn't dead — it was just slow under load and the scheduler's timeout fired.

Fix

Increase `VLLM_WORKER_TIMEOUT` (default 600 s) in your environment variables. Reduce `--max-num-seqs` to cap concurrent sequences. If the timeout fires because a single generation takes more than 10 minutes, reduce `--max-model-len` or increase `--tensor-parallel-size` to speed up the worker.
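
A sketch of that load-shedding launch. The timeout variable is the one named above (check `vllm/envs.py` in your release for the exact environment variables it honors); the model and limits are placeholders:

```bash
# Give slow-but-alive workers more time, and cap concurrency so each scheduler step finishes sooner.
export VLLM_WORKER_TIMEOUT=1200   # variable name taken from this guide; verify against your vLLM version
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --max-num-seqs 64 \
  --max-model-len 8192
```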

#4

NCCL crash on multi-GPU (tensor-parallel setup)

Diagnose

Crash only on multi-GPU runs. Logs show `NCCL error` or `Watchdog caught collective operation timeout`. Often caused by PCIe topology quirks or mismatched NCCL and driver versions across the processes involved.

Fix

Set `NCCL_DEBUG=INFO` to see which collective operation fails. Set `NCCL_P2P_DISABLE=1` if the GPUs sit on different PCIe switches. Ensure every GPU runs under the same driver and CUDA version. For consumer cards (3090/4090) without NVLink: also set `NCCL_IB_DISABLE=1` and keep `--tensor-parallel-size` at or below the number of GPUs on the same NUMA node.
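
A debugging launch sketch for a two-GPU tensor-parallel run (the model is a placeholder; remove the disables once `NCCL_DEBUG` has shown you the failing path):

```bash
# Surface the failing collective, then rule out P2P and InfiniBand paths on consumer hardware.
export NCCL_DEBUG=INFO
export NCCL_P2P_DISABLE=1   # try this if the GPUs are on different PCIe switches
export NCCL_IB_DISABLE=1    # consumer boards have no InfiniBand; stop NCCL from probing for it
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2
```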

#5

AWQ / GPTQ quant incompatibility with the vLLM version

Diagnose

Crash happens immediately on model load with an AWQ- or GPTQ-quantized model. Logs show a `KeyError` on layer weights or `mismatched tensor shapes`. The checkpoint was quantized with a different version of the quantization library than the kernels your vLLM build expects.

Fix

Try the unquantized version first to confirm the model itself works. For AWQ: use a vLLM release with native AWQ support and pass `--quantization awq`. For GPTQ: ensure you're passing `--quantization gptq` and that the checkpoint was produced by a GPTQ toolchain your vLLM version supports. As a fallback, switch to FP16 and add a GPU — quantization complexity is often not worth the VRAM savings in production.
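
A quick isolation sketch, assuming the checkpoints below (both are illustrative Hugging Face repos, not recommendations): prove the unquantized weights serve, then load the quantized checkpoint with the method stated explicitly instead of auto-detected:

```bash
# Step 1: unquantized baseline (needs more VRAM, but proves the serving stack itself is fine).
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-13b-chat-hf

# Step 2: the quantized checkpoint, with the quantization method spelled out.
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-13B-chat-GPTQ \
  --quantization gptq
```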

Frequently asked questions

Why does vLLM crash but llama.cpp runs the same model fine?

vLLM's scheduler pre-allocates KV cache at startup based on `--max-model-len` and `--gpu-memory-utilization`. llama.cpp allocates on demand. If vLLM's pre-allocation overcommits, it crashes before serving. Lower `--gpu-memory-utilization` and `--max-model-len` until stable.

How do I debug vLLM worker crashes efficiently?

Start with `--enforce-eager` (disables CUDA graphs, slower but isolates graph-compilation bugs). Set `VLLM_LOGGING_LEVEL=DEBUG`. Run with `--max-num-seqs 1` and one request to isolate the crash. If it works with `--enforce-eager`, the issue is CUDA graph compilation.
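
Put together, a minimal-surface debug launch looks something like this (the model is a placeholder):

```bash
# Shrink the problem: one sequence at a time, no CUDA graphs, verbose logging.
export VLLM_LOGGING_LEVEL=DEBUG
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enforce-eager \
  --max-num-seqs 1
```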

Is tensor parallelism worth the complexity for a 2x consumer GPU setup?

For a single 70B model on 2x 24 GB: yes — it's the difference between 2 tok/s (paging from RAM) and 20-30 tok/s (fully in VRAM). For 7-13B models on 2x 24 GB: no — run separate instances on each GPU instead. The complexity ceiling is real: NCCL on consumer cards through PCIe is not as stable as NVLink on data-center cards.
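
For the small-model case, 'separate instances' is just two launches pinned to different GPUs and ports (ports and model are illustrative):

```bash
# Two independent servers instead of tensor parallelism: no NCCL in the picture at all.
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct --port 8000 &
CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct --port 8001 &
```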

What's the difference between a worker crash and a scheduler crash in vLLM?

A worker crash is the inference process itself dying (OOM, CUDA error, NCCL failure). The scheduler in vLLM's async engine detects the dead worker and reports 'Worker crashed or timed out.' A scheduler crash is the orchestration layer dying (deadlock, race condition, timeout while waiting for a worker). Worker crashes are common and usually a resource/config problem. Scheduler crashes are rarer and usually a vLLM bug or extreme load scenario.

How do I set up health checks so vLLM restarts automatically when a worker crashes?

Wrap vLLM in a process manager. Simplest: `while true; do python -m vllm.entrypoints.openai.api_server ...; sleep 5; done`. More robust: systemd service with `Restart=always` and `RestartSec=10s`. Production-level: Docker with `--restart unless-stopped` and a health check endpoint: `HEALTHCHECK --interval=30s CMD curl -f http://localhost:8000/health || exit 1`. The `/health` endpoint returns 200 when the async engine is alive and processing.
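
A sketch of the systemd option (paths, user, and model are placeholders; adjust `ExecStart` to however you installed vLLM):

```bash
# Write a minimal unit that restarts vLLM 10 seconds after any crash, then enable it.
sudo tee /etc/systemd/system/vllm.service >/dev/null <<'EOF'
[Unit]
Description=vLLM OpenAI-compatible server
After=network-online.target

[Service]
ExecStart=/opt/vllm/venv/bin/python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --port 8000
Restart=always
RestartSec=10s

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload && sudo systemctl enable --now vllm
```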

Can I recover from a worker crash without dropping inflight requests?

No — vLLM doesn't persist generation state across worker crashes. When a worker dies, all in-flight requests are lost. The client must retry. Design clients with idempotent retry logic: set `max_retries=3` on the OpenAI client, use exponential backoff, and keep streamed partial output on the client side so a retry can re-prompt from the last complete token. For production serving, pair vLLM with a load balancer that detects `/health` failures and reroutes traffic.
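
A client-side sketch of the retry-with-backoff idea against the OpenAI-compatible endpoint (endpoint, model name, and prompt are illustrative):

```bash
# Retry a failed completion up to three times, backing off 2s, 4s, 8s between attempts.
for attempt in 1 2 3; do
  curl -sf http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 64}' \
    && break
  echo "attempt ${attempt} failed, backing off" >&2
  sleep $((2 ** attempt))
done
```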

Related troubleshooting

When the fix is hardware

A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: