vLLM CUDA version mismatch — pin the right wheel
vLLM ships pre-built wheels against specific CUDA versions. When your system CUDA differs, you get cryptic kernel-image errors. Here's the version matrix and the fix order.
Diagnostic order — most likely first
vLLM wheel built for a CUDA version your driver can't run
`pip install vllm` succeeds, but loading a model then fails with 'no kernel image is available for execution on the device' or 'CUDA error: unsupported PTX version.'
Check `nvidia-smi`: the top-right corner shows the maximum CUDA version the driver supports. Match the wheel to it: `pip install vllm --index-url https://pypi.org/simple` is fine if your driver supports CUDA 12.4+. For older drivers, install a `vllm==0.6.x` release, which was built against CUDA 12.1.
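A quick way to catch this before loading a model is to compare the driver's maximum supported CUDA version against the CUDA version the installed wheel was built for. Since vLLM's wheels are built against a matching PyTorch, `torch.version.cuda` is a reasonable proxy for the pair. A minimal sketch, assuming `torch` is installed and `nvidia-smi` is on the PATH:

```python
# Compare the driver's max supported CUDA (from nvidia-smi) with the CUDA
# version the installed PyTorch wheel was built against. If the wheel is
# newer than the driver, kernel loads fail with "no kernel image" errors.
import re
import subprocess

import torch

def as_tuple(version: str) -> tuple:
    """'12.1' -> (12, 1) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))

smi_output = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
driver_match = re.search(r"CUDA Version:\s*([\d.]+)", smi_output)
driver_cuda = driver_match.group(1) if driver_match else None  # e.g. "12.4"
wheel_cuda = torch.version.cuda                                # e.g. "12.1", None on CPU builds

print(f"driver supports up to CUDA {driver_cuda}; wheel built for CUDA {wheel_cuda}")
if driver_cuda and wheel_cuda and as_tuple(wheel_cuda) > as_tuple(driver_cuda):
    print("Mismatch: pick an older vLLM/PyTorch wheel or upgrade the driver.")
```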
PyTorch + vLLM compiled against different CUDA versions
`python -c 'import torch; print(torch.version.cuda)'` shows, say, `12.4` while vLLM expects 12.1. The ABI mismatch causes load-time failures.
Reinstall PyTorch to match the CUDA version vLLM was built against: `pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121` (or cu124 to match vLLM head). Always pin the two in lockstep.
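To see whether the installed pair is consistent, you can print the torch build's CUDA version next to the torch requirement declared in the installed vLLM wheel's metadata. A minimal sketch, assuming both packages are installed:

```python
# Print the torch build's CUDA version and the torch requirement declared by
# the installed vLLM distribution. vLLM wheels pin a specific torch version,
# so reinstalling torch from a different CUDA index can silently break it.
from importlib.metadata import requires, version

import torch

print("torch", version("torch"), "built for CUDA", torch.version.cuda)
print("vllm ", version("vllm"))

torch_requirements = [r for r in (requires("vllm") or []) if r.startswith("torch")]
print("vLLM's declared torch requirement(s):", torch_requirements)
```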
GPU compute capability not supported by this vLLM build
Older vLLM builds don't support Blackwell (sm_120) or H100 (sm_90) out of the box. The error mentions PTX or compute capability.
Upgrade to vLLM 0.6.5+ for Blackwell support. For Hopper (H100), 0.5.0+ is fine. Verify with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`.
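You can also read the compute capability from Python, which is handy on a remote box where you're already inside the venv. A small sketch using PyTorch's public API:

```python
# List each GPU's compute capability. sm_90 = Hopper (H100),
# sm_120 = Blackwell, per the support notes above.
import torch

for index in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(index)
    name = torch.cuda.get_device_name(index)
    print(f"GPU {index}: {name} -> sm_{major}{minor}")
```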
Conda + pip mixed install corrupted CUDA env
`conda install pytorch` then `pip install vllm` produces conflicting CUDA libraries in the env path.
Start fresh: `conda create -n vllm python=3.11 && conda activate vllm`. Use ONLY pip for vLLM + PyTorch (avoid the conda PyTorch channel if you'll add vLLM later). Alternative: use `uv` for the full dependency tree.
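If you suspect a mixed install, this sketch shows where torch actually loads from and whether the CUDA runtime libraries came in through pip (the `nvidia-*` distributions) or through conda:

```python
# Spot a mixed conda/pip install: pip CUDA-12 torch wheels pull in nvidia-*
# packages (nvidia-cuda-runtime-cu12, nvidia-cublas-cu12, ...), while a
# conda-channel torch gets its CUDA libraries from the conda env instead.
import sys
from importlib.metadata import distributions

import torch

print("python:", sys.executable)
print("torch loaded from:", torch.__file__)

nvidia_packages = sorted(
    (dist.metadata["Name"] or "")
    for dist in distributions()
    if (dist.metadata["Name"] or "").startswith("nvidia-")
)
print("pip-provided CUDA libs:", nvidia_packages or "none (likely conda-provided)")
```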
vLLM CUDA version drift is solvable in software, but if you're constantly chasing the latest CUDA major releases, your hardware build may be the real bottleneck. The guide below frames the hardware decision from a developer's perspective.
Frequently asked questions
What's the recommended Python + CUDA combo for vLLM in 2026?
Python 3.11, CUDA 12.4, NVIDIA driver 550+. This is the path with the broadest ecosystem support in 2026: vLLM, PyTorch, and TensorRT-LLM all ship wheels built against CUDA 12.4, and Transformers runs on top of whichever PyTorch build you install.
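A one-shot check of your environment against that combo might look like the sketch below; the target versions are this guide's recommendation, not requirements enforced by vLLM itself.

```python
# Report Python, torch-CUDA, and driver versions next to the recommended
# 2026 targets (Python 3.11, CUDA 12.4, driver 550+).
import subprocess
import sys

import torch

query = ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
smi_out = subprocess.run(query, capture_output=True, text=True).stdout.strip()
driver_version = smi_out.splitlines()[0] if smi_out else "unknown"

print("python :", sys.version.split()[0], "(recommended 3.11.x)")
print("cuda   :", torch.version.cuda, "(recommended 12.4)")
print("driver :", driver_version, "(recommended 550+)")
```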
Why does vLLM care so much about CUDA version when llama.cpp doesn't?
vLLM ships pre-built CUDA kernels (custom attention, paged KV cache) that are tied to a specific CUDA runtime. llama.cpp compiles its CUDA kernels at build time against whatever toolkit you have, or falls back to more generic code paths. The trade-off: vLLM is faster on supported configs; llama.cpp is more portable.
Can I run vLLM on a card without CUDA (AMD, Apple)?
vLLM has experimental ROCm support for AMD GPUs via a dedicated ROCm build (from source or the ROCm Docker image). Apple Metal is not supported as of 2026. For Apple Silicon, llama.cpp or MLX are the paths.
Related troubleshooting
PyTorch reporting that CUDA is unavailable despite a working GPU is the most common Python ML setup failure. The cause is almost always a PyTorch wheel built for the wrong CUDA version, or an accidentally installed CPU-only build.
Why CUDA OOM happens during local LLM inference and image gen, how to confirm the real cause, and the four real fixes (smaller quant, shorter context, gradient checkpointing, or more VRAM).
FlashAttention 2 / 3 require specific compute capabilities. Older GPUs and consumer Pascal/Turing cards don't support it. Here's the support matrix and the runtime fallbacks.
When the fix is hardware
A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: