vLLM CUDA version mismatch — pin the right wheel
vLLM ships pre-built wheels against specific CUDA versions. When your system CUDA differs, you get cryptic kernel-image errors. Here's the version matrix and the fix order.
Diagnostic order — most likely first
vLLM wheel built for a CUDA version your driver can't run
`pip install vllm` succeeds, but loading a model then fails with 'no kernel image is available for execution on the device' or 'CUDA error: unsupported PTX version.'
Check `nvidia-smi`: the top-right corner shows the maximum CUDA version the driver supports. Match the wheel to it: `pip install vllm --index-url https://pypi.org/simple` is fine if your driver supports CUDA 12.4+. For older drivers, install a `vllm==0.6.x` release, which was built against CUDA 12.1.
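A quick way to catch this before loading a model is to compare the driver's maximum supported CUDA version against the CUDA version the installed wheel was built for. Since vLLM's wheels are built against a matching PyTorch, `torch.version.cuda` is a reasonable proxy for the pair. A minimal sketch, assuming `torch` is installed and `nvidia-smi` is on the PATH:

```python
# Compare the driver's max supported CUDA (from nvidia-smi) with the CUDA
# version the installed PyTorch wheel was built against. If the wheel is
# newer than the driver, kernel loads fail with "no kernel image" errors.
import re
import subprocess

import torch

def as_tuple(version: str) -> tuple:
    """'12.1' -> (12, 1) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))

smi_output = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
driver_match = re.search(r"CUDA Version:\s*([\d.]+)", smi_output)
driver_cuda = driver_match.group(1) if driver_match else None  # e.g. "12.4"
wheel_cuda = torch.version.cuda                                # e.g. "12.1", None on CPU builds

print(f"driver supports up to CUDA {driver_cuda}; wheel built for CUDA {wheel_cuda}")
if driver_cuda and wheel_cuda and as_tuple(wheel_cuda) > as_tuple(driver_cuda):
    print("Mismatch: pick an older vLLM/PyTorch wheel or upgrade the driver.")
```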
PyTorch + vLLM compiled against different CUDA versions
`python -c 'import torch; print(torch.version.cuda)'` shows, say, `12.4` while vLLM expects 12.1. The ABI mismatch causes load-time failures.
Reinstall PyTorch to match the CUDA version vLLM was built against: `pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121` (or cu124 to match vLLM head). Always pin the two in lockstep.
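To see whether the installed pair is consistent, you can print the torch build's CUDA version next to the torch requirement declared in the installed vLLM wheel's metadata. A minimal sketch, assuming both packages are installed:

```python
# Print the torch build's CUDA version and the torch requirement declared by
# the installed vLLM distribution. vLLM wheels pin a specific torch version,
# so reinstalling torch from a different CUDA index can silently break it.
from importlib.metadata import requires, version

import torch

print("torch", version("torch"), "built for CUDA", torch.version.cuda)
print("vllm ", version("vllm"))

torch_requirements = [r for r in (requires("vllm") or []) if r.startswith("torch")]
print("vLLM's declared torch requirement(s):", torch_requirements)
```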
GPU compute capability not supported by this vLLM build
Older vLLM builds don't support Blackwell (sm_120) or H100 (sm_90) out of the box. The error mentions PTX or compute capability.
Upgrade to vLLM 0.6.5+ for Blackwell support. For Hopper (H100), 0.5.0+ is fine. Verify with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`.
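You can also read the compute capability from Python, which is handy on a remote box where you're already inside the venv. A small sketch using PyTorch's public API:

```python
# List each GPU's compute capability. sm_90 = Hopper (H100),
# sm_120 = Blackwell, per the support notes above.
import torch

for index in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(index)
    name = torch.cuda.get_device_name(index)
    print(f"GPU {index}: {name} -> sm_{major}{minor}")
```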
Conda + pip mixed install corrupted CUDA env
`conda install pytorch` then `pip install vllm` produces conflicting CUDA libraries in the env path.
Start fresh: `conda create -n vllm python=3.11 && conda activate vllm`. Use ONLY pip for vLLM + PyTorch (avoid the conda PyTorch channel if you'll add vLLM later). Alternative: use `uv` for the full dependency tree.
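If you suspect a mixed install, this sketch shows where torch actually loads from and whether the CUDA runtime libraries came in through pip (the `nvidia-*` distributions) or through conda:

```python
# Spot a mixed conda/pip install: pip CUDA-12 torch wheels pull in nvidia-*
# packages (nvidia-cuda-runtime-cu12, nvidia-cublas-cu12, ...), while a
# conda-channel torch gets its CUDA libraries from the conda env instead.
import sys
from importlib.metadata import distributions

import torch

print("python:", sys.executable)
print("torch loaded from:", torch.__file__)

nvidia_packages = sorted(
    (dist.metadata["Name"] or "")
    for dist in distributions()
    if (dist.metadata["Name"] or "").startswith("nvidia-")
)
print("pip-provided CUDA libs:", nvidia_packages or "none (likely conda-provided)")
```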
vLLM CUDA version drift is solvable in software, but if you're constantly chasing the latest CUDA major releases, your hardware build may be the real bottleneck. The guide below frames the hardware decision from a developer's perspective.
Frequently asked questions
What's the recommended Python + CUDA combo for vLLM in 2026?
Python 3.11, CUDA 12.4, NVIDIA driver 550+. This is the path with the broadest ecosystem support in 2026: vLLM, PyTorch, and TensorRT-LLM all ship wheels built against CUDA 12.4, and Transformers runs on top of whichever PyTorch build you install.
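A one-shot check of your environment against that combo might look like the sketch below; the target versions are this guide's recommendation, not requirements enforced by vLLM itself.

```python
# Report Python, torch-CUDA, and driver versions next to the recommended
# 2026 targets (Python 3.11, CUDA 12.4, driver 550+).
import subprocess
import sys

import torch

query = ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
smi_out = subprocess.run(query, capture_output=True, text=True).stdout.strip()
driver_version = smi_out.splitlines()[0] if smi_out else "unknown"

print("python :", sys.version.split()[0], "(recommended 3.11.x)")
print("cuda   :", torch.version.cuda, "(recommended 12.4)")
print("driver :", driver_version, "(recommended 550+)")
```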
Why does vLLM care so much about CUDA version when llama.cpp doesn't?
vLLM ships pre-built CUDA kernels (custom attention, paged KV cache) that are tied to a specific CUDA runtime. llama.cpp compiles its CUDA kernels at build time against whatever toolkit you have, or falls back to more generic code paths. The trade-off: vLLM is faster on supported configs; llama.cpp is more portable.
Can I run vLLM on a card without CUDA (AMD, Apple)?
vLLM has experimental ROCm support for AMD GPUs via a dedicated ROCm build (from source or the ROCm Docker image). Apple Metal is not supported as of 2026. For Apple Silicon, llama.cpp or MLX are the paths.
Related troubleshooting
PyTorch reporting that CUDA is unavailable despite a working GPU is the most common Python ML setup failure. The cause is almost always a PyTorch wheel built for the wrong CUDA version, or an accidentally installed CPU-only build.
Why CUDA OOM happens during local LLM inference and image gen, how to confirm the real cause, and the four real fixes (smaller quant, shorter context, gradient checkpointing, or more VRAM).
FlashAttention 2 / 3 require specific compute capabilities. Older GPUs and consumer Pascal/Turing cards don't support it. Here's the support matrix and the runtime fallbacks.
When the fix is hardware
A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: