Python wheel build failed — fix compiler, deps, or fall back to prebuilt
Wheel build failures during `pip install` almost always trace to one of three causes: a missing compiler (gcc / MSVC), missing system headers (Python.h, CUDA), or a Rust-based package without the Rust toolchain. Fix the compiler first, then check whether a prebuilt wheel exists for your platform.
Diagnostic order — most likely first
Compiler missing (gcc on Linux, MSVC on Windows)
Error: 'gcc not found' or 'cl.exe not found.' Pip can't compile C extensions.
Linux: `sudo apt install build-essential python3-dev`. macOS: `xcode-select --install`. Windows: install Visual Studio Build Tools 2022 (workload: Desktop development with C++).
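To confirm the toolchain actually landed, check for the compiler and for Python.h before retrying. A minimal sketch for Linux (the macOS and Windows checks are analogous):

```bash
# A C compiler should now be on PATH
gcc --version

# python3-dev should have placed Python.h under the interpreter's include dir
ls "$(python3 -c "import sysconfig; print(sysconfig.get_paths()['include'])")/Python.h"
```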
CUDA headers missing for CUDA-using build
Error: 'cuda.h not found' or 'cuda_runtime.h missing.' You're building flash-attn, xformers, or a custom CUDA extension without the CUDA Toolkit installed.
Install the CUDA Toolkit (not just the driver): `sudo apt install nvidia-cuda-toolkit` (Linux) or download the CUDA 12.4 installer from nvidia.com (Windows). Set `CUDA_HOME` to the toolkit path (e.g. `/usr/local/cuda-12.4` for NVIDIA's installer; Ubuntu's package often installs under `/usr/lib/cuda`) before retrying.
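It's worth confirming nvcc and the headers are actually visible before retrying pip. A sketch for Linux, assuming the toolkit landed in `/usr/local/cuda-12.4` (adjust the path to your install):

```bash
# Point builds at the toolkit and expose nvcc
export CUDA_HOME=/usr/local/cuda-12.4   # adjust to your install path
export PATH="$CUDA_HOME/bin:$PATH"

# Both should succeed before you retry the pip install
nvcc --version
ls "$CUDA_HOME/include/cuda_runtime.h"
```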
Rust toolchain missing for Rust-based build (tokenizers, hf-transfer)
Error mentions 'cargo not found' or 'rustc missing.' Hugging Face's `tokenizers` and `hf-transfer` packages are Rust-based.
Install Rust: `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh` (Linux/macOS) or use rustup-init.exe on Windows. Source the cargo env and retry.
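After rustup finishes, the shell that ran it still won't see cargo until you source the env file it writes. A minimal sketch, using `tokenizers` (named above) as the retry target:

```bash
# Make cargo and rustc visible in the current shell
source "$HOME/.cargo/env"
cargo --version && rustc --version

# Retry the Rust-based package
pip install tokenizers
```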
Wheel exists for your platform but pip didn't find it
Falling back to a source build means pip didn't find a precompiled wheel matching your platform. Check the project's GitHub releases page or PyPI's 'Download files' list for available wheels.
If the project hosts wheels on its own index, pass it with `--extra-index-url`. For flash-attn, use `pip install flash-attn --no-build-isolation` so the build can see your installed PyTorch. For vllm, `pip install vllm` should pull a wheel; if it starts compiling instead, your Python/CUDA combo isn't precompiled.
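To see why pip skipped the prebuilt wheels, compare the tags your interpreter accepts against the filenames on the release page. `pip debug --verbose` lists the accepted tags, and `pip download --only-binary` fails fast instead of silently compiling (vllm here is just an example package):

```bash
# List the Python/ABI/platform tags pip will accept on this machine
pip debug --verbose | grep -A5 "Compatible tags"

# Fail immediately (instead of building from source) if no wheel matches
pip download vllm --only-binary=:all: -d /tmp/wheels
```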
Insufficient RAM during compile (parallel jobs OOM)
Build starts, then OOMs partway through. Each gcc parallel job uses 1-2 GB of RAM, and the default `-j` count (typically one job per CPU core) is too high for the available RAM.
Lower parallelism: `MAX_JOBS=4 pip install flash-attn --no-build-isolation` (or even MAX_JOBS=2 on tight RAM). Compilation takes longer but won't OOM.
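If you'd rather size `MAX_JOBS` from free memory than guess, one job per ~2 GB (the figure above) is a reasonable rule. A sketch for Linux, assuming a `free` that reports an 'available' column:

```bash
# One compile job per ~2 GB of available RAM, minimum 1
JOBS=$(( $(free -g | awk '/^Mem:/ {print $7}') / 2 ))
MAX_JOBS=$(( JOBS > 0 ? JOBS : 1 )) pip install flash-attn --no-build-isolation
```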
ABI mismatch with installed PyTorch
Build succeeds but import fails with 'undefined symbol' or an ABI-mismatch error. The extension was built against a different PyTorch version than the one installed.
Match versions: reinstall PyTorch to a known-good version first, then rebuild the dependent package against it, passing `--no-cache-dir` so pip doesn't reuse a stale cached wheel.
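A minimal sequence, assuming flash-attn is the failing package and torch 2.5.1 + CUDA 12.4 is your known-good combination (both are placeholders; substitute your own versions):

```bash
# Pin PyTorch to a known-good build first
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124

# Rebuild the extension against it, skipping any stale cached wheel
pip uninstall -y flash-attn
pip install flash-attn --no-cache-dir --no-build-isolation
```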
Frequently asked questions
Why doesn't pip find a prebuilt wheel for my platform?
Wheels are platform-specific (OS + Python version + CPU arch). Bleeding-edge Python (3.13+) or unusual platforms (ARM Linux) often don't have prebuilt wheels for niche packages. The fallback is source build.
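The match is decided entirely by the tags in the wheel filename; every segment has to match a tag your interpreter reports. An illustration with a hypothetical package:

```bash
# name  version  python-tag  abi-tag  platform-tag
# foo   2.1.0    cp311       cp311    manylinux_2_17_x86_64
#
#   foo-2.1.0-cp311-cp311-manylinux_2_17_x86_64.whl
#
# If that cp311/manylinux pair isn't in your `pip debug --verbose`
# output, pip falls back to the sdist and compiles.
```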
Should I use uv instead of pip?
Yes for most workflows. uv resolves and installs 10-100x faster than pip, has better dependency resolution, and gives clearer error messages. `pip install uv`, then use `uv pip install <package>`.
Why does flash-attn always need to compile?
Flash-attn ships prebuilt wheels for common combos (CUDA 12.4 + Python 3.10/3.11/3.12 + Torch 2.4/2.5) but not for every combination. If yours isn't covered, pip compiles from source, which takes 5-15 minutes and uses significant RAM (cap it with MAX_JOBS, as above).
Related troubleshooting
PyTorch falsely reporting no CUDA is the most common Python ML setup failure. The cause is almost always: wrong PyTorch wheel for your CUDA version, or a CPU-only build accidentally installed.
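A one-line check separates the two causes before you follow the link: `torch.version.cuda` is `None` on a CPU-only build.

```bash
# A CPU-only wheel prints "None False"; a working CUDA wheel prints e.g. "12.4 True"
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
```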
vLLM ships pre-built wheels against specific CUDA versions. When your system CUDA differs, you get cryptic kernel-image errors. Here's the version matrix and the fix order.
Most llama.cpp build failures trace to a missing toolkit (CUDA, Metal, Vulkan SDK), wrong compiler version, or a stale CMake cache. Diagnose in order: PATH first, CMake version second, GCC/MSVC third.
When the fix is hardware
A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: