Local AI troubleshooting
Fix the most common local AI errors: CUDA out of memory, Ollama running on CPU, ROCm not detected, models crashing mid-inference. Operator-grade diagnostics, real fixes, no copy-paste-from-Reddit guesses.
Most common errors
CUDA out of memory
Why CUDA OOM happens during local LLM inference and image gen, how to confirm the real cause, and the four real fixes (smaller quant, shorter context, CPU offload, or more VRAM).
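A quick way to separate load-time OOM from mid-generation OOM, assuming an NVIDIA card on a standard Linux install:

```bash
# Watch VRAM in real time while the model loads and generates.
# OOM at load time points to model size; OOM mid-generation points to
# KV cache / context growth.
watch -n 1 nvidia-smi
```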
Ollama is slow / running on CPU instead of GPU
Ollama silently falls back to CPU when it can't load a model into VRAM. Here's how to confirm the fallback, force GPU usage, and pick a model that actually fits.
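A minimal check, assuming a recent Ollama build that reports the processor split:

```bash
# Lists loaded models and where they live
ollama ps
# The PROCESSOR column should read "100% GPU"; "100% CPU" or a
# CPU/GPU split means the model didn't fit in VRAM.
```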
ROCm not detected / AMD GPU not found
ROCm is finicky on consumer AMD GPUs in 2026. Here's the install order, the gfx-version override that fixes 80% of detection failures, and when to give up and use Vulkan.
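A sketch of the override, assuming an RDNA2-class consumer card (the value differs per GPU generation):

```bash
# Confirm whether ROCm enumerates the GPU at all
rocminfo | grep -i gfx

# Spoof a supported gfx target: 10.3.0 for many RDNA2 cards,
# 11.0.0 for many RDNA3 cards
export HSA_OVERRIDE_GFX_VERSION=10.3.0
```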
WSL2 cannot see GPU / nvidia-smi fails inside WSL
WSL2 doesn't pass the GPU through unless the host driver is right and the kernel is current. Here's the install order that actually works in 2026, and how to confirm passthrough is live before you waste an afternoon.
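Two quick checks inside the distro, assuming an NVIDIA card:

```bash
# The WSL virtual GPU device must exist
ls -l /dev/dxg

# nvidia-smi inside WSL comes from the Windows host driver; if it fails,
# fix the host driver before touching anything inside the distro
nvidia-smi
```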
Docker container cannot access GPU / `--gpus all` fails
Docker doesn't expose the host GPU by default. The NVIDIA Container Toolkit is the bridge. Here's the install + the runtime config + the four common symptoms that mean it's misconfigured.
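A minimal setup plus smoke test, assuming the NVIDIA Container Toolkit package is already installed (the image tag is just an example):

```bash
# Register the NVIDIA runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# If this prints the GPU table, passthrough works
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```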
vLLM: CUDA version mismatch / 'no kernel image is available for execution'
vLLM ships pre-built wheels against specific CUDA versions. When your system CUDA differs, you get cryptic kernel-image errors. Here's the version matrix and the fix order.
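A quick way to see the mismatch, assuming vLLM was installed via pip:

```bash
# CUDA version the installed PyTorch / vLLM wheels were built against
python -c "import torch; print(torch.version.cuda)"

# Maximum CUDA version the driver supports (shown in the header)
nvidia-smi | head -n 4
```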
llama.cpp Metal: GGML_ASSERT / mtl_buffer crash on macOS
Most Metal crashes in llama.cpp on Apple Silicon trace to an overly aggressive context size, an old GGUF format, or a model with tensor shapes Metal has no kernel for. Diagnostic + fix order.
Ollama: 'address already in use' / port 11434 conflict
Ollama defaults to port 11434. When something else is on that port — often a previous Ollama process, Docker container, or another LLM server — startup fails. Here's how to find the squatter and reclaim the port.
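Finding and evicting the squatter, assuming Linux (port 11500 is an arbitrary example):

```bash
# Who is holding 11434?
sudo lsof -i :11434        # or: ss -ltnp | grep 11434

# If you'd rather move Ollama than kill the other process
export OLLAMA_HOST=127.0.0.1:11500
ollama serve
```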
GGUF tokenizer mismatch / 'tokenizer model not found'
When llama.cpp / Ollama outputs garbled text or repeats tokens infinitely, the tokenizer baked into the GGUF doesn't match the runtime's expectations. Here's how to confirm and fix.
FlashAttention: 'kernel not supported' / not available on this GPU
FlashAttention 2 / 3 require specific compute capabilities. Older GPUs and consumer Pascal/Turing cards don't support it. Here's the support matrix and the runtime fallbacks.
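A one-liner to check where your card sits, assuming PyTorch is installed:

```bash
# FlashAttention 2 generally needs compute capability 8.0+ (Ampere or newer)
python -c "import torch; print(torch.cuda.get_device_capability())"
```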
torch.cuda.is_available() returns False
PyTorch falsely reporting no CUDA is the most common Python ML setup failure. The cause is almost always: wrong PyTorch wheel for your CUDA version, or a CPU-only build accidentally installed.
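A minimal diagnostic and reinstall, with cu124 as an example index (match it to your driver):

```bash
# A CPU-only wheel is the usual culprit; check what's actually installed
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

# Reinstall the CUDA build from the PyTorch index
pip uninstall -y torch
pip install torch --index-url https://download.pytorch.org/whl/cu124
```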
ROCm: HSA_STATUS_ERROR / HIP runtime errors during inference
HSA / HIP errors mid-inference on AMD GPUs usually trace to thermal limits, kernel-driver mismatch, or known-bad memory modes on consumer cards. Here's the diagnostic order.
SGLang: server hangs / requests time out
SGLang server hangs on startup or stops responding mid-load mostly trace to: request batching saturation, KV cache miss-sizing, scheduler deadlock, or a runtime-CUDA mismatch. Here's the order.
SGLang servers that hang on startup or stop responding mid-load mostly trace to request-batching saturation, a mis-sized KV cache, a scheduler deadlock, or a runtime/CUDA mismatch. Here's the diagnostic order.
ComfyUI stuck on 'loading' / first run never completes
ComfyUI hanging on first launch is usually a custom-node conflict, model file corruption, or a Python env collision with A1111. Bisect via --disable-all-custom-nodes and you'll catch 80% of cases in 30 seconds.
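The bisect in one line, assuming a source install launched with main.py:

```bash
# If this starts cleanly, a custom node is the problem;
# re-enable nodes in halves to find it
python main.py --disable-all-custom-nodes
```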
PyTorch MPS falling back to CPU on Apple Silicon
PyTorch on Apple Silicon hits CPU fallback when an op isn't supported by MPS, and the only hint is a speed drop. PYTORCH_ENABLE_MPS_FALLBACK=1 turns the hard error into a logged CPU fallback so the run completes; the lasting fix is the op itself (cast dtype, disable flash-attention, lower batch).
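A minimal sketch (the script name is hypothetical):

```bash
# Without this, unsupported MPS ops raise an error; with it, they run on
# CPU and PyTorch logs a warning naming the op that fell back
export PYTORCH_ENABLE_MPS_FALLBACK=1
python train_or_infer.py
```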
llama.cpp build failed (CUDA / Metal / Vulkan flags rejected)
Most llama.cpp build failures trace to a missing toolkit (CUDA, Metal, Vulkan SDK), wrong compiler version, or a stale CMake cache. Diagnose in order: PATH first, CMake version second, GCC/MSVC third.
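A clean CUDA rebuild, assuming a recent llama.cpp tree (older trees used LLAMA_CUBLAS instead of GGML_CUDA):

```bash
# Stale CMake caches cause a surprising share of these failures: start clean
rm -rf build
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```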
WSL2 OOM-killer killing inference / 'Killed' message
WSL2 inherits a fraction of host RAM by default and won't let processes exceed it. Edit .wslconfig to set `memory=32GB` (or whatever you need) and restart WSL. Then verify with `free -h` inside the distro.
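The config lives on the Windows side; 32GB is just an example value:

```bash
# %UserProfile%\.wslconfig (Windows side, not inside the distro):
#   [wsl2]
#   memory=32GB
#   swap=8GB

# From PowerShell: wsl --shutdown
# Then, back inside the distro, confirm the new ceiling
free -h
```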
NVIDIA driver / CUDA toolkit version mismatch
When PyTorch / vLLM / a CUDA app errors on 'CUDA driver version is insufficient' or 'no kernel image,' the host driver is too old (or sometimes too new) for the installed toolkit. Read nvidia-smi's max-CUDA, match it.
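The two numbers to compare, assuming the CUDA toolkit is installed locally:

```bash
# Highest CUDA version the driver supports (shown in the header)
nvidia-smi | head -n 4

# Toolkit version actually installed; it must not exceed the driver's ceiling
nvcc --version | grep release
```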
Windows: CUDA not found / 'Could not load nvcuda.dll'
Windows CUDA loading errors trace to a driver-vs-toolkit version skew, a PATH that doesn't include CUDA bin, or a CPU-only PyTorch wheel. Check nvidia-smi first, then the wheel suffix, then PATH.
llama.cpp running too slow / CPU-bound on supposedly-GPU build
If llama.cpp tok/s is 5-10x lower than expected on your GPU, the build probably defaulted to CPU, the model is partially CPU-offloaded, or flash-attention isn't enabled. Diagnose in 60 seconds with --verbose.
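A 60-second check, assuming a recent build (the binary was ./main in older trees; the model path is a placeholder):

```bash
# Watch the load log for "offloaded X/Y layers to GPU";
# -ngl 999 requests offload of every layer that fits
./llama-cli -m model.gguf -ngl 999 -p "hello" --verbose
```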
MLX: out of memory / 'Failed to allocate memory'
MLX OOM on Apple Silicon traces to a model too large for unified memory, a wired-memory limit set too low, or memory pressure from other apps. macOS reserves 25-30% for the system; the rest is your AI budget.
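Raising the wired limit is a common workaround; the value below is an example and resets on reboot:

```bash
# Let the GPU wire more of unified memory (value in MB);
# leave several GB for macOS itself
sudo sysctl iogpu.wired_limit_mb=24576
```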
Ollama: 'model not found' / 'pull manifest unknown' errors
Ollama 'model not found' errors trace to typos in the model name, pulling a model that doesn't exist in the official registry, network blocks on the registry, or pulling from a custom registry without auth.
HuggingFace download failed / 401 / rate-limit / network error
HuggingFace download errors split into auth (gated model, no token), rate-limit (anonymous traffic capped), or network (corporate proxy, country block). Diagnose by HTTP status code, fix per cause.
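The auth fix in two lines (the token shown is a placeholder):

```bash
# Interactive login stores the token for future downloads
huggingface-cli login

# Or set it per-run for scripts and CI
export HF_TOKEN=hf_xxxxxxxxxxxx
```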
Tensor parallelism: NCCL crash / 'unable to allocate' / 'distributed init failed'
Multi-GPU tensor-parallel crashes trace to NCCL backend issues (PCIe topology, missing peer access), insufficient GPU pair memory, or tensor-parallel-size not matching GPU count. Diagnose with NCCL_DEBUG=INFO.
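A minimal sketch of the debug run; the model name and GPU count are examples:

```bash
# Surface the real NCCL failure instead of the generic crash
export NCCL_DEBUG=INFO

# tensor-parallel-size must match the number of visible GPUs
vllm serve meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2
```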
ExLlamaV2: model not loading / 'Could not find model index' / cache OOM
ExLlamaV2 load failures trace to the wrong model format (it needs EXL2 or EXL3, not GGUF), insufficient cache for the context size, or a driver/runtime version mismatch. An EXL2/EXL3 quant is non-negotiable.
Quantized model: noticeable quality loss / repetition / coherence drop
Output quality drop after quantization usually means the bpw is too aggressive, KV cache quantization is too low, or the calibration data didn't match the model. Q4_K_M is the safe floor; below that needs care.
Tokenizer mismatch / 'Unknown token' / 'Token ID out of range'
Tokenizer errors usually mean the loaded tokenizer doesn't match the model weights, the chat template is wrong, or special tokens (BOS/EOS) weren't preserved through quantization. Verify tokenizer config first.
CUDA driver too old / 'CUDA driver version is insufficient'
If PyTorch / vLLM / CUDA app errors with 'driver version insufficient,' your NVIDIA driver predates the CUDA runtime. Driver 555+ supports CUDA 12.4 (the 2026 standard). Update via nvidia.com or distro.
Python: wheel build failed / 'Failed building wheel for X'
Wheel build failures in pip install almost always trace to: missing compiler (gcc / MSVC), missing system headers (Python.h, CUDA), or a Rust-based package without the Rust toolchain. Fix compiler first, then verify wheel availability.
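The usual Debian/Ubuntu fix, plus a fast way to learn whether a prebuilt wheel even exists (the package name is a placeholder):

```bash
# Compiler and Python headers cover most native build failures
sudo apt install build-essential python3-dev

# Fails fast if no prebuilt wheel is published for your platform
pip install --only-binary=:all: some-package
```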
safetensors: 'header validation failed' / 'invalid format'
Safetensors header errors mean the file is corrupted, partially downloaded, or isn't actually a safetensors file. Check the file size against the repo, re-download if it doesn't match, and use a download tool that verifies checksums.
Model keeps crashing / segfault during inference
Mid-inference crashes (segfault, illegal memory access, kernel panic) usually mean failing VRAM (ECC errors), thermal throttling, PSU instability, or a bad model file. Here's the diagnostic order.
Windows LLM install failed / Python CUDA not found
Why first-time Windows AI installs fail, how to fix each link in the driver-CUDA-Python chain, and the specific download links that actually work.
Model loads but generation is slow / tok/s far below expectation
When the model loads (no OOM) but token generation is far below expected speeds, the bottleneck is usually VRAM paging, KV cache overcommit, or GPU contention. Here's how to diagnose and fix each.
Token generation too slow / low throughput across runtimes
Slow token generation across multiple runtimes (not specific to Ollama or vLLM) means a system-level bottleneck: GPU underutilization, missing flash-attention, wrong thread count, thermal throttle, or VRAM paging.
ComfyUI CUDA out of memory
ComfyUI-specific CUDA OOM: what triggers it (loaded checkpoints, IPAdapter/ControlNet overhead, missing --lowvram), how to fix it, and the ComfyUI settings that matter.
vLLM worker crashed / vLLM scheduler crash
vLLM worker/scheduler crashes: KV cache fraction misconfiguration, max-model-len exceeding VRAM, worker timeouts, NCCL failures, and quant incompatibility. The exact fix order that production operators use.
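The two flags behind most of these crashes; the model placeholder and values are starting points, not gospel:

```bash
# Cap the KV cache fraction and the context length to what actually fits
vllm serve <model> --gpu-memory-utilization 0.85 --max-model-len 8192
```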
TensorRT-LLM build failed / TensorRT-LLM compilation failed
TensorRT-LLM compilation/build failures: missing CUDA arch flag, version mismatches, Python wheel OOM, and NVCC compute capability issues. Honest advice: for most users, vLLM is the saner path.
ONNX Runtime falls back to CPU / ONNX Runtime GPU not used
ONNX Runtime silently falls back to CPU even with a GPU present. Fix the provider registration, package choice, and model export to get GPU inference working.
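A minimal check, assuming the pip install path:

```bash
# onnxruntime and onnxruntime-gpu conflict; only the GPU package has CUDA
pip uninstall -y onnxruntime
pip install onnxruntime-gpu

# CUDAExecutionProvider must appear here, and your session must request it
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
```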
bitsandbytes: CUDA error / 'CUDA Setup failed despite GPU being available'
bitsandbytes silently breaks after PyTorch or NVIDIA driver updates. The fix is usually a reinstall with the right CUDA version, or switching to a prebuilt wheel. Here's the diagnostic order.
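The built-in self-check usually names the problem outright:

```bash
# Prints which CUDA libraries bitsandbytes found (or didn't)
python -m bitsandbytes

# If it reports a version mismatch, reinstall against the current stack
pip install --upgrade --force-reinstall bitsandbytes
```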
HuggingFace 429 Too Many Requests / rate limit exceeded
HuggingFace returns HTTP 429 when you exceed the anonymous rate limit. A free account + token raises your ceiling dramatically. Here's exactly what triggers it, how to authenticate, and how to batch downloads so you never hit it again.
GGUF corrupt on disk / 'invalid magic number' / 'failed to read model file'
A corrupt GGUF file fails with cryptic magic-number or read errors. Here's how to validate the file without loading it, identify the corruption, and re-download only the damaged parts.
WSL: systemd not running / 'System has not been booted with systemd as init'
WSL2 defaults to a non-systemd init for speed. For Docker, NVIDIA Container Toolkit, and multi-service AI stacks, you need systemd enabled. Here's how to turn it on and verify it's running.
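The switch lives in /etc/wsl.conf inside the distro:

```bash
# /etc/wsl.conf:
#   [boot]
#   systemd=true

# From PowerShell: wsl --shutdown, reopen the distro, then verify
ps -p 1 -o comm=    # should print "systemd", not "init"
```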
FlashAttention build failed on Windows / 'nvcuda.dll not found' / MSVC linker errors
FlashAttention compilation on Windows is the most common build failure in the local AI stack. The three real fixes: a prebuilt wheel, WSL2, or switching to SDPA.
Don't see your error?
We're building the troubleshooting library by the highest-volume queries first. If you're hitting an error that isn't covered, the diagnostic patterns here usually transfer: check VRAM headroom, check thermals, check driver versions, check the model file. Most local AI failures fall in those four buckets.