Local AI troubleshooting
Fix the most common local AI errors: CUDA out of memory, Ollama running on CPU, ROCm not detected, models crashing mid-inference. Operator-grade diagnostics, real fixes, no copy-paste-from-Reddit guesses.
Most common errors
CUDA out of memory
Why CUDA OOM happens during local LLM inference and image gen, how to confirm the real cause, and the four real fixes (smaller quant, shorter context, CPU offload, or more VRAM).
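A quick way to separate load-time OOM from mid-generation OOM, assuming an NVIDIA card on a standard Linux install:

```bash
# Watch VRAM in real time while the model loads and generates.
# OOM at load time points to model size; OOM mid-generation points to
# KV cache / context growth.
watch -n 1 nvidia-smi
```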
Ollama is slow / running on CPU instead of GPU
Ollama silently falls back to CPU when it can't load a model into VRAM. Here's how to confirm the fallback, force GPU usage, and pick a model that actually fits.
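A minimal check, assuming a recent Ollama build that reports the processor split:

```bash
# Lists loaded models and where they live
ollama ps
# The PROCESSOR column should read "100% GPU"; "100% CPU" or a
# CPU/GPU split means the model didn't fit in VRAM.
```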
ROCm not detected / AMD GPU not found
ROCm is finicky on consumer AMD GPUs in 2026. Here's the install order, the gfx-version override that fixes 80% of detection failures, and when to give up and use Vulkan.
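A sketch of the override, assuming an RDNA2-class consumer card (the value differs per GPU generation):

```bash
# Confirm whether ROCm enumerates the GPU at all
rocminfo | grep -i gfx

# Spoof a supported gfx target: 10.3.0 for many RDNA2 cards,
# 11.0.0 for many RDNA3 cards
export HSA_OVERRIDE_GFX_VERSION=10.3.0
```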
WSL2 cannot see GPU / nvidia-smi fails inside WSL
WSL2 doesn't pass the GPU through unless the host driver is right and the kernel is current. Here's the install order that actually works in 2026, and how to confirm passthrough is live before you waste an afternoon.
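Two quick checks inside the distro, assuming an NVIDIA card:

```bash
# The WSL virtual GPU device must exist
ls -l /dev/dxg

# nvidia-smi inside WSL comes from the Windows host driver; if it fails,
# fix the host driver before touching anything inside the distro
nvidia-smi
```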
Docker container cannot access GPU / `--gpus all` fails
Docker doesn't expose the host GPU by default. The NVIDIA Container Toolkit is the bridge. Here's the install + the runtime config + the four common symptoms that mean it's misconfigured.
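A minimal setup plus smoke test, assuming the NVIDIA Container Toolkit package is already installed (the image tag is just an example):

```bash
# Register the NVIDIA runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# If this prints the GPU table, passthrough works
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```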
vLLM: CUDA version mismatch / 'no kernel image is available for execution'
vLLM ships pre-built wheels against specific CUDA versions. When your system CUDA differs, you get cryptic kernel-image errors. Here's the version matrix and the fix order.
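A quick way to see the mismatch, assuming vLLM was installed via pip:

```bash
# CUDA version the installed PyTorch / vLLM wheels were built against
python -c "import torch; print(torch.version.cuda)"

# Maximum CUDA version the driver supports (shown in the header)
nvidia-smi | head -n 4
```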
llama.cpp Metal: GGML_ASSERT / mtl_buffer crash on macOS
Most Metal crashes in llama.cpp on Apple Silicon trace to an overly aggressive context size, an old GGUF format, or a model with tensor shapes Metal has no kernel for. Diagnostic + fix order.
Ollama: 'address already in use' / port 11434 conflict
Ollama defaults to port 11434. When something else is on that port — often a previous Ollama process, Docker container, or another LLM server — startup fails. Here's how to find the squatter and reclaim the port.
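Finding and evicting the squatter, assuming Linux (port 11500 is an arbitrary example):

```bash
# Who is holding 11434?
sudo lsof -i :11434        # or: ss -ltnp | grep 11434

# If you'd rather move Ollama than kill the other process
export OLLAMA_HOST=127.0.0.1:11500
ollama serve
```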
GGUF tokenizer mismatch / 'tokenizer model not found'
When llama.cpp / Ollama outputs garbled text or repeats tokens infinitely, the tokenizer baked into the GGUF doesn't match the runtime's expectations. Here's how to confirm and fix.
FlashAttention: 'kernel not supported' / not available on this GPU
FlashAttention 2 / 3 require specific compute capabilities. Older GPUs and consumer Pascal/Turing cards don't support it. Here's the support matrix and the runtime fallbacks.
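A one-liner to check where your card sits, assuming PyTorch is installed:

```bash
# FlashAttention 2 generally needs compute capability 8.0+ (Ampere or newer)
python -c "import torch; print(torch.cuda.get_device_capability())"
```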
torch.cuda.is_available() returns False
PyTorch falsely reporting no CUDA is the most common Python ML setup failure. The cause is almost always: wrong PyTorch wheel for your CUDA version, or a CPU-only build accidentally installed.
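A minimal diagnostic and reinstall, with cu124 as an example index (match it to your driver):

```bash
# A CPU-only wheel is the usual culprit; check what's actually installed
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

# Reinstall the CUDA build from the PyTorch index
pip uninstall -y torch
pip install torch --index-url https://download.pytorch.org/whl/cu124
```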
ROCm: HSA_STATUS_ERROR / HIP runtime errors during inference
HSA / HIP errors mid-inference on AMD GPUs usually trace to thermal limits, kernel-driver mismatch, or known-bad memory modes on consumer cards. Here's the diagnostic order.
SGLang: server hangs / requests time out
SGLang server hangs on startup or stops responding mid-load mostly trace to: request batching saturation, KV cache miss-sizing, scheduler deadlock, or a runtime-CUDA mismatch. Here's the order.
SGLang servers that hang on startup or stop responding mid-load mostly trace to request-batching saturation, a mis-sized KV cache, a scheduler deadlock, or a runtime/CUDA mismatch. Here's the diagnostic order.
ComfyUI stuck on 'loading' / first run never completes
ComfyUI hanging on first launch is usually a custom-node conflict, model file corruption, or a Python env collision with A1111. Bisect via --disable-all-custom-nodes and you'll catch 80% of cases in 30 seconds.
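The bisect in one line, assuming a source install launched with main.py:

```bash
# If this starts cleanly, a custom node is the problem;
# re-enable nodes in halves to find it
python main.py --disable-all-custom-nodes
```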
PyTorch MPS falling back to CPU on Apple Silicon
PyTorch on Apple Silicon hits CPU fallback when an op isn't supported by MPS, and the only hint is a speed drop. PYTORCH_ENABLE_MPS_FALLBACK=1 turns the hard error into a logged CPU fallback so the run completes; the lasting fix is the op itself (cast dtype, disable flash-attention, lower batch).
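A minimal sketch (the script name is hypothetical):

```bash
# Without this, unsupported MPS ops raise an error; with it, they run on
# CPU and PyTorch logs a warning naming the op that fell back
export PYTORCH_ENABLE_MPS_FALLBACK=1
python train_or_infer.py
```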
llama.cpp build failed (CUDA / Metal / Vulkan flags rejected)
Most llama.cpp build failures trace to a missing toolkit (CUDA, Metal, Vulkan SDK), wrong compiler version, or a stale CMake cache. Diagnose in order: PATH first, CMake version second, GCC/MSVC third.
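A clean CUDA rebuild, assuming a recent llama.cpp tree (older trees used LLAMA_CUBLAS instead of GGML_CUDA):

```bash
# Stale CMake caches cause a surprising share of these failures: start clean
rm -rf build
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```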
WSL2 OOM-killer killing inference / 'Killed' message
WSL2 inherits a fraction of host RAM by default and won't let processes exceed it. Edit .wslconfig to set `memory=32GB` (or whatever you need) and restart WSL. Then verify with `free -h` inside the distro.
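The config lives on the Windows side; 32GB is just an example value:

```bash
# %UserProfile%\.wslconfig (Windows side, not inside the distro):
#   [wsl2]
#   memory=32GB
#   swap=8GB

# From PowerShell: wsl --shutdown
# Then, back inside the distro, confirm the new ceiling
free -h
```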
NVIDIA driver / CUDA toolkit version mismatch
When PyTorch / vLLM / a CUDA app errors on 'CUDA driver version is insufficient' or 'no kernel image,' the host driver is too old (or sometimes too new) for the installed toolkit. Read nvidia-smi's max-CUDA, match it.
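The two numbers to compare, assuming the CUDA toolkit is installed locally:

```bash
# Highest CUDA version the driver supports (shown in the header)
nvidia-smi | head -n 4

# Toolkit version actually installed; it must not exceed the driver's ceiling
nvcc --version | grep release
```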
Windows: CUDA not found / 'Could not load nvcuda.dll'
Windows CUDA loading errors trace to a driver-vs-toolkit version skew, a PATH that doesn't include CUDA bin, or a CPU-only PyTorch wheel. Check nvidia-smi first, then the wheel suffix, then PATH.
llama.cpp running too slow / CPU-bound on supposedly-GPU build
If llama.cpp tok/s is 5-10x lower than expected on your GPU, the build probably defaulted to CPU, the model is partially CPU-offloaded, or flash-attention isn't enabled. Diagnose in 60 seconds with --verbose.
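A 60-second check, assuming a recent build (the binary was ./main in older trees; the model path is a placeholder):

```bash
# Watch the load log for "offloaded X/Y layers to GPU";
# -ngl 999 requests offload of every layer that fits
./llama-cli -m model.gguf -ngl 999 -p "hello" --verbose
```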
MLX: out of memory / 'Failed to allocate memory'
MLX OOM on Apple Silicon traces to a model too large for unified memory, a wired-memory limit set too low, or memory pressure from other apps. macOS reserves 25-30% for the system; the rest is your AI budget.
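Raising the wired limit is a common workaround; the value below is an example and resets on reboot:

```bash
# Let the GPU wire more of unified memory (value in MB);
# leave several GB for macOS itself
sudo sysctl iogpu.wired_limit_mb=24576
```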
Ollama: 'model not found' / 'pull manifest unknown' errors
Ollama 'model not found' errors trace to typos in the model name, pulling a model that doesn't exist in the official registry, network blocks on the registry, or pulling from a custom registry without auth.
HuggingFace download failed / 401 / rate-limit / network error
HuggingFace download errors split into auth (gated model, no token), rate-limit (anonymous traffic capped), or network (corporate proxy, country block). Diagnose by HTTP status code, fix per cause.
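The auth fix in two lines (the token shown is a placeholder):

```bash
# Interactive login stores the token for future downloads
huggingface-cli login

# Or set it per-run for scripts and CI
export HF_TOKEN=hf_xxxxxxxxxxxx
```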
Tensor parallelism: NCCL crash / 'unable to allocate' / 'distributed init failed'
Multi-GPU tensor-parallel crashes trace to NCCL backend issues (PCIe topology, missing peer access), insufficient GPU pair memory, or tensor-parallel-size not matching GPU count. Diagnose with NCCL_DEBUG=INFO.
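A minimal sketch of the debug run; the model name and GPU count are examples:

```bash
# Surface the real NCCL failure instead of the generic crash
export NCCL_DEBUG=INFO

# tensor-parallel-size must match the number of visible GPUs
vllm serve meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2
```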
ExLlamaV2: model not loading / 'Could not find model index' / cache OOM
ExLlamaV2 load failures trace to the wrong model format (it needs EXL2 or EXL3, not GGUF), insufficient cache for the context size, or a driver/runtime version mismatch. An EXL2/EXL3 quant is non-negotiable.
Quantized model: noticeable quality loss / repetition / coherence drop
Output quality drop after quantization usually means the bpw is too aggressive, KV cache quantization is too low, or the calibration data didn't match the model. Q4_K_M is the safe floor; below that needs care.
Tokenizer mismatch / 'Unknown token' / 'Token ID out of range'
Tokenizer errors usually mean the loaded tokenizer doesn't match the model weights, the chat template is wrong, or special tokens (BOS/EOS) weren't preserved through quantization. Verify tokenizer config first.
CUDA driver too old / 'CUDA driver version is insufficient'
If PyTorch / vLLM / CUDA app errors with 'driver version insufficient,' your NVIDIA driver predates the CUDA runtime. Driver 555+ supports CUDA 12.4 (the 2026 standard). Update via nvidia.com or distro.
Python: wheel build failed / 'Failed building wheel for X'
Wheel build failures in pip install almost always trace to: missing compiler (gcc / MSVC), missing system headers (Python.h, CUDA), or a Rust-based package without the Rust toolchain. Fix compiler first, then verify wheel availability.
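The usual Debian/Ubuntu fix, plus a fast way to learn whether a prebuilt wheel even exists (the package name is a placeholder):

```bash
# Compiler and Python headers cover most native build failures
sudo apt install build-essential python3-dev

# Fails fast if no prebuilt wheel is published for your platform
pip install --only-binary=:all: some-package
```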
safetensors: 'header validation failed' / 'invalid format'
Safetensors header errors mean the file is corrupted, partially downloaded, or isn't actually a safetensors file. Check the file size against the repo, re-download if it doesn't match, and use a download tool that verifies checksums.
Model keeps crashing / segfault during inference
Mid-inference crashes (segfault, illegal memory access, kernel panic) usually mean failing VRAM (ECC errors), thermal throttling, PSU instability, or a bad model file. Here's the diagnostic order.
Windows LLM install failed / Python CUDA not found
Why first-time Windows AI installs fail, how to fix each link in the driver-CUDA-Python chain, and the specific download links that actually work.
Model loads but generation is slow / tok/s far below expectation
When the model loads (no OOM) but token generation is far below expected speeds, the bottleneck is usually VRAM paging, KV cache overcommit, or GPU contention. Here's how to diagnose and fix each.
Token generation too slow / low throughput across runtimes
Slow token generation across multiple runtimes (not specific to Ollama or vLLM) means a system-level bottleneck: GPU underutilization, missing flash-attention, wrong thread count, thermal throttle, or VRAM paging.
ComfyUI CUDA out of memory
ComfyUI-specific CUDA OOM: what triggers it (loaded checkpoints, IPAdapter/ControlNet overhead, missing --lowvram), how to fix it, and the ComfyUI settings that matter.
vLLM worker crashed / vLLM scheduler crash
vLLM worker/scheduler crashes: KV cache fraction misconfiguration, max-model-len exceeding VRAM, worker timeouts, NCCL failures, and quant incompatibility. The exact fix order that production operators use.
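The two flags behind most of these crashes; the model placeholder and values are starting points, not gospel:

```bash
# Cap the KV cache fraction and the context length to what actually fits
vllm serve <model> --gpu-memory-utilization 0.85 --max-model-len 8192
```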
TensorRT-LLM build failed / TensorRT-LLM compilation failed
TensorRT-LLM compilation/build failures: missing CUDA arch flag, version mismatches, Python wheel OOM, and NVCC compute capability issues. Honest advice: for most users, vLLM is the saner path.
ONNX Runtime falls back to CPU / ONNX Runtime GPU not used
ONNX Runtime silently falls back to CPU even with a GPU present. Fix the provider registration, package choice, and model export to get GPU inference working.
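A minimal check, assuming the pip install path:

```bash
# onnxruntime and onnxruntime-gpu conflict; only the GPU package has CUDA
pip uninstall -y onnxruntime
pip install onnxruntime-gpu

# CUDAExecutionProvider must appear here, and your session must request it
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
```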
bitsandbytes: CUDA error / 'CUDA Setup failed despite GPU being available'
bitsandbytes silently breaks after PyTorch or NVIDIA driver updates. The fix is usually a reinstall with the right CUDA version, or switching to a prebuilt wheel. Here's the diagnostic order.
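The built-in self-check usually names the problem outright:

```bash
# Prints which CUDA libraries bitsandbytes found (or didn't)
python -m bitsandbytes

# If it reports a version mismatch, reinstall against the current stack
pip install --upgrade --force-reinstall bitsandbytes
```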
HuggingFace 429 Too Many Requests / rate limit exceeded
HuggingFace returns HTTP 429 when you exceed the anonymous rate limit. A free account + token raises your ceiling dramatically. Here's exactly what triggers it, how to authenticate, and how to batch downloads so you never hit it again.
GGUF corrupt on disk / 'invalid magic number' / 'failed to read model file'
A corrupt GGUF file fails with cryptic magic-number or read errors. Here's how to validate the file without loading it, identify the corruption, and re-download only the damaged parts.
WSL: systemd not running / 'System has not been booted with systemd as init'
WSL2 defaults to a non-systemd init for speed. For Docker, NVIDIA Container Toolkit, and multi-service AI stacks, you need systemd enabled. Here's how to turn it on and verify it's running.
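The switch lives in /etc/wsl.conf inside the distro:

```bash
# /etc/wsl.conf:
#   [boot]
#   systemd=true

# From PowerShell: wsl --shutdown, reopen the distro, then verify
ps -p 1 -o comm=    # should print "systemd", not "init"
```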
FlashAttention build failed on Windows / 'nvcuda.dll not found' / MSVC linker errors
FlashAttention compilation on Windows is the most common build failure in the local AI stack. The three real fixes: a prebuilt wheel, WSL2, or switching to SDPA.
Don't see your error?
We're building the troubleshooting library by the highest-volume queries first. If you're hitting an error that isn't covered, the diagnostic patterns here usually transfer: check VRAM headroom, check thermals, check driver versions, check the model file. Most local AI failures fall in those four buckets.