CUDA out of memory when loading a model
Cause
The model you're loading needs more VRAM than your card has free. This is the single most common error in local AI. Causes:
- Model size (weights + KV cache + activation buffers) exceeds VRAM
- Another process is holding VRAM (background browser tab, prior Python session)
- The runner doesn't handle your quantization format efficiently (some runners pad Q4 weights to 8-bit in memory), so actual VRAM use is higher than the file size suggests
- Context window set higher than VRAM can support
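To see why context size matters as much as the weights, here is the standard back-of-envelope for KV cache size. The layer and head counts below are typical for an older-style 7B model without grouped-query attention; your model's numbers will differ, and GQA models use far less:
# KV cache bytes ≈ 2 (K and V) x layers x kv_heads x head_dim x context length x bytes per value
# 7B-class model (32 layers, 32 KV heads, head dim 128) at fp16:
#   4K context:  2 x 32 x 32 x 128 x 4096  x 2 ≈ 2.1 GB
#   32K context: 2 x 32 x 32 x 128 x 32768 x 2 ≈ 17 GB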
Solution
1. Free other VRAM. Close browser tabs (Chrome eats ~1 GB), close other AI apps, and kill stale Python processes. nvidia-smi shows what's holding VRAM; kill the offender with kill <PID>.
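For example, on Linux (or WSL) with the NVIDIA driver installed, where <PID> is whatever process ID nvidia-smi reports:
# list the processes holding VRAM and how much each one uses
nvidia-smi
# stop a stale process by its PID (add -9 only if a plain kill doesn't work)
kill <PID>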
2. Use a smaller quantization. If you're on Q5_K_M or Q8_0, drop to Q4_K_M. The quality loss is real but small; the VRAM savings are 30-50%.
# Ollama
ollama pull qwen2.5:7b-instruct-q4_K_M
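To confirm the swap actually helped, ollama list shows the on-disk size of each tag you have pulled, which is a reasonable proxy for the VRAM the weights will need. The q8_0 tag below is just an example of an older, larger pull you might remove:
# compare sizes of the quantizations you have pulled
ollama list
# free disk space by removing the larger one
ollama rm qwen2.5:7b-instruct-q8_0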
3. Reduce context window. A 7B model at 4K context fits in 8 GB; the same model at 32K context needs 12+ GB because of KV cache growth.
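How you cap the context depends on the runner. As a sketch, assuming recent versions of each tool: llama.cpp takes --ctx-size on the command line, and Ollama lets you set num_ctx for the current session from its interactive prompt.
# llama.cpp: limit the context window to 4096 tokens
./main --model model.gguf --ctx-size 4096
# Ollama: start the model, then set num_ctx for this session
ollama run qwen2.5:7b-instruct-q4_K_M
>>> /set parameter num_ctx 4096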
4. Use CPU offload. Move some layers to system RAM. Speed drops but the model fits.
# llama.cpp: put 28 of the model's layers on the GPU, keep the rest on the CPU
./main --n-gpu-layers 28 --model model.gguf
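Start with most layers on the GPU and lower the count until the model loads; each layer moved to the CPU costs speed but buys VRAM. (Recent llama.cpp builds ship the binary as llama-cli rather than main.) Ollama exposes the same idea through its num_gpu parameter; treat the exact layer count below as something to tune, not a recommendation:
# Ollama: offload only 28 layers to the GPU, run the rest from system RAM
ollama run qwen2.5:7b-instruct-q4_K_M
>>> /set parameter num_gpu 28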
5. Pick a smaller model. Use the Will it run? tool to find a model that fits comfortably on your hardware instead of fighting one that doesn't.
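If you want a quick sanity check before downloading anything, a very rough rule of thumb for Q4_K_M GGUFs (an approximation, not a guarantee):
# weights ≈ 0.6 GB per billion parameters at Q4_K_M, plus 1-2 GB for KV cache and buffers
#  7B:  7 x 0.6 + 1.5 ≈ 5.7 GB -> comfortable on an 8 GB card
# 14B: 14 x 0.6 + 1.5 ≈ 9.9 GB -> too tight for 8 GB, fine on 12 GB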
Alternative solutions
If you're on macOS, or the error showed up partway through a long-running session rather than at load time: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True sometimes recovers fragmented memory. A restart is usually faster.
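If you want to try it, set the variable in the environment of the process that loads the model; your_script.py below is a placeholder for whatever actually launches the model:
# ask PyTorch's CUDA allocator to use expandable segments, which reduces fragmentation
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python your_script.py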
Did this fix it?
If your case was different, email hello@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.