CUDA out of memory when loading a model

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate X.XX GiB
By Fredoline Eruo · Last verified May 6, 2026

Cause

The model you're loading needs more VRAM than your card has free. This is the single most common error in local AI. Causes:

  • Model size (weights + KV cache + activation buffers) exceeds VRAM
  • Another process is holding VRAM (background browser tab, prior Python session)
  • Quantization too aggressive for the runner you're using (some runners pad to 8-bit even for Q4 models)
  • Context window set higher than VRAM can support
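The first bullet can be sized up with quick arithmetic. A back-of-envelope sketch, using hypothetical 7B-class dimensions (4-bit weights, 32 layers, 8 KV heads, head dim 128, 4K context) — substitute your model's actual numbers:

```shell
# Rough VRAM budget: weights + KV cache (activation buffers add a bit more).
# All dimensions below are illustrative for a ~7B model at 4-bit quantization.
PARAMS_M=7000     # parameters, in millions
BITS=4            # quantization width (Q4 variants are ~4.5 bits in practice)
LAYERS=32
KV_HEADS=8
HEAD_DIM=128
CTX=4096

WEIGHTS_MB=$(( PARAMS_M * BITS / 8 ))
# K and V (x2), fp16 (2 bytes), per layer, per KV head, per head dim, per token:
KV_MB=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * CTX * 2 / 1024 / 1024 ))
echo "weights ~${WEIGHTS_MB} MB, KV cache ~${KV_MB} MB"
```

At these assumed dimensions that is roughly 3.5 GB of weights plus 0.5 GB of KV cache before activation buffers — which is why a 7B Q4 model is comfortable on 8 GB but tight on 6 GB.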

Solution

1. Free other VRAM. Close GPU-heavy browser tabs (Chrome's GPU process can hold ~1 GB), close other AI apps, and kill stale Python processes: nvidia-smi shows what's using VRAM; end the offender with kill <PID>.
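Step 1 in command form, for NVIDIA GPUs (the PID is whatever nvidia-smi reports on your machine — <PID> is a placeholder):

```shell
# List the processes holding VRAM and how much each is using
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

# End a stale process by its PID from the list above
kill <PID>
```

Prefer plain kill (SIGTERM) first; only reach for kill -9 if the process ignores it.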

2. Use a smaller quantization. If you're on Q5_K_M or Q8_0, drop to Q4_K_M. The quality loss is real but small; the VRAM savings are 30-50%.

# Ollama
ollama pull qwen2.5:7b-instruct-q4_K_M
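The 30-50% figure falls out of the bit widths. A sketch for an assumed 7B-parameter model (weights only; real GGUF files run slightly larger because K-quants carry metadata):

```shell
# Weight footprint scales with bits per parameter: MB ~= params(M) * bits / 8
Q8_MB=$(( 7000 * 8 / 8 ))   # Q8_0
Q5_MB=$(( 7000 * 5 / 8 ))   # ~Q5_K_M
Q4_MB=$(( 7000 * 4 / 8 ))   # ~Q4_K_M
echo "Q8 ~${Q8_MB} MB, Q5 ~${Q5_MB} MB, Q4 ~${Q4_MB} MB"
```

Dropping from Q8_0 to Q4_K_M roughly halves the weights; Q5 down to Q4 saves about 20%.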

3. Reduce context window. A 7B model at 4K context fits in 8 GB; the same model at 32K context needs 12+ GB because of KV cache growth.
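The KV-cache growth in step 3 is linear in context length. A sketch with assumed 7B-class dimensions (32 layers, 8 KV heads, head dim 128, fp16 cache):

```shell
# KV cache MB for a context length: 2 (K and V) x layers x kv_heads
# x head_dim x tokens x 2 bytes (fp16). Dimensions are illustrative.
kv_mb() { echo $(( 2 * 32 * 8 * 128 * $1 * 2 / 1024 / 1024 )); }

echo "4K context:  $(kv_mb 4096) MB"
echo "32K context: $(kv_mb 32768) MB"
```

Eight times the context means eight times the cache: ~0.5 GB becomes ~4 GB, which is where most of the jump from 8 GB to 12+ GB comes from.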

4. Use CPU offload. Move some layers to system RAM. Speed drops but the model fits.

# llama.cpp (newer builds name the binary llama-cli; older ones used main)
./llama-cli --model model.gguf --n-gpu-layers 28
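A rough way to pick the --n-gpu-layers value (all numbers below are hypothetical; check nvidia-smi for your actual free VRAM and the model card for its layer count):

```shell
# Layers that fit ~= free VRAM / per-layer weight size
MODEL_MB=3500        # e.g. a 7B model at Q4
TOTAL_LAYERS=32
FREE_MB=3000         # free VRAM after other processes

PER_LAYER_MB=$(( MODEL_MB / TOTAL_LAYERS ))
GPU_LAYERS=$(( FREE_MB / PER_LAYER_MB ))
echo "try --n-gpu-layers ${GPU_LAYERS}"
```

Start a few layers below the estimate to leave room for the KV cache, then raise it until loading fails.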

5. Pick a smaller model. Use the Will it run? tool to find a model that fits comfortably on your hardware instead of fighting one that doesn't.

Alternative solutions

If the error appears partway through a long-running session rather than at first load, VRAM may be fragmented rather than full: setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True before launching sometimes recovers it, though restarting the process is usually faster. Note that this particular error is CUDA-specific; macOS has no CUDA, so Apple Silicon users hitting memory limits see a Metal/MPS error instead.
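The allocator setting only takes effect if it is in the environment before PyTorch initializes CUDA, i.e. before the process starts (the script name below is a placeholder):

```shell
# Must be exported before torch initializes CUDA in the child process
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
python your_script.py   # placeholder: whatever loads the model
```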

Did this fix it?

If your case was different, email hello@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.