CUDA out of memory when loading a model
Cause
The model you're loading needs more VRAM than your card has free. This is the single most common error in local AI. Causes:
- Model size (weights + KV cache + activation buffers) exceeds VRAM
- Another process is holding VRAM (background browser tab, prior Python session)
- The runner doesn't handle your quantization format efficiently (some runners pad Q4 weights to 8-bit in memory), so actual VRAM use is higher than the file size suggests
- Context window set higher than VRAM can support
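To see why context size matters as much as the weights, here is the standard back-of-envelope for KV cache size. The layer and head counts below are typical for an older-style 7B model without grouped-query attention; your model's numbers will differ, and GQA models use far less:
# KV cache bytes ≈ 2 (K and V) x layers x kv_heads x head_dim x context length x bytes per value
# 7B-class model (32 layers, 32 KV heads, head dim 128) at fp16:
#   4K context:  2 x 32 x 32 x 128 x 4096  x 2 ≈ 2.1 GB
#   32K context: 2 x 32 x 32 x 128 x 32768 x 2 ≈ 17 GB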
Solution
1. Free other VRAM. Close browser tabs (Chrome eats ~1 GB), close other AI apps, and kill stale Python processes. nvidia-smi shows what's holding VRAM; kill the offender with kill <PID>.
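For example, on Linux (or WSL) with the NVIDIA driver installed, where <PID> is whatever process ID nvidia-smi reports:
# list the processes holding VRAM and how much each one uses
nvidia-smi
# stop a stale process by its PID (add -9 only if a plain kill doesn't work)
kill <PID>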
2. Use a smaller quantization. If you're on Q5_K_M or Q8_0, drop to Q4_K_M. The quality loss is real but small; the VRAM savings are 30-50%.
# Ollama
ollama pull qwen2.5:7b-instruct-q4_K_M
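To confirm the swap actually helped, ollama list shows the on-disk size of each tag you have pulled, which is a reasonable proxy for the VRAM the weights will need. The q8_0 tag below is just an example of an older, larger pull you might remove:
# compare sizes of the quantizations you have pulled
ollama list
# free disk space by removing the larger one
ollama rm qwen2.5:7b-instruct-q8_0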
3. Reduce context window. A 7B model at 4K context fits in 8 GB; the same model at 32K context needs 12+ GB because of KV cache growth.
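How you cap the context depends on the runner. As a sketch, assuming recent versions of each tool: llama.cpp takes --ctx-size on the command line, and Ollama lets you set num_ctx for the current session from its interactive prompt.
# llama.cpp: limit the context window to 4096 tokens
./main --model model.gguf --ctx-size 4096
# Ollama: start the model, then set num_ctx for this session
ollama run qwen2.5:7b-instruct-q4_K_M
>>> /set parameter num_ctx 4096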
4. Use CPU offload. Move some layers to system RAM. Speed drops but the model fits.
# llama.cpp: put 28 of the model's layers on the GPU, keep the rest on the CPU
./main --n-gpu-layers 28 --model model.gguf
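Start with most layers on the GPU and lower the count until the model loads; each layer moved to the CPU costs speed but buys VRAM. (Recent llama.cpp builds ship the binary as llama-cli rather than main.) Ollama exposes the same idea through its num_gpu parameter; treat the exact layer count below as something to tune, not a recommendation:
# Ollama: offload only 28 layers to the GPU, run the rest from system RAM
ollama run qwen2.5:7b-instruct-q4_K_M
>>> /set parameter num_gpu 28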
5. Pick a smaller model. Use the Will it run? tool to find a model that fits comfortably on your hardware instead of fighting one that doesn't.
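If you want a quick sanity check before downloading anything, a very rough rule of thumb for Q4_K_M GGUFs (an approximation, not a guarantee):
# weights ≈ 0.6 GB per billion parameters at Q4_K_M, plus 1-2 GB for KV cache and buffers
#  7B:  7 x 0.6 + 1.5 ≈ 5.7 GB -> comfortable on an 8 GB card
# 14B: 14 x 0.6 + 1.5 ≈ 9.9 GB -> too tight for 8 GB, fine on 12 GB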
Alternative solutions
If you're on macOS, or the error showed up partway through a long-running session rather than at load time: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True sometimes recovers fragmented memory. A restart is usually faster.
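If you want to try it, set the variable in the environment of the process that loads the model; your_script.py below is a placeholder for whatever actually launches the model:
# ask PyTorch's CUDA allocator to use expandable segments, which reduces fragmentation
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python your_script.py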
Did this fix it?
If your case was different, email hello@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.