
CUDA out of memory — fix the actual problem

Why CUDA OOM happens during local LLM inference and image gen, how to confirm the real cause, and the four real fixes (smaller quant, shorter context, clearing stale VRAM, or more VRAM).

NVIDIA CUDA · PyTorch · vLLM · ComfyUI · Ollama
By Fredoline Eruo · Last verified 2026-05-08

Diagnostic order — most likely first

#1

Model weights + KV cache + activations exceed VRAM

Diagnose

Run `nvidia-smi` while loading. Watch VRAM climb until OOM hits. The number it reports is the actual ceiling. If you're trying to load a 70B Q4 (~40 GB) on a 24 GB card, that's the issue.
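If you want to watch that climb from inside the process instead of a second terminal, a minimal PyTorch sketch like the one below works; `load_model()` is a stand-in for whatever call actually loads your weights.

```python
# Sketch: poll device-level VRAM (the same number nvidia-smi reports)
# while the model loads. Assumes PyTorch with a CUDA device;
# load_model() is a placeholder for your runtime's actual loading call.
import threading
import time

import torch

def watch_vram(stop, interval=0.5):
    while not stop.is_set():
        free, total = torch.cuda.mem_get_info()  # bytes, whole device
        print(f"VRAM in use: {(total - free) / 1024**3:.1f} / {total / 1024**3:.1f} GB")
        time.sleep(interval)

stop = threading.Event()
threading.Thread(target=watch_vram, args=(stop,), daemon=True).start()
try:
    model = load_model()  # placeholder -- swap in your loader
finally:
    stop.set()
```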

Fix

Drop to a smaller quant (Q4_K_M → Q3_K_M trims the weight footprint by roughly 20%). Or switch to a smaller model. Or shorten the context window.
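The arithmetic behind that call is worth doing before you download anything. A rough sketch, assuming typical average bits-per-weight for llama.cpp K-quants (exact figures vary slightly by model):

```python
# Rough weight footprint for a given parameter count and quant.
# Bits-per-weight values are approximate averages for llama.cpp K-quants.
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

for name, bpw in [("Q8_0", 8.5), ("Q4_K_M", 4.85), ("Q3_K_M", 3.9)]:
    print(f"70B {name}: ~{weight_gb(70, bpw):.0f} GB of weights alone")

# ~69 GB, ~40 GB and ~32 GB respectively. KV cache and activations
# come on top of this, so compare against your card's VRAM with headroom.
```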

#2

KV cache grows with context length

Diagnose

Error fires only on long inputs. Try the same prompt at 2K context — works. At 16K — OOM. That's KV cache.
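You can estimate the cache directly. A sketch of the standard formula, assuming a 70B-class config with grouped-query attention (80 layers, 8 KV heads, head dim 128, fp16 cache); your model's `config.json` has the real values:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim
#                  * seq_len * bytes_per_element
# Defaults below are illustrative (roughly a Llama-3-70B-style config with GQA);
# read layers / kv_heads / head_dim from your model's config.json.
def kv_cache_gb(seq_len, layers=80, kv_heads=8, head_dim=128, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per / 1024**3

for ctx in (2_048, 8_192, 16_384, 32_768):
    print(f"{ctx:>6} tokens: ~{kv_cache_gb(ctx):.1f} GB of KV cache")

# Growth is linear in context: the same model that leaves headroom at 2K
# can tip over the VRAM ceiling at 16K without any change to the weights.
```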

Fix

Lower `--ctx-size` (llama.cpp) or `--max-model-len` (vLLM); in plain Transformers the cache grows with the tokens you actually feed it, so cap prompt length rather than touching `max_position_embeddings`. FlashAttention, if your runtime supports it, trims attention activation memory but does not shrink the KV cache itself; quantizing the cache (llama.cpp's `--cache-type-k`/`--cache-type-v`, vLLM's fp8 KV cache) roughly halves it.
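In vLLM the cap is set when you construct the engine. A sketch with a placeholder model name; parameter names are from recent vLLM releases, so check your version:

```python
from vllm import LLM

# Cap the context the engine pre-allocates KV cache for, and leave some
# VRAM headroom for activations. Model name is a placeholder.
llm = LLM(
    model="your-model-here",
    max_model_len=8192,           # hard cap on context length
    gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may claim
    # kv_cache_dtype="fp8",       # optional: halves KV cache where supported
)
```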

#3

VRAM fragmentation from previous runs

Diagnose

Reload the model fresh; OOM fires immediately even though it 'worked yesterday.' Check `nvidia-smi` for a stale process from a crashed or hung run still holding VRAM, or a long-lived process (notebook kernel, server) that never released its allocations.

Fix

Restart the runtime / reboot, or run `nvidia-smi --gpu-reset` (needs root, and is only safe when no other process is using the GPU). For PyTorch specifically, call `torch.cuda.empty_cache()` between runs.
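In a long-lived process (notebook kernel, API server) the cleanup looks roughly like this; it assumes the old model is only referenced by a variable named `model`:

```python
import gc

import torch

# Drop every reference to the old model, then return cached blocks to the
# driver before loading the next one.
del model                 # assumes `model` holds the last reference
gc.collect()
torch.cuda.empty_cache()
torch.cuda.ipc_collect()  # also reclaims memory held for CUDA IPC

free, total = torch.cuda.mem_get_info()
print(f"free after cleanup: {free / 1024**3:.1f} GB of {total / 1024**3:.1f} GB")
```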

#4

Other GPU consumers (browser tabs, OBS, games) holding VRAM

Diagnose

`nvidia-smi` shows VRAM allocated to processes you didn't expect. Chrome with hardware acceleration can hold 1-2 GB.
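If you'd rather check programmatically, the NVML bindings (`pip install nvidia-ml-py`) expose the same per-process numbers `nvidia-smi` prints. A minimal sketch for GPU 0:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Compute processes (CUDA apps) plus graphics processes (browsers, OBS, games).
procs = (pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
         + pynvml.nvmlDeviceGetGraphicsRunningProcesses(handle))
for p in procs:
    mib = (p.usedGpuMemory or 0) / 1024**2  # can be None without permissions
    print(f"pid {p.pid}: {mib:.0f} MiB")

pynvml.nvmlShutdown()
```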

Fix

Kill the other GPU consumers. Or pin the AI runtime to a different card (e.g. `CUDA_VISIBLE_DEVICES=1`) if you have multiple GPUs.

#5

Genuinely insufficient VRAM for the workload

Diagnose

None of the above applies: you've already cut quant and context, and the model legitimately doesn't fit. This is the honest answer about 60% of the time.

Fix

Buy more VRAM. The cheapest path to 24 GB is a used RTX 3090 ($700-1,000). The cheapest new path to 16 GB is the RTX 4060 Ti 16 GB ($450-550).

If this keeps happening — the next decision is hardware

If you keep hitting CUDA out-of-memory on the same workload, the fix is hardware. The guides below frame where 16 GB stops being enough and 24 GB becomes the right answer.

Frequently asked questions

What VRAM do I need to avoid CUDA OOM on a 70B model?

For 70B Q4 inference at typical context (4-8K), 24 GB VRAM is the working minimum. 16 GB cards (4060 Ti 16 GB, 4070 Ti Super) can only run 70B Q4 with heavy offload to system RAM, even at very short context (~2K), and you'll OOM at typical agent context windows. 32 GB+ (RTX 5090) gives comfortable headroom.

Can I add VRAM to my GPU?

No. VRAM is soldered. The only way to 'add VRAM' is to buy a card with more, or add a second card and split the model across GPUs (tensor parallel in vLLM and ExLlamaV2; layer or row split in llama.cpp).
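In vLLM, for example, splitting one model across two cards is a single argument. The model name below is a placeholder, and `tensor_parallel_size` has to match the number of GPUs you want to use:

```python
from vllm import LLM

# Shard the weights (and KV cache) across two GPUs with tensor parallelism.
llm = LLM(
    model="your-70b-model",   # placeholder
    tensor_parallel_size=2,   # must equal the number of participating GPUs
)
```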

Does CUDA OOM mean I bought the wrong GPU?

Sometimes. If you're on 8-12 GB and trying to run 13B+ models, yes — the math doesn't work. If you're on 24 GB and only OOMing at 32K+ context, no, you just need to lower context or move to a 32 GB card.

Why does nvidia-smi show VRAM available but PyTorch still OOMs?

PyTorch's caching allocator can fragment VRAM. The total free might be 4 GB, but the largest contiguous block might only be 1 GB — and PyTorch can't use what it can't reserve as a single allocation. `torch.cuda.empty_cache()` helps; restarting the process always helps.
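You can see the gap from inside the process. A quick sketch; the `expandable_segments` mitigation mentioned in the comment needs a reasonably recent PyTorch:

```python
import torch

# allocated = live tensors; reserved = what the caching allocator is holding.
# A large reserved-minus-allocated gap alongside allocation failures is the
# fragmentation case described above.
allocated = torch.cuda.memory_allocated() / 1024**3
reserved = torch.cuda.memory_reserved() / 1024**3
print(f"allocated {allocated:.1f} GB, reserved {reserved:.1f} GB")
print(torch.cuda.memory_summary(abbreviated=True))

# One mitigation (set in the environment *before* the process starts):
#   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```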

Can I avoid CUDA OOM by using Q2 or IQ2 quants?

Technically yes, but you're trading one problem for another. Below Q3, perplexity degrades sharply and coherence drops on long-context tasks. Q2 can cut VRAM by another 25-30% vs Q4_K_M, but model output quality suffers noticeably. Use Q3_K_M as the aggressive floor — if the model still doesn't fit at Q3_K_M on an acceptable context window, the honest answer is more VRAM.

Does a 16 GB card actually work for 70B models?

Only at Q3_K_M with context capped to 2K tokens — and even then it's fragile. 70B Q4_K_M is roughly 40 GB on disk and needs ~45 GB at runtime including KV cache. A 16 GB card can hold about 35% of that; the rest spills to system RAM (CPU offload), which drops tok/s from 15-20 to 2-4. The practical floor for comfortable 70B inference is 24 GB VRAM.

Related troubleshooting

When the fix is hardware

A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time.