Out of memory specifically at long context lengths

torch.cuda.OutOfMemoryError or 'cannot allocate KV cache' at >32K tokens
By Fredoline Eruo · Last verified May 6, 2026

Cause

KV cache memory grows linearly with context length. A model that runs comfortably at 4K context can OOM at 32K because the cache grew eightfold, say from 1 GB to 8 GB.

The math: KV cache bytes = 2 × num_layers × num_kv_heads × head_dim × context_length × bytes_per_element. For Llama 3.1 8B at 32K context with an FP16 cache, that works out to ~4 GB for the KV cache alone, on top of the weights.
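You can sanity-check the formula in a shell. The model config below is an assumption for Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128); verify against your model's actual config:

```shell
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * context * bytes/elem
# Assumed Llama 3.1 8B config: 32 layers, 8 KV heads, head_dim 128; FP16 = 2 bytes/elem
bytes=$((2 * 32 * 8 * 128 * 32768 * 2))
echo "$bytes bytes = $((bytes / 1024 / 1024 / 1024)) GiB"
# prints: 4294967296 bytes = 4 GiB
```

Swap in your own model's layer count, KV head count, and head dimension to size any setup.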

Solution

Quantize the KV cache — biggest single win:

# llama.cpp — INT8 KV cache roughly halves cache memory vs FP16
./llama-cli -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn

# Or INT4 KV (more aggressive, slight quality cost)
./llama-cli -m model.gguf --cache-type-k q4_0 --cache-type-v q4_0 --flash-attn

Note: llama.cpp requires Flash Attention to be enabled when the V cache is quantized, hence --flash-attn above.
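To see what the switch buys you against the ~4 GB FP16 figure above: q8_0 stores roughly 1 byte per element and q4_0 roughly half that (block scale factors add a small overhead we ignore here). A rough sketch, using the same assumed Llama 3.1 8B config (32 layers, 8 KV heads, head_dim 128):

```shell
# FP16 KV cache at 32K context for the assumed config
fp16=$((2 * 32 * 8 * 128 * 32768 * 2))
echo "f16:  $((fp16 / 1024 / 1024)) MiB"        # 4096 MiB
echo "q8_0: $((fp16 / 2 / 1024 / 1024)) MiB"    # ~2048 MiB
echo "q4_0: $((fp16 / 4 / 1024 / 1024)) MiB"    # ~1024 MiB
```

Real figures will be slightly higher because quantized formats carry per-block scales.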

Enable Flash Attention if not already on (some runners default it off):

./llama-cli -m model.gguf --flash-attn

Use a smaller working context. Just because a model "supports 128K" doesn't mean you have to run it there; set the context to what your task actually needs.
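Because growth is linear, dropping from 32K to 8K context cuts the cache to a quarter. Using the same assumed Llama 3.1 8B config as above:

```shell
# KV cache size at two context lengths (assumed: 32 layers, 8 KV heads, head_dim 128, FP16)
for ctx in 8192 32768; do
  echo "$ctx tokens: $((2 * 32 * 8 * 128 * ctx * 2 / 1024 / 1024)) MiB"
done
# prints:
# 8192 tokens: 1024 MiB
# 32768 tokens: 4096 MiB
```

In llama.cpp the working context is set with `-c` (e.g. `-c 8192`), regardless of the model's trained maximum.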

Move to a model designed for long-context efficiency — Mistral Small 3, Llama 4 Scout (10M context with native efficiency), or Qwen 3 with its sliding window mode.

More VRAM is the only real fix for very-long-context workloads. Calculate your specific scenario at /will-it-run — pick a context where the prediction shows reasonable headroom, not the model's maximum.

Did this fix it?

If your case was different, email hello@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.