
CUDA OOM that only happens at long context (KV cache blowup)

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate
By Fredoline Eruo · Last verified May 8, 2026

Cause

Model loads fine and runs short prompts, then OOMs partway into a long conversation or once the prompt grows past a threshold. This is KV cache pressure: KV memory grows linearly with context length, so a runner that grows the cache on demand blows past free VRAM mid-generation, while a runner that pre-allocates reserves the worst case for the configured max_model_len up front.

Quick check: KV bytes per token = 2 × num_layers × num_kv_heads × head_dim × bytes_per_element (2 for FP16). For Llama 3.1 8B (32 layers, 8 KV heads, head_dim 128) that works out to 128 KB per token, or ~128 MB per 1K tokens at FP16. At 128K context, that's 16 GB just for KV, more than the model weights themselves at Q4.
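
To sanity-check those numbers for another model, here is a minimal Python sketch of the same arithmetic. kv_cache_bytes is a made-up helper for illustration, not part of any runner; the Llama 3.1 8B figures are its published config values.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # 2 = one K and one V entry per layer, per KV head, per head_dim element, per token
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * context_len

# Llama 3.1 8B: 32 layers, 8 KV heads, head_dim 128, FP16 cache (2 bytes/element)
print(kv_cache_bytes(32, 8, 128, 1024) / 2**20)        # ~128 MiB per 1K tokens
print(kv_cache_bytes(32, 8, 128, 128 * 1024) / 2**30)  # ~16 GiB at 128K context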

Solution

1. Lower the served context length to something realistic for your VRAM:

# vLLM
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 16384

# llama.cpp / llama-server
./llama-server -m model.gguf -c 16384

# Ollama (Modelfile; rebuild with: ollama create <name> -f Modelfile)
FROM llama3.1:8b
PARAMETER num_ctx 16384

2. Quantize the KV cache. vLLM, llama.cpp, and SGLang support FP8 or 4-bit KV; FP8 roughly halves and 4-bit roughly quarters cache memory relative to FP16, usually with minimal quality impact:

# vLLM
vllm serve ... --kv-cache-dtype fp8

# llama.cpp
./llama-server -m model.gguf -c 32768 --cache-type-k q8_0 --cache-type-v q8_0
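# note: depending on the llama.cpp build, a quantized V cache may require flash attention (add -fa)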

3. Pick a model with GQA. Models with grouped-query attention (num_kv_heads << num_attention_heads) have a 4-8× smaller KV cache. Llama 3.1, Qwen 2.5, Mistral 7B, and Mistral Nemo all use GQA; older Llama 2 7B/13B use full multi-head attention and do not.
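
If you're not sure whether a model uses GQA, one quick check is to read its config (assumes the transformers library is installed; gated repos such as Llama need a Hugging Face token, but any local config.json with these fields works too):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# GQA when num_key_value_heads < num_attention_heads; the ratio is roughly the KV cache savings
print(cfg.num_attention_heads, cfg.num_key_value_heads)  # 32, 8 -> ~4x smaller cache than full MHA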

4. Pre-flight with /will-it-run to compute the max context that fits before you start the server.
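
For a rough offline version of that check, invert the same per-token cost against whatever VRAM is left after weights and runtime overhead. The function name and the 6 GB figure below are illustrative only:

def max_context_that_fits(vram_budget_bytes, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # token budget = free bytes divided by KV bytes per token
    return vram_budget_bytes // (2 * num_layers * num_kv_heads * head_dim * bytes_per_elem)

# e.g. roughly 6 GB free after Q4 weights on a 12 GB card, Llama 3.1 8B, FP16 KV
print(max_context_that_fits(6 * 2**30, 32, 8, 128))  # 49152 tokens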

Alternative solutions

If you must keep long context: rent an H100 80GB hourly, run the long job, and terminate. vllm serve --enable-prefix-caching plus a sticky session helps amortize the shared-prefix KV cache across requests.

Related errors

  • Ollama: model requires more system memory than is available
  • vLLM AsyncEngineDeadError after large batch / OOM
  • Process killed (OOM killer) when loading large model
  • CUDA out of memory when loading a model
  • vLLM: No available KV cache blocks

Did this fix it?

If your case was different, email support@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.