
Model loads but tok/s is terrible — find the bottleneck

When the model loads (no OOM) but token generation is far below expected speeds, the bottleneck is usually VRAM paging, KV cache overcommit, or GPU contention. Here's how to diagnose and fix each.

NVIDIA CUDA · AMD ROCm · llama.cpp · Ollama · vLLM · LM Studio
By Fredoline Eruo · Last verified 2026-05-08

Diagnostic order — most likely first

#1

Model exceeds VRAM and is paging from system RAM

Diagnose

Run `nvidia-smi -l 1` during inference. If VRAM is pegged at 100% and GPU utilization oscillates (not steady 85-98%), the runtime is swapping layers to system RAM. Token rate drops 10-50x when paging.
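If you'd rather not stare at the terminal, a small polling script can catch the pattern for you. This is a minimal sketch that assumes `nvidia-smi` is on PATH and a single GPU at index 0; the 97% and 40-point thresholds are illustrative, not calibrated.

```python
# Minimal polling sketch: samples GPU utilization and VRAM once a second and
# flags the "VRAM full, utilization bouncing" pattern described above.
# Assumes nvidia-smi is on PATH and one GPU at index 0; thresholds are guesses.
import subprocess
import time

def sample():
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    ).strip().splitlines()[0]
    util, used, total = (float(x) for x in out.split(", "))
    return util, used, total

utils, used, total = [], 0.0, 1.0
for _ in range(30):                         # watch for ~30 s while generating
    util, used, total = sample()
    utils.append(util)
    print(f"util {util:5.1f}%   vram {used:.0f}/{total:.0f} MiB")
    time.sleep(1)

if used / total > 0.97 and max(utils) - min(utils) > 40:
    print("Pattern matches VRAM paging: memory pegged, utilization oscillating.")
```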

Fix

Drop to a smaller quant (Q4_K_M → Q3_K_M or IQ3_XXS), or switch to a smaller model. Alternatively, cap the GPU layer count explicitly (Ollama's `num_gpu` parameter, llama.cpp's `-ngl` flag) so a fixed subset of layers stays resident in VRAM and the rest runs from system RAM at a predictable, if slower, rate.
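Picking the layer count is mostly arithmetic. The sketch below is a rough estimator, not something the runtimes ship: you supply the GGUF file size and layer count (both visible in the model card or the load log), and the reserve for KV cache and CUDA context is a guess rather than a measured value.

```python
# Rough estimator for the GPU layer count (-ngl / num_gpu): divide the VRAM
# you can spare by the approximate per-layer size of the GGUF file.
def gpu_layers(model_file_gb: float, n_layers: int,
               vram_gb: float, reserve_gb: float = 1.5) -> int:
    per_layer_gb = model_file_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)   # hold back room for KV cache etc.
    return min(n_layers, int(usable_gb / per_layer_gb))

# Example: a 13B Q4_K_M file (~7.9 GB, 40 layers) on a 12 GB card
print(gpu_layers(7.9, 40, 12.0))   # 40 -> everything fits, full offload to GPU
```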

#2

KV cache grows with context and pushes model weights to RAM

Diagnose

Speed is fine for short prompts (2K tokens) but degrades sharply at 8K+. `nvidia-smi` shows VRAM usage climbing with each message sent.
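The growth is easy to estimate. The sketch below uses illustrative architecture numbers for a 13B-class model without grouped-query attention and an fp16 cache; real models (especially GQA ones) cache less per token, but the linear scaling with context is the point.

```python
# Back-of-envelope KV cache growth with context length. Numbers are
# illustrative for a 13B-class model without GQA (40 layers, 40 KV heads,
# head_dim 128) and an fp16 cache.
def kv_cache_gb(ctx_tokens: int, n_layers: int = 40, n_kv_heads: int = 40,
                head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    # two tensors per layer (K and V), each n_kv_heads * head_dim wide per token
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return ctx_tokens * per_token_bytes / 1024**3

for ctx in (2048, 4096, 8192, 16384):
    print(f"{ctx:>6} tokens -> {kv_cache_gb(ctx):5.1f} GB of KV cache")
# Roughly 1.6 GB at 2K and 6.2 GB at 8K: enough to start evicting model
# layers from a 12-16 GB card.
```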

Fix

Lower the context size (`--ctx-size 4096` in llama.cpp, `num_ctx` in Ollama, `max_model_len` in vLLM). Enable flash attention if your runtime supports it (`-fa` in llama.cpp); it trims attention overhead and, combined with a quantized KV cache where the runtime offers one, can cut the cache footprint by roughly 30-50%.
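With Ollama you can also cap the context per request rather than per model. A minimal sketch against the local REST API (default port 11434); the model name is a placeholder for whatever you actually run, and llama.cpp/vLLM take the equivalent limit as a server flag instead.

```python
# Cap the context per request through Ollama's local REST API (default port
# 11434). Model name is a placeholder; llama.cpp and vLLM take the equivalent
# limit as a server flag (--ctx-size / --max-model-len).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",           # placeholder; use your local model tag
        "prompt": "Summarize why long context slows generation, in two sentences.",
        "stream": False,
        "options": {"num_ctx": 4096},     # bound the KV cache for this request
    },
    timeout=300,
)
print(resp.json()["response"])
```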

#3

Other GPU consumers holding VRAM (Chrome, OBS, Discord)

Diagnose

`nvidia-smi` shows processes besides your runtime holding 1-3 GB of VRAM. Chrome with hardware acceleration is the usual suspect.
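To get a per-process breakdown programmatically, the NVML bindings list graphics clients (Chrome, OBS) as well as compute clients. A sketch assuming the `nvidia-ml-py` package (`pip install nvidia-ml-py`) and GPU index 0:

```python
# List every process holding VRAM on GPU 0, including graphics clients that
# compute-only queries miss. Assumes the nvidia-ml-py bindings are installed.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

procs = (pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
         + pynvml.nvmlDeviceGetGraphicsRunningProcesses(handle))
for p in procs:
    raw = pynvml.nvmlSystemGetProcessName(p.pid)
    name = raw.decode() if isinstance(raw, bytes) else raw
    mem_mb = (p.usedGpuMemory or 0) / 1024**2   # None on some driver/WSL setups
    print(f"{p.pid:>7}  {mem_mb:8.0f} MiB  {name}")

pynvml.nvmlShutdown()
```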

Fix

Close GPU-heavy apps. Disable hardware acceleration in Chrome (Settings > System). Consider a second cheap GPU (GT 1030) as the display adapter so your AI card runs headless with full VRAM available.

#4

VRAM fragmentation from repeated load/unload cycles and mismatched allocations

Diagnose

Model loads fine on fresh boot but slows down after the 3rd+ load/unload cycle. PyTorch's CUDA allocator can fragment VRAM over repeated allocations.

Fix

Restart the runtime between model switches. In PyTorch, drop all references to the old model, run `gc.collect()`, then `torch.cuda.empty_cache()` to hand cached blocks back to the driver (`torch.cuda.reset_peak_memory_stats()` only resets the tracking counters; it frees nothing). For llama.cpp/Ollama: stop and restart the server.
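For PyTorch-based runtimes, the release pattern looks like this. The `model` tensor is a stand-in so the snippet runs on its own, and the allocator environment variable is an option on recent PyTorch builds, not a requirement.

```python
# Minimal sketch of the unload-then-release pattern for a PyTorch runtime.
import gc
import torch

model = torch.empty(4096, 4096, device="cuda")   # stand-in for a loaded model

del model                       # drop every Python reference you hold
gc.collect()                    # let Python release the underlying tensors
torch.cuda.empty_cache()        # hand cached blocks back to the CUDA driver

print(f"{torch.cuda.memory_allocated() / 1024**3:.3f} GiB still allocated")
# reset_peak_memory_stats() only clears bookkeeping counters; it frees nothing.
# If fragmentation persists across many loads, restart the process, or on
# recent PyTorch launch with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
```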

#5

Wrong model size for your card — you're outside the efficient-fit envelope

Diagnose

A 70B Q4 on a 12 GB card will always page. The quant-chooser math: model params (billions) × bytes per param × 1.25 (KV buffer) = min VRAM in GB. At Q4, 4 bits ÷ 8 = 0.5 bytes per param, so 70B Q4 needs 70 × 0.5 × 1.25 = 43.75 GB. You have 12 GB. Offloading 30+ GB to RAM is why tok/s is single-digit.

Fix

Match the model to your VRAM. 7-13B models for 6-8 GB cards. 13-34B for 12-16 GB. A 70B needs roughly 48 GB to fit fully at Q4; on a single 24 GB card it only fits at aggressive 2-3-bit quants or with heavy CPU offload. Or run the large models on a cloud GPU.

Frequently asked questions

How slow is 'too slow' for token generation?

Reading speed (~10-15 tok/s) is acceptable for chat. Below 5 tok/s, the experience degrades — you're waiting for the model. Below 2 tok/s, the model is critically paging and will lose coherence on long responses. A well-fit model on a modern GPU should hit 20-60+ tok/s for 7-13B, 10-30 tok/s for 70B at Q4.
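To measure your own number rather than guess, time a generation against whatever local server you run. Most of them (llama.cpp server, vLLM, LM Studio, Ollama) expose an OpenAI-compatible endpoint; the URL, port, and model name below are placeholders.

```python
# Time one generation against a local OpenAI-compatible endpoint and report
# tok/s. URL, port, and model name are placeholders for your own setup.
import time
import requests

URL = "http://localhost:8080/v1/chat/completions"   # adjust port for your server

start = time.perf_counter()
resp = requests.post(URL, json={
    "model": "local-model",                          # placeholder model name
    "messages": [{"role": "user", "content": "Write ~300 words about GPUs."}],
    "max_tokens": 400,
}, timeout=600).json()
elapsed = time.perf_counter() - start

generated = resp["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
# Note: elapsed includes prompt processing, so this slightly understates pure
# generation speed; it is still the rate you experience while waiting.
```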

Does dual-GPU help with token speed?

Yes, if your runtime supports multi-GPU splits (vLLM tensor parallelism, llama.cpp layer or row splits). Splitting a quantized 70B across two 24 GB cards avoids paging entirely and gets you 20-30 tok/s instead of 3-5 tok/s. Dual mismatched cards (24 GB + 12 GB) work, but with a symmetric tensor-parallel split the smaller card sets the per-GPU budget, so part of the larger card's VRAM goes unused.
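For vLLM, the split is a single parameter. A sketch of the offline API follows; the model ID is a placeholder, and for two 24 GB cards it would need to point at a 4-bit (AWQ/GPTQ) 70B checkpoint, which vLLM picks up from the checkpoint's config.

```python
# Sketch of a two-GPU tensor-parallel load with vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/llama-70b-instruct-awq",   # placeholder quantized checkpoint
    tensor_parallel_size=2,                    # split each layer's tensors across 2 GPUs
    max_model_len=8192,                        # keep the KV cache bounded
    gpu_memory_utilization=0.90,
)
out = llm.generate(["Explain tensor parallelism in one paragraph."],
                   SamplingParams(max_tokens=200))
print(out[0].outputs[0].text)
```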

Can I speed up token generation without buying a new GPU?

Drop the quant (Q4_K_M → IQ3_XXS for 70B, nearly halves VRAM with minimal quality loss). Shorten context to 4096. Enable flash-attention. Use speculative decoding (llama.cpp supports it via a draft model). If all else fails, rent a cloud GPU for large-model tasks and run small models locally.

What's the minimum VRAM for a usable experience with popular models in 2026?

Practical tiers: 6-8 GB — 7B Q4_K_M at short context, fine for summarization + drafting. 12 GB — 13B Q4_K_M at 4K-8K context, comfortable for coding assistants + chat. 16 GB — 13B at 32K+ context, or 32B at 3-bit quants and short context. 24 GB — 32B Q4 at long context, or 70B at aggressive 2-3-bit quants; the current single-card sweet spot for agent workflows. Below 6 GB, stick to 3B models or quantized 7B at Q2 (noticeable quality loss). 48 GB and up (or multi-GPU) — full 70B at Q4-Q5 or multi-user serving.

How do I calculate if a model will fit in VRAM before downloading it?

Rough formula: `model_params_billions × quant_bytes × 1.25 (KV cache buffer) = VRAM_GB`. Q4 = 4 bits ÷ 8 = 0.5 bytes per param. Q8 = 1.0 bytes. FP16 = 2.0 bytes. Example: 13B Q4 → 13 × 0.5 × 1.25 = 8.1 GB. 70B Q4 → 70 × 0.5 × 1.25 = 43.8 GB. Add 1-2 GB for runtime overhead. This is approximate but gets you within 10% of actual.
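The same formula as a throwaway helper, using nominal byte widths per quant and a flat overhead allowance. K-quants land a bit above their nominal bits per weight, so treat the result as a floor.

```python
# VRAM-fit estimate: params (billions) * bytes per param * 1.25 + overhead.
QUANT_BYTES = {"Q2": 0.25, "Q3": 0.375, "Q4": 0.5, "Q5": 0.625,
               "Q6": 0.75, "Q8": 1.0, "FP16": 2.0}

def min_vram_gb(params_billions: float, quant: str, overhead_gb: float = 1.5) -> float:
    return params_billions * QUANT_BYTES[quant] * 1.25 + overhead_gb

for params, quant in [(13, "Q4"), (34, "Q4"), (70, "Q4"), (70, "Q2")]:
    print(f"{params}B {quant}: ~{min_vram_gb(params, quant):.1f} GB")
# 13B Q4: ~9.6 GB   34B Q4: ~22.8 GB   70B Q4: ~45.2 GB   70B Q2: ~23.4 GB
```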

Should I buy a 16 GB card now or save for 24 GB?

If your use case is 7B-13B coding assistants, short-context chat, or image generation (SDXL/Flux), 16 GB is completely adequate. If you want to run 70B models at comfortable context lengths or plan to do agent-style work with long prompts, save for 24 GB — the 16 GB card will frustrate you within months. The RTX 4060 Ti 16 GB ($450-550) is the current budget sweet spot; the used RTX 3090 ($700-1,000) is the best $/GB ratio at 24 GB.

Related troubleshooting

When the fix is hardware

A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: