
llama.cpp slow — when GPU isn't actually doing the work

If llama.cpp tok/s is 5-10x lower than expected on your GPU, the build probably defaulted to CPU, the model is partially CPU-offloaded, or flash-attention isn't enabled. Diagnose in 60 seconds with --verbose.

llama.cpp · NVIDIA CUDA · AMD ROCm · Apple Metal · Vulkan backend
By Fredoline Eruo · Last verified 2026-05-08

Diagnostic order — most likely first

#1

Build defaulted to CPU (GPU flag missing or build failed silently)

Diagnose

Run `./llama-cli --help` and check the backend list. If you don't see `cuda` / `metal` / `hip` / `vulkan` listed, the build is CPU-only.
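A 30-second check, assuming a recent build (exact log strings vary by version): load any GGUF and look for a GPU backend init line at the top of the output. The model path here is a placeholder.

```
./llama-cli -m ./models/model.gguf -p "hi" -n 1 2>&1 | head -n 40
# CPU-only build: no GPU init lines at all.
# CUDA build prints something like:  ggml_cuda_init: found 1 CUDA devices
# Metal build prints something like: ggml_metal_init: ...
```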

Fix

Rebuild with the right flag: `cmake -B build -DGGML_CUDA=ON` (or GGML_METAL=ON / GGML_HIP=ON / GGML_VULKAN=ON). Wipe the build dir first to avoid stale CMakeCache: `rm -rf build`.
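A minimal rebuild sequence for CUDA, run from the llama.cpp repo root (swap the flag for your backend):

```
rm -rf build                      # drop the stale CPU-only CMakeCache
cmake -B build -DGGML_CUDA=ON     # or -DGGML_METAL=ON / -DGGML_HIP=ON / -DGGML_VULKAN=ON
cmake --build build --config Release -j
./build/bin/llama-cli --version   # sanity check: make sure this is the binary you actually run
```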

#2

Layers not offloaded to GPU (--n-gpu-layers / -ngl too low)

Diagnose

llama.cpp doesn't auto-offload all layers. Without `-ngl 999` (or model-specific count), layers stay on CPU. `nvidia-smi` shows VRAM usage low; CPU usage high during generation.
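The load log states it outright. A quick grep, assuming a recent build (the exact prefix of the log line varies by version):

```
./llama-cli -m ./models/model.gguf -p "hi" -n 1 2>&1 | grep -i offloaded
# Bad:  ... offloaded 0/33 layers to GPU
# Good: ... offloaded 33/33 layers to GPU
```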

Fix

Pass `-ngl 999` to push all layers to GPU. For models that don't fit, pass a number that fits VRAM and accept partial offload. Watch VRAM during load to verify.
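A typical invocation plus a VRAM watch in a second terminal (model path is a placeholder); used VRAM should jump by roughly the model file size at load:

```
./llama-cli -m ./models/model-q4_k_m.gguf -ngl 999 -p "Explain KV cache" -n 256
# Second terminal, refresh every second:
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```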

#3

Flash-attention not enabled

Diagnose

Long-context generation is slower than expected. `--verbose` doesn't mention flash-attention being active.
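On the builds I've checked, the context-setup log includes a flash_attn field (exact prefix varies by version), so a grep settles it:

```
./llama-cli -m ./models/model.gguf -ngl 999 -fa -p "hi" -n 1 2>&1 | grep -i flash
# Expect a line like: flash_attn = 1   (0 means off or unsupported on this backend)
```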

Fix

Add the `-fa` flag (flash-attention). It avoids materializing the full attention matrix, which cuts attention memory overhead at long context and speeds decode 20-40% on supported hardware (RTX 30/40/50-series, RDNA 3+, Apple M-series). It's also a prerequisite for quantizing the V half of the KV cache (see #6).
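To verify the gain on your own hardware, llama-bench can A/B the flag in one run; a sketch assuming a recent build where `-fa` accepts a 0/1 value list:

```
# Benchmark prefill (512 tok) and decode (128 tok) with flash-attention off vs on:
./llama-bench -m ./models/model-q4_k_m.gguf -ngl 999 -fa 0,1 -p 512 -n 128
# For day-to-day use, just add -fa to your llama-cli / llama-server invocation.
```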

#4

Model file is too large for VRAM (paging from disk)

Diagnose

Model loads but generation is brutally slow (1-3 tok/s). `nvidia-smi` shows VRAM at 100%; disk activity high during inference.

Fix

Use a smaller quant: Q4_K_M is roughly 15-20% smaller than Q5_K_M, and about a quarter the size of FP16. Use a smaller model. Or add VRAM by upgrading the GPU.
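A quick fits-or-not check before downloading anything new. Rule of thumb (an assumption; real headroom varies with context size and backend): model file size plus 1-2 GB for KV cache and compute buffers should fit inside total VRAM.

```
ls -lh ./models/model-q4_k_m.gguf                  # model file size
nvidia-smi --query-gpu=memory.total --format=csv   # total VRAM on the card
```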

#5

Number of threads misconfigured for prefill

Diagnose

Prefill (processing the prompt) is slow even though decode is fast. Default thread count may not match your CPU.

Fix

Set `-t <physical-cores>` (not logical/SMT cores). For a Ryzen 7 7700X that's `-t 8`. On Apple M-series the default is usually optimal. Avoid setting threads higher than the physical core count; oversubscription hurts more than it helps.
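On Linux you can derive the physical core count from lscpu's parseable output and pass it straight through (single-socket assumption; on Apple M-series just omit `-t`):

```
CORES=$(lscpu -p=CORE | grep -v '^#' | sort -u | wc -l)   # unique core IDs = physical cores, SMT siblings excluded
./llama-cli -m ./models/model-q4_k_m.gguf -ngl 999 -t "$CORES" -p "hello"
```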

#6

Running quantized model with FP16 KV cache

Diagnose

Long-context inference saturates VRAM faster than expected. KV cache at FP16 uses 2x the memory of Q8_0.
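Back-of-envelope sizing, using standard transformer KV math rather than anything llama.cpp-specific: bytes ≈ 2 (K and V) x layers x context x KV heads x head dim x bytes per element.

```
# Example: Llama-2-7B (32 layers, 32 KV heads, head_dim 128) at 8192 context, FP16 (2 bytes/elem):
echo $(( 2 * 32 * 8192 * 32 * 128 * 2 ))   # 4294967296 bytes = 4 GiB; q8_0 halves that to ~2 GiB
```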

Fix

Use `--cache-type-k q8_0 --cache-type-v q8_0` to quantize KV cache. Saves 50% of context-related VRAM with minimal quality impact.
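A full command combining this with the earlier fixes. Note: on the llama.cpp builds I'm aware of, quantizing the V half of the cache requires flash-attention to be on, hence the `-fa`; double-check against your version.

```
./llama-cli -m ./models/model-q4_k_m.gguf -ngl 999 -fa \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 8192 -p "hello"
```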

Frequently asked questions

What's a normal llama.cpp tok/s on my hardware?

Rough ranges (Q4_K_M, -ngl 999, -fa):
RTX 4090: 7B ~120 t/s · 13B ~70 · 70B ~12-15
RTX 3090: 7B ~95 · 13B ~55 · 70B ~10-12
M4 Max: 7B ~85 · 13B ~45 · 70B ~7-9
If you're 5-10x lower than these, the GPU isn't doing the work.

Should I use llama.cpp or vLLM for serving?

llama.cpp for solo / dev workflows + cross-platform compatibility. vLLM for production multi-user serving (paged KV cache + continuous batching). At 10+ concurrent users, vLLM's throughput is 3-5x llama.cpp.

Does llama.cpp support tensor-parallel multi-GPU?

Yes via `--split-mode row` (or `layer` for layer-split). Performance scales 1.5-1.8x on dual-GPU. ExLlamaV2 / vLLM scale better (1.8-1.9x) but llama.cpp is more portable.
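A dual-GPU sketch (the flags are real; the 1,1 ratio is just an example for two equal cards): `--split-mode row` splits weight matrices across GPUs, `--tensor-split` sets each GPU's share.

```
./llama-cli -m ./models/model-70b-q4_k_m.gguf -ngl 999 \
  --split-mode row --tensor-split 1,1 -fa -p "hello"
```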

Related troubleshooting

When the fix is hardware

A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: