Token generation is slow everywhere — debug the system bottleneck
Slow token generation across multiple runtimes (not specific to Ollama or vLLM) means a system-level bottleneck: GPU underutilization, missing flash-attention, wrong thread count, thermal throttle, or VRAM paging.
Diagnostic order — most likely first
GPU not reaching full utilization
`nvidia-smi -l 1` during generation shows GPU utilization below 80%. If utilization bounces (spikes to 95% then drops to 20%), the GPU is starved by the CPU — model layers aren't getting to the GPU fast enough.
Increase batch size. In vLLM, raise `--max-num-seqs`. In llama.cpp, raise `-b` (batch size) and tune `-t` (threads); note that `-n` sets the number of tokens to generate, not the batch size. If loading from a spinning disk, move the model to an NVMe SSD: HDD model-load latency creates gaps in the GPU pipeline.
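If you want the utilization pattern in numbers rather than eyeballing `nvidia-smi -l 1`, here is a minimal sketch that samples GPU utilization for 30 seconds during generation and reports the average and the swing. It assumes `nvidia-smi` is on PATH and only looks at the first GPU.

```python
# Minimal sketch: sample GPU utilization once a second for 30 s during generation.
# A large swing (e.g. 95% peaks with 20% dips) points to the CPU/IO side starving the GPU.
import subprocess
import time

samples = []
for _ in range(30):
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    samples.append(int(out.splitlines()[0]))  # first GPU only
    time.sleep(1)

print(f"avg {sum(samples)/len(samples):.0f}%  min {min(samples)}%  max {max(samples)}%")
```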
Flash-attention not installed or not enabled
Check the runtime's startup log first; most runtimes log a warning like 'flash-attention not found, falling back to SDPA.' You can also monitor VRAM during long-context runs: without flash-attention the fallback attention path materializes the full attention matrix, so memory use climbs much faster with context length than the KV cache alone would explain.
Install flash-attention: `pip install flash-attn --no-build-isolation` (needs the CUDA Toolkit, plus MSVC on Windows). In vLLM, set `VLLM_ATTENTION_BACKEND=FLASH_ATTN`. In Transformers: `model = AutoModelForCausalLM.from_pretrained(..., attn_implementation='flash_attention_2')`. Flash-attention avoids materializing the full attention matrix, which sharply cuts attention-stage VRAM overhead and speeds attention by 2-4x on long context.
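A minimal sketch of the Transformers path, assuming a placeholder checkpoint name: it checks whether `flash_attn` is importable and falls back to SDPA if not, so the same script runs on machines without the package.

```python
# Minimal sketch: load with flash-attention 2 when available, otherwise fall back to SDPA.
# The checkpoint name is a placeholder; swap in the model you actually run.
import importlib.util
import torch
from transformers import AutoModelForCausalLM

attn = "flash_attention_2" if importlib.util.find_spec("flash_attn") else "sdpa"
print(f"Using attention implementation: {attn}")

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    torch_dtype=torch.float16,           # flash-attn requires fp16/bf16 weights
    attn_implementation=attn,
    device_map="auto",                   # needs accelerate installed
)
```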
CPU thread count misconfigured
CPU utilization during generation is either pegged at 100% on all cores (oversubscribed — threads too high) or stuck at 25% (undersubscribed — threads too low). Wrong thread count creates a CPU bottleneck that starves the GPU.
Set threads to the physical core count, not the logical (hyperthreaded) count; on hybrid Intel chips, count only the P-cores. For a 16-core (8P + 8E) Intel, use `-t 8` in llama.cpp; for a 12-core AMD, use `-t 12`. Hyperthreading (SMT) can help tokenize/decode but tends to hurt prompt processing. Test both settings and benchmark.
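Rather than guessing, you can sweep thread counts with `llama-bench` (ships with llama.cpp) and keep whichever wins. A minimal sketch, assuming `llama-bench` is on PATH and the model path is yours:

```python
# Minimal sketch: run llama-bench at several thread counts and compare its tok/s output.
# -p is prompt length (prefill), -n is generation length (decode).
import subprocess

MODEL = "models/model-q4_k_m.gguf"  # placeholder path
for threads in (4, 6, 8, 12, 16):
    print(f"--- threads = {threads} ---")
    subprocess.run(
        ["llama-bench", "-m", MODEL, "-t", str(threads), "-p", "512", "-n", "128"],
        check=True,
    )
```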
GPU thermal throttling under sustained inference
Tok/s starts at expected speed (e.g., 40 tok/s) then drops steadily over 2-5 minutes to half speed. `nvidia-smi -q -d TEMPERATURE` shows GPU temp hitting 83-87°C, with clocks dropping from ~2.5 GHz to ~1.2 GHz.
Improve case airflow. Set a more aggressive custom fan curve in MSI Afterburner. Undervolt the GPU or lower its power limit (e.g., to 80% in Afterburner): you lose ~5% peak speed but eliminate throttling, which nets higher sustained throughput. Clean dust from the heatsink.
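To catch throttling as it happens, log temperature, SM clock, and power draw once a second while generating. A minimal sketch, assuming `nvidia-smi` is on PATH (stop with Ctrl+C):

```python
# Minimal sketch: log GPU temperature, SM clock, power draw, and utilization every second.
# Falling clocks alongside rising temperature is the thermal-throttle signature.
import subprocess
import time

QUERY = "temperature.gpu,clocks.sm,power.draw,utilization.gpu"
while True:
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(time.strftime("%H:%M:%S"), out)  # e.g. "84, 1350 MHz, 310.45 W, 98 %"
    time.sleep(1)
```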
VRAM paging (model + KV cache exceed card memory)
`nvidia-smi` shows VRAM at 100%. GPU utilization is low (50-70%) on the compute engine but high on the memory controller. System RAM usage is elevated. The runtime is offloading layers and KV cache to system RAM, and that paging kills throughput.
Reduce context length, drop to a lower quant (e.g., Q5_K_M to Q4_K_M), or switch to a smaller model that fits entirely in VRAM. For multi-GPU systems, enable tensor parallelism to split the model across cards.
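For the multi-GPU case, here is a minimal vLLM sketch. The checkpoint name and the exact numbers are assumptions; the point is the three knobs that keep the model plus KV cache inside VRAM.

```python
# Minimal sketch: split the model across GPUs and cap the KV cache so nothing pages to RAM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder checkpoint
    tensor_parallel_size=2,        # split weights across 2 GPUs
    max_model_len=8192,            # cap context so the KV cache fits
    gpu_memory_utilization=0.90,   # leave headroom for activations
)

out = llm.generate(
    ["Explain the transformer architecture in detail"],
    SamplingParams(max_tokens=128),
)
print(out[0].outputs[0].text)
```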
Frequently asked questions
What tok/s should I expect from my GPU?
Rough baseline at Q4_K_M, 4K context: RTX 3060 12 GB on 7B = 40-60 tok/s, 13B = 25-35 tok/s. RTX 3090/4090 on 7B = 80-120 tok/s, 13B = 50-70 tok/s, 70B = 15-25 tok/s. RTX 5090 on 70B = 25-40 tok/s. If you're at 50% or less of these numbers, something is bottlenecking.
Why is prompt processing slow but token generation is fast?
Prompt processing (prefill) is compute-bound — the model processes the entire prompt in parallel. If it's slow, check: (1) flash-attention not enabled, (2) CPU thread count too low (prompt processing uses more CPU in llama.cpp), (3) batch size capped. Token generation (decode) is memory-bandwidth-bound — tok/s is fundamentally limited by VRAM bandwidth.
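To see what "bandwidth-bound" implies in numbers, here is a rough back-of-envelope sketch: at batch size 1, every generated token streams the full set of weights from VRAM, so tok/s is bounded by bandwidth divided by model size. The bandwidth and model-size figures below are approximate assumptions for illustration.

```python
# Back-of-envelope decode ceiling: tok/s <= VRAM bandwidth / bytes read per token.
# Real-world throughput lands well below this (kernel overhead, KV cache reads, sampling).
def decode_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

print(decode_ceiling(936, 4.1))  # RTX 3090 (~936 GB/s), 7B Q4_K_M (~4 GB) -> ~228 tok/s ceiling
print(decode_ceiling(360, 4.1))  # RTX 3060 (~360 GB/s), 7B Q4_K_M (~4 GB) -> ~88 tok/s ceiling
```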
Should I upgrade my CPU or GPU to improve token generation speed?
GPU first, always. Token generation is memory-bandwidth-bound — a card with more VRAM bandwidth (RTX 4090 at 1,008 GB/s vs 3060 at 360 GB/s) gives proportionally faster tok/s for same-model generations. CPU matters for prompt processing and batch decode, but the GPU is the primary limiter.
How do I benchmark tok/s properly across different runtimes?
Use a standardized prompt (e.g., 'Explain the transformer architecture in detail') at 4K context, generate 512 tokens, and measure wall-clock time. Repeat 3 times and take the median to smooth cold-start variance. For llama.cpp: `./llama-cli -m model.gguf -p '...' -n 512` prints prompt-eval and eval tok/s in its timing summary. For vLLM: use the `/v1/completions` endpoint and compute `usage.completion_tokens / elapsed_time`. For Ollama: `ollama run <model> --verbose` prints the eval rate in tok/s.
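A minimal sketch of the vLLM measurement described above: it times three `/v1/completions` calls and computes tok/s from `usage.completion_tokens`. The URL and model name are assumptions; point them at your own server.

```python
# Minimal sketch: benchmark tok/s against a running vLLM OpenAI-compatible server.
import statistics
import time
import requests

URL = "http://localhost:8000/v1/completions"  # default vLLM server address
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    "prompt": "Explain the transformer architecture in detail",
    "max_tokens": 512,
    "temperature": 0.0,
}

rates = []
for _ in range(3):
    start = time.perf_counter()
    resp = requests.post(URL, json=payload, timeout=600).json()
    elapsed = time.perf_counter() - start
    rates.append(resp["usage"]["completion_tokens"] / elapsed)

print(f"median: {statistics.median(rates):.1f} tok/s  runs: {[round(r, 1) for r in rates]}")
```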
Why does my token generation speed vary by 2-3x on the same prompt?
Token generation is memory-bandwidth-bound, so consistent speed is the normal expectation. Variation usually means: (1) background processes competing for VRAM (Chrome, Discord, the Windows compositor), (2) thermal throttling (check `nvidia-smi -q -d TEMPERATURE`; if the GPU hits 85°C+ clocks drop), (3) power-limit throttling (the card's power limit is lower than sustained inference draws, common on OEM cards and laptops), (4) run-to-run variance in batch size or CPU-side scheduling. Run `nvidia-smi -l 1` during the variable runs and watch for clock-speed drops.
Does Mac or PC have better tok/s-per-dollar in 2026?
PC wins on $/tok for inference. RTX 3090 ($700 used) delivers ~12-15 tok/s on 70B Q4 vs M4 Max (~$2,000 Mac) at ~7-9 tok/s. But Mac wins on the integrated experience: unified memory means no VRAM ceiling anxiety, quieter operation, lower power draw, and a machine that's also a great laptop. For budget-maximizing inference, used 3090 + Linux. For the holistic experience, Apple Silicon. The gap narrows every generation.
Related troubleshooting
When the model loads (no OOM) but token generation is far below expected speeds, the bottleneck is usually VRAM paging, KV cache overcommit, or GPU contention. Here's how to diagnose and fix each.
Ollama silently falls back to CPU when it can't load a model into VRAM. Here's how to confirm the fallback, force GPU usage, and pick a model that actually fits.
When the fix is hardware
A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: