Ollama is slow — diagnose CPU fallback in 3 minutes
Ollama silently falls back to CPU when it can't load a model into VRAM. Here's how to confirm the fallback, force GPU usage, and pick a model that actually fits.
Diagnostic order — most likely first
Ollama loaded the model on CPU because it didn't fit in VRAM
Run `ollama ps` while a model is loaded. If the `PROCESSOR` column shows `100% CPU` or a split like `50%/50% CPU/GPU`, the model is partially or fully on CPU. Expect 1-5 tok/s instead of 30-80.
Use a smaller quant. `ollama pull llama3.1:70b-q4_K_M` instead of `llama3.1:70b-q5_K_M`. Or switch to a smaller model. The tier-by-tier picks live in our buyer guide.
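A minimal sketch of the check-and-fix loop, assuming Ollama is running and a model is loaded; the tag shown is an example, so confirm exact names on the model's Ollama library page:

```bash
# Show what's loaded and how it's split between CPU and GPU
ollama ps

# Flag any CPU share automatically (column layout can vary slightly by version)
ollama ps | grep -i 'cpu' && echo "CPU fallback detected: use a smaller quant or model"

# Drop to a smaller quant and reload
ollama pull llama3.1:70b-instruct-q4_K_M
```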
GPU drivers / CUDA version mismatch
`ollama serve` logs show 'no compatible GPUs found' or 'CUDA library not found.' `nvidia-smi` works in your shell but Ollama can't see the GPU.
Reinstall Ollama after updating the NVIDIA driver, then restart the Ollama service so GPU detection runs again. Ollama bundles its own CUDA runtime, so on Linux the full `nvidia-cuda-toolkit` package usually isn't required; what matters is that the driver is new enough for the CUDA branch your Ollama build targets. The same order applies on Windows: update the driver first, then reinstall or update Ollama.
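On the standard Linux systemd install (an assumption; adjust for Docker or manual setups), the detection check and restart look roughly like this:

```bash
# Driver-level view: if this fails, the problem is the driver, not Ollama
nvidia-smi

# What Ollama detected at startup
journalctl -u ollama --no-pager | grep -iE 'cuda|gpu' | tail -n 20

# After a driver update or reinstall, restart the service so detection runs again
sudo systemctl restart ollama
```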
Model file is too large to fit fully in VRAM
A 70B Q4 model (~40 GB on disk) on a 24 GB card. Ollama offloads the layers that fit to the GPU and runs the rest on the CPU out of system RAM. tok/s is brutal.
Match model size to VRAM. For 24 GB: 32B Q4-Q5 (a 70B Q4 is ~40 GB and will always spill to CPU). For 16 GB: 13B-14B Q4. For 12 GB: 7B-13B Q4.
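A rough fit check against the tiers above; `ollama list` reports sizes as Ollama sees them on disk, and the nvidia-smi query shows what the card can actually hold:

```bash
# Installed models and their on-disk sizes
ollama list

# Total and currently free VRAM
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
# Rule of thumb: model size + KV cache + ~1 GB of overhead must stay under memory.total
```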
Context window too large for VRAM headroom
The model loads fine but generation is slow. The KV cache for a 32K context on a 70B is ~6-8 GB; if you didn't budget for it, Ollama offloads fewer layers to the GPU to make room and the rest lands on the CPU.
Set `num_ctx` in your Modelfile, or lower the server-wide default via the `OLLAMA_CONTEXT_LENGTH` env var on recent builds. 4K-8K is fine for most chat; 32K+ on a large model needs a 32 GB card.
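A minimal Modelfile sketch that caps the context window; the base tag and the new model name are examples:

```bash
cat > Modelfile <<'EOF'
FROM llama3.1:70b-instruct-q4_K_M
PARAMETER num_ctx 8192
EOF

ollama create llama3.1-70b-8k -f Modelfile
ollama run llama3.1-70b-8k
```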
Other GPU consumers holding VRAM
`nvidia-smi` shows Chrome / OBS / a game eating 4 GB.
Close the other consumers, or pin Ollama to a dedicated GPU by setting `CUDA_VISIBLE_DEVICES` to that card's index.
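To see who is holding VRAM and to pin Ollama to a specific card (GPU index 1 below is an example):

```bash
# The process table at the bottom of the output lists everything holding VRAM
nvidia-smi

# Pin a manually launched server to one card
CUDA_VISIBLE_DEVICES=1 ollama serve
# For the Linux service install, add the variable via `sudo systemctl edit ollama` instead
```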
If tok/s is consistently below expectations even with the model fully on the GPU, the bottleneck is usually memory bandwidth, not driver tuning. The guide below frames the hardware decision for Ollama specifically.
Frequently asked questions
How do I tell if Ollama is using my GPU?
Run `ollama ps` while a model is loaded. The `PROCESSOR` column tells you the split. 100% GPU = all good. Anything with CPU in it = partial or full CPU fallback. You can also run `nvidia-smi` and look for the `ollama` process in the list.
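A quick cross-check from the driver side; the exact process name varies by Ollama version:

```bash
# If nothing matches while a model is generating, Ollama isn't touching the GPU
nvidia-smi | grep -i ollama
```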
What's a 'good' tok/s for Ollama on a 24 GB GPU?
Typical numbers on an RTX 3090 / 4090: 7B Q4 ~80-120 tok/s, 13B Q4 ~50-70, 32B Q4 ~25-35. A 70B Q4 doesn't fit in 24 GB, so expect low single digits on one card unless you split it across GPUs. If you're 5-10x below these, you're on CPU fallback.
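To measure your own numbers instead of guessing, `ollama run --verbose` prints timing stats after each response; the tag is an example:

```bash
# "eval rate" in the summary is your generation tok/s
ollama run llama3.1:8b-instruct-q4_K_M "Explain mmap in two sentences." --verbose
```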
Can I make Ollama run faster without buying new hardware?
Sometimes. A smaller quant (Q4_K_M → Q3_K_M), a shorter context window, flash attention (set `OLLAMA_FLASH_ATTENTION=1`), and closing other GPU consumers can reclaim 20-50% throughput. But if you're trying to run a 70B on 12 GB of VRAM, no software fix helps: the model doesn't fit.
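The server-side switches, set before starting Ollama; KV-cache quantization is only honored on builds that support it and requires flash attention:

```bash
export OLLAMA_FLASH_ATTENTION=1      # reduces KV-cache overhead, biggest win at long context
export OLLAMA_KV_CACHE_TYPE=q8_0     # optional: roughly halves KV-cache VRAM use
ollama serve
```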
Ollama ps shows 100% GPU but tok/s is still terrible — why?
100% GPU in ollama ps means all layers were handed to the GPU at load time. It does NOT mean those layers fit in VRAM. If the model file is larger than VRAM, Ollama tells the GPU to load the layers but the OS pages excess memory through system RAM via the GPU driver's unified memory (UVM) path. This is the worst-case scenario: tok/s drops to 1-3 while GPU utilization stays high. The fix: smaller quant or smaller model — no software setting overrides physics.
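A back-of-envelope sketch of that arithmetic; every number below is a placeholder assumption to replace with your own:

```bash
MODEL_GB=40       # on-disk size of the quant, e.g. a 70B Q4_K_M
KV_CACHE_GB=8     # KV cache at your planned context length
OVERHEAD_GB=1     # CUDA context, scratch buffers
VRAM_GB=24        # your card

NEEDED=$((MODEL_GB + KV_CACHE_GB + OVERHEAD_GB))
if [ "$NEEDED" -gt "$VRAM_GB" ]; then
  echo "Needs ~${NEEDED} GB but the card has ${VRAM_GB} GB: expect paging or CPU offload"
fi
```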
Will Ollama ever support multi-GPU out of the box?
Partially. Ollama will already split a large model across multiple GPUs (layer split), but it doesn't do tensor parallelism, so don't expect near-linear scaling; as of 2026 it still targets the single-GPU consumer workflow. For serious multi-GPU inference, switch to llama.cpp directly (pass `--split-mode row` and `--tensor-split`), or use vLLM/ExLlamaV2 for production multi-GPU serving. Ollama's sweet spot remains one model on one box with minimal tuning.
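For reference, a hedged llama.cpp invocation that splits one GGUF across two cards; the model path is an example and older builds name the server binary differently:

```bash
# Offload all layers and split them roughly 50/50 across GPU 0 and GPU 1
./llama-server -m ./llama-3.1-70b-q4_k_m.gguf \
  --n-gpu-layers 99 \
  --split-mode row \
  --tensor-split 1,1
```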
How do I verify the model file is actually the size I think it is?
Check the Ollama blobs directory: `ls -lh ~/.ollama/models/blobs/` for Linux/Mac, or `dir "%USERPROFILE%\.ollama\models\blobs\"` on Windows. Each blob filename is a SHA256 digest. Cross-reference the size with the model's Ollama library page — a 70B Q4_K_M should be roughly 40 GB. If the blob is substantially larger (e.g., a Q8 download when you expected Q4), you pulled the wrong tag.
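Two quicker checks that don't require digging through digests (the tag is an example):

```bash
# On-disk size of every installed model, as Ollama reports it
ollama list

# Parameter count and quantization level for one tag
ollama show llama3.1:70b-instruct-q4_K_M
```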
Related troubleshooting
Why CUDA OOM happens during local LLM inference and image gen, how to confirm the real cause, and the four real fixes (smaller quant, shorter context, gradient checkpointing, or more VRAM).
Mid-inference crashes (segfault, illegal memory access, kernel panic) usually mean VRAM ECC, thermal throttling, PSU instability, or a bad model file. Here's the diagnostic order.
When the fix is hardware
A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: