Ollama is slow — diagnose CPU fallback in 3 minutes
Ollama silently falls back to CPU when it can't load a model into VRAM. Here's how to confirm the fallback, force GPU usage, and pick a model that actually fits.
Diagnostic order — most likely first
Ollama loaded the model on CPU because it didn't fit in VRAM
Run `ollama ps` while a model is loaded. If the `PROCESSOR` column shows `100% CPU` or a split like `50%/50% CPU/GPU`, the model is partially or fully on CPU. Expect 1-5 tok/s instead of 30-80.
Use a smaller quant. `ollama pull llama3.1:70b-q4_K_M` instead of `llama3.1:70b-q5_K_M`. Or switch to a smaller model. The tier-by-tier picks live in our buyer guide.
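A minimal sketch of the check-and-fix loop, assuming Ollama is running and a model is loaded; the tag shown is an example, so confirm exact names on the model's Ollama library page:

```bash
# Show what's loaded and how it's split between CPU and GPU
ollama ps

# Flag any CPU share automatically (column layout can vary slightly by version)
ollama ps | grep -i 'cpu' && echo "CPU fallback detected: use a smaller quant or model"

# Drop to a smaller quant and reload
ollama pull llama3.1:70b-instruct-q4_K_M
```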
GPU drivers / CUDA version mismatch
`ollama serve` logs show 'no compatible GPUs found' or 'CUDA library not found.' `nvidia-smi` works in your shell but Ollama can't see the GPU.
Reinstall Ollama after updating the NVIDIA driver, then restart the Ollama service so GPU detection runs again. Ollama bundles its own CUDA runtime, so on Linux the full `nvidia-cuda-toolkit` package usually isn't required; what matters is that the driver is new enough for the CUDA branch your Ollama build targets. The same order applies on Windows: update the driver first, then reinstall or update Ollama.
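On the standard Linux systemd install (an assumption; adjust for Docker or manual setups), the detection check and restart look roughly like this:

```bash
# Driver-level view: if this fails, the problem is the driver, not Ollama
nvidia-smi

# What Ollama detected at startup
journalctl -u ollama --no-pager | grep -iE 'cuda|gpu' | tail -n 20

# After a driver update or reinstall, restart the service so detection runs again
sudo systemctl restart ollama
```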
Model file is too large to fit fully in VRAM
A 70B Q4 model (~40 GB on disk) on a 24 GB card. Ollama offloads the layers that fit to the GPU and runs the rest on the CPU out of system RAM. tok/s is brutal.
Match model size to VRAM. For 24 GB: 32B Q4-Q5 (a 70B Q4 is ~40 GB and will always spill to CPU). For 16 GB: 13B-14B Q4. For 12 GB: 7B-13B Q4.
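A rough fit check against the tiers above; `ollama list` reports sizes as Ollama sees them on disk, and the nvidia-smi query shows what the card can actually hold:

```bash
# Installed models and their on-disk sizes
ollama list

# Total and currently free VRAM
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
# Rule of thumb: model size + KV cache + ~1 GB of overhead must stay under memory.total
```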
Context window too large for VRAM headroom
The model loads fine but generation is slow. The KV cache for a 32K context on a 70B is ~6-8 GB; if you didn't budget for it, Ollama offloads fewer layers to the GPU to make room and the rest lands on the CPU.
Set `num_ctx` in your Modelfile, or lower the server-wide default via the `OLLAMA_CONTEXT_LENGTH` env var on recent builds. 4K-8K is fine for most chat; 32K+ on a large model needs a 32 GB card.
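A minimal Modelfile sketch that caps the context window; the base tag and the new model name are examples:

```bash
cat > Modelfile <<'EOF'
FROM llama3.1:70b-instruct-q4_K_M
PARAMETER num_ctx 8192
EOF

ollama create llama3.1-70b-8k -f Modelfile
ollama run llama3.1-70b-8k
```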
Other GPU consumers holding VRAM
`nvidia-smi` shows Chrome / OBS / a game eating 4 GB.
Close the other consumers, or pin Ollama to a dedicated GPU by setting `CUDA_VISIBLE_DEVICES` to that card's index.
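To see who is holding VRAM and to pin Ollama to a specific card (GPU index 1 below is an example):

```bash
# The process table at the bottom of the output lists everything holding VRAM
nvidia-smi

# Pin a manually launched server to one card
CUDA_VISIBLE_DEVICES=1 ollama serve
# For the Linux service install, add the variable via `sudo systemctl edit ollama` instead
```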
If tok/s is consistently below expectations even with the model fully on the GPU, the bottleneck is usually memory bandwidth, not driver tuning. The guide below frames the hardware decision for Ollama specifically.
Frequently asked questions
How do I tell if Ollama is using my GPU?
Run `ollama ps` while a model is loaded. The `PROCESSOR` column tells you the split. 100% GPU = all good. Anything with CPU in it = partial or full CPU fallback. You can also run `nvidia-smi` and look for the `ollama` process in the list.
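A quick cross-check from the driver side; the exact process name varies by Ollama version:

```bash
# If nothing matches while a model is generating, Ollama isn't touching the GPU
nvidia-smi | grep -i ollama
```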
What's a 'good' tok/s for Ollama on a 24 GB GPU?
Typical numbers on an RTX 3090 / 4090: 7B Q4 ~80-120 tok/s, 13B Q4 ~50-70, 32B Q4 ~25-35. A 70B Q4 doesn't fit in 24 GB, so expect low single digits on one card unless you split it across GPUs. If you're 5-10x below these, you're on CPU fallback.
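To measure your own numbers instead of guessing, `ollama run --verbose` prints timing stats after each response; the tag is an example:

```bash
# "eval rate" in the summary is your generation tok/s
ollama run llama3.1:8b-instruct-q4_K_M "Explain mmap in two sentences." --verbose
```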
Can I make Ollama run faster without buying new hardware?
Sometimes. A smaller quant (Q4_K_M → Q3_K_M), a shorter context window, flash attention (set `OLLAMA_FLASH_ATTENTION=1`), and closing other GPU consumers can reclaim 20-50% throughput. But if you're trying to run a 70B on 12 GB of VRAM, no software fix helps: the model doesn't fit.
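The server-side switches, set before starting Ollama; KV-cache quantization is only honored on builds that support it and requires flash attention:

```bash
export OLLAMA_FLASH_ATTENTION=1      # reduces KV-cache overhead, biggest win at long context
export OLLAMA_KV_CACHE_TYPE=q8_0     # optional: roughly halves KV-cache VRAM use
ollama serve
```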
Ollama ps shows 100% GPU but tok/s is still terrible — why?
100% GPU in ollama ps means all layers were handed to the GPU at load time. It does NOT mean those layers fit in VRAM. If the model file is larger than VRAM, Ollama tells the GPU to load the layers but the OS pages excess memory through system RAM via the GPU driver's unified memory (UVM) path. This is the worst-case scenario: tok/s drops to 1-3 while GPU utilization stays high. The fix: smaller quant or smaller model — no software setting overrides physics.
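A back-of-envelope sketch of that arithmetic; every number below is a placeholder assumption to replace with your own:

```bash
MODEL_GB=40       # on-disk size of the quant, e.g. a 70B Q4_K_M
KV_CACHE_GB=8     # KV cache at your planned context length
OVERHEAD_GB=1     # CUDA context, scratch buffers
VRAM_GB=24        # your card

NEEDED=$((MODEL_GB + KV_CACHE_GB + OVERHEAD_GB))
if [ "$NEEDED" -gt "$VRAM_GB" ]; then
  echo "Needs ~${NEEDED} GB but the card has ${VRAM_GB} GB: expect paging or CPU offload"
fi
```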
Will Ollama ever support multi-GPU out of the box?
Partially. Ollama will already split a large model across multiple GPUs (layer split), but it doesn't do tensor parallelism, so don't expect near-linear scaling; as of 2026 it still targets the single-GPU consumer workflow. For serious multi-GPU inference, switch to llama.cpp directly (pass `--split-mode row` and `--tensor-split`), or use vLLM/ExLlamaV2 for production multi-GPU serving. Ollama's sweet spot remains one model on one box with minimal tuning.
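For reference, a hedged llama.cpp invocation that splits one GGUF across two cards; the model path is an example and older builds name the server binary differently:

```bash
# Offload all layers and split them roughly 50/50 across GPU 0 and GPU 1
./llama-server -m ./llama-3.1-70b-q4_k_m.gguf \
  --n-gpu-layers 99 \
  --split-mode row \
  --tensor-split 1,1
```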
How do I verify the model file is actually the size I think it is?
Check the Ollama blobs directory: `ls -lh ~/.ollama/models/blobs/` for Linux/Mac, or `dir "%USERPROFILE%\.ollama\models\blobs\"` on Windows. Each blob filename is a SHA256 digest. Cross-reference the size with the model's Ollama library page — a 70B Q4_K_M should be roughly 40 GB. If the blob is substantially larger (e.g., a Q8 download when you expected Q4), you pulled the wrong tag.
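Two quicker checks that don't require digging through digests (the tag is an example):

```bash
# On-disk size of every installed model, as Ollama reports it
ollama list

# Parameter count and quantization level for one tag
ollama show llama3.1:70b-instruct-q4_K_M
```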
Related troubleshooting
Why CUDA OOM happens during local LLM inference and image gen, how to confirm the real cause, and the four real fixes (smaller quant, shorter context, gradient checkpointing, or more VRAM).
Mid-inference crashes (segfault, illegal memory access, kernel panic) usually mean VRAM ECC, thermal throttling, PSU instability, or a bad model file. Here's the diagnostic order.
When the fix is hardware
A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: