
Ollama vs llama.cpp — wrapper vs raw runtime

Ollama: Local-first wrapper over llama.cpp with ergonomic model management.

llama.cpp: Cross-platform CPU+GPU inference; the reference portable runtime.

Ollama wraps llama.cpp. Underneath, the inference engine is the same — the throughput gap is small. The decision is about the layer above: do you want Ollama's ergonomics, or do you want llama.cpp's control?

Ollama wins on developer experience by a mile. `ollama pull llama3` and you're running. Model management, OpenAI-compatible API, auto-update — all handled. llama.cpp gives you full control over build flags, kernel selection, server config — at the cost of writing more shell.
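
As a rough sketch of the two first-run paths (install commands, build flags, and the GGUF filename below are illustrative and vary by platform and release):

```sh
# Ollama path: install, pull, run
curl -fsSL https://ollama.com/install.sh | sh   # Linux installer; macOS/Windows use the app
ollama pull llama3
ollama run llama3

# llama.cpp path: clone, build, fetch a GGUF yourself, then serve
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON        # assumes an NVIDIA build; Metal/Vulkan flags differ
cmake --build build --config Release -j
./build/bin/llama-server -m ./models/llama-3-8b-instruct.Q4_K_M.gguf   # placeholder GGUF path
```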

Most operators start with Ollama. Some grow out of it as their needs get specific (custom kernel flags, manual KV cache sizing, multi-GPU layer splits).
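
When that happens, the invocation people graduate to looks roughly like the sketch below; the model path and split ratios are placeholders, and exact flag names shift between llama.cpp releases:

```sh
# A hand-tuned llama-server launch: explicit KV cache size (--ctx-size),
# full GPU offload (--n-gpu-layers), and a manual 75/25 layer split across
# two cards (--tensor-split); these are the knobs Ollama doesn't surface directly.
./build/bin/llama-server \
  -m ./models/my-model.Q4_K_M.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 99 \
  --tensor-split 3,1 \
  --port 8080
```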

Quick decision rules

  • First-time local AI user who wants one binary that works → Choose Ollama.
  • Need custom build flags or experimental kernels → Choose llama.cpp. Ollama doesn't expose all llama.cpp config knobs.
  • Multi-machine deployment with reproducibility requirements → Choose llama.cpp. Pin a llama.cpp commit (see the pinning sketch below); Ollama auto-update can drift.
  • Need an OpenAI-compatible API plus simple model management → Choose Ollama.
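
"Pin a commit" is less abstract than it sounds; a minimal sketch, with the commit SHA and model filename left as placeholders:

```sh
# Pin the runtime: build from an exact commit rather than a moving branch
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout <commit-sha>                 # record this SHA in your deploy notes
cmake -B build
cmake --build build --config Release -j

# Pin the weights: record GGUF checksums once, verify on every machine
sha256sum models/my-model.Q4_K_M.gguf > MODELS.sha256
sha256sum -c MODELS.sha256
```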

Operational matrix

| Dimension | Ollama | llama.cpp |
|---|---|---|
| Setup time (first-success latency for a new user) | Excellent: single installer; first model running in under 5 min | Acceptable: compile + flag selection + GGUF download; ~30 min first time |
| Model management (pulling, caching, updating models) | Excellent: `ollama pull` + manifest is the design point | Limited: manual; download GGUF files and organize them yourself |
| OpenAI-compatible API (drop-in for existing tools) | Excellent: built-in `/v1/chat/completions` | Strong: `llama-server` provides an OpenAI-compatible mode |
| Build / kernel flexibility (custom compile flags, kernel selection) | Limited: hidden behind environment variables; some flags missing | Excellent: full Make/CMake control; the design point |
| Multi-GPU split (layer split across cards) | Acceptable: auto-split across GPUs; coarse env-var control only | Strong: manual `--n-gpu-layers` + `--tensor-split` for fine control |
| Reproducibility (same setup six months later) | Strong: manifest + model digest pin; auto-update can drift if you don't pin | Excellent: pin a commit hash + GGUF; the most reproducible runtime |
| Maintenance burden (operator hours per month) | Excellent: effectively zero; auto-update + restart on schedule | Strong: <1 h/mo if you pin; you choose when to upgrade |
| Concurrent users (how throughput holds up) | Limited: `OLLAMA_NUM_PARALLEL` helps; not a serving runtime | Limited: same ceiling; switch to vLLM for serving |
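
On the API row: both servers speak the same OpenAI-style dialect, so a client only needs a base-URL change. A rough illustration, assuming default ports (Ollama on 11434, `llama-server` on 8080) and a locally available model named `llama3`:

```sh
# Works against Ollama as-is; for llama-server, swap the base URL to
# http://localhost:8080/v1/chat/completions (the model field is then informational).
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3",
        "messages": [{"role": "user", "content": "Say hello in five words."}]
      }'
```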

Failure modes — what breaks first

Ollama

  • Auto-update can ship a regression that breaks your model
  • Hidden config knobs — some llama.cpp flags aren't exposed
  • WSL backend flakiness on Windows GPU
  • Daemon restart loses concurrent state

llama.cpp

  • GGUF format drift after major version bumps
  • Build flag combinations that compile but produce wrong output
  • Manual model file management → broken symlinks
  • Vulkan support varies wildly by GPU + driver
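
Neither list has a clean fix, but both argue for the same habit: run a fixed smoke test after every auto-update or rebuild, before real traffic hits the box. A minimal sketch, with the endpoint and model name as placeholders:

```sh
#!/usr/bin/env sh
# Post-upgrade smoke test: send a fixed prompt, fail loudly if the server
# errors or the reply has no content field. Point ENDPOINT at Ollama
# (default :11434) or llama-server (default :8080) as appropriate.
ENDPOINT="${ENDPOINT:-http://localhost:11434/v1/chat/completions}"
REPLY=$(curl -sf "$ENDPOINT" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Reply with OK."}]}') \
  || { echo "smoke test: request failed"; exit 1; }
printf '%s\n' "$REPLY" | grep -q '"content"' \
  || { echo "smoke test: unexpected reply: $REPLY"; exit 1; }
echo "smoke test passed"
```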

Editorial verdict

Default to Ollama. The DX gap is enormous — model management, auto-update, OpenAI-compatible API, and a sane out-of-the-box config make first-success time five minutes instead of an hour.

Switch to llama.cpp when (a) you need custom build flags Ollama doesn't expose (rare for hobby users; common for advanced multi-GPU setups), (b) you need exact reproducibility across machines, or (c) you're shipping a product that embeds inference and you don't want a wrapper layer.

Don't switch to llama.cpp 'because it's faster' — they're the same engine. Performance differences are usually config differences.
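
Before attributing a gap to the engine, compare the effective configs. A hedged sketch of where to look (flag names may differ across Ollama versions; `llama3` is a placeholder model name):

```sh
# What Ollama is actually running: quantization, template, and parameters
ollama show llama3 --modelfile
ollama show llama3 --parameters
ollama ps          # how much of the loaded model sits on GPU vs CPU

# Then make sure the llama.cpp side matches: same GGUF quantization,
# same context size, same GPU offload, before comparing tokens/sec.
```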
