Ollama vs llama.cpp vs vLLM — which runtime should I use?
The answer
One paragraph. No hedging beyond what the data actually warrants.
The 30-second decision rule:
- Solo / small team → Ollama. The "just works" path.
- Production multi-user serving → vLLM. Continuous batching is non-optional.
- Privacy-strict + auditable → llama.cpp. Source-buildable, no telemetry.
The honest tradeoffs:
Ollama wraps llama.cpp underneath but ships as a single binary with one-command install, model catalog, automatic GPU detection, and an OpenAI-compatible HTTP API at :11434. Cost: ~30-40% lower throughput than vLLM under concurrent load, limited multi-user support. Pick this first. Switch only if you hit its limits.
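If you're wiring Ollama into existing code, the OpenAI-compatible endpoint means the stock `openai` client works unchanged. A minimal sketch, assuming the `openai` Python package and a model already pulled locally; the model name is an example, not a requirement:

```python
# Minimal sketch: Ollama's OpenAI-compatible endpoint at :11434.
# Assumes `pip install openai` and `ollama pull llama3.1` (or any model).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's default port
    api_key="ollama",  # the client requires a key; Ollama ignores it
)

resp = client.chat.completions.create(
    model="llama3.1",  # example name; use whatever you pulled
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp.choices[0].message.content)
```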
llama.cpp is the underlying C++ inference engine. Builds from source in ~2 minutes, no telemetry, no network calls, runs on every architecture under the sun. Cost: more flag-juggling than Ollama. Worth the trouble when (a) you're auditing the whole stack for compliance, (b) you're on an architecture Ollama doesn't ship binaries for, or (c) you need to customize the inference path.
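If the audit case extends to scripting the engine directly, without an HTTP layer in between, the llama-cpp-python bindings are one route; note they're a separate project wrapping the same engine, so a full-stack audit has to cover them too. A minimal sketch, assuming `pip install llama-cpp-python` and a GGUF file on disk; the path is a placeholder:

```python
# Minimal sketch using the llama-cpp-python bindings (a separate
# project wrapping the same C++ engine). The model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model-q4_k_m.gguf",  # any local GGUF file
    n_gpu_layers=-1,  # offload all layers to GPU when one is available
    n_ctx=4096,       # context window
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize GGUF in one sentence."}],
)
print(out["choices"][0]["message"]["content"])
```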
vLLM is the production-grade serving runtime: continuous batching + paged attention + prefix caching. CUDA-first hardware support: no Apple Silicon, and AMD GPUs only via the ROCm 6.4+ support that landed in late 2025. 3-5× higher throughput than Ollama under multi-user load. Cost: Python + CUDA setup, more moving parts, and it expects to run as a serving process, not a CLI.
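The concurrency story is visible in the API itself: vLLM's offline interface takes a whole list of prompts and lets continuous batching schedule them. A minimal sketch, assuming `pip install vllm` on a supported GPU; the model name is an example (the serving mode, `vllm serve <model>`, exposes an OpenAI-compatible endpoint instead):

```python
# Minimal sketch of vLLM's offline batch API. One call, many prompts;
# the engine's continuous batching handles scheduling internally.
# Model name is an example; any model vLLM supports works.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [f"Write a one-line haiku about the number {i}." for i in range(32)]
outputs = llm.generate(prompts, params)  # batched, not sequential

for out in outputs:
    print(out.outputs[0].text.strip())
```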
The misconception to avoid: "vLLM is faster than Ollama, so I should always use vLLM." False. On a single-user workload (1 request at a time), Ollama and vLLM are within 5-10% of each other. vLLM's win is concurrency. If you're not serving multiple users, vLLM's complexity is overhead with no upside.
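Don't take those percentages on faith for your hardware: all three runtimes speak the same OpenAI-compatible protocol, so a rough concurrency probe is a few lines. A sketch, not a rigorous benchmark, assuming the `httpx` package; the URL, model name, and concurrency level are placeholders to swap for your own stack:

```python
# Rough concurrency probe for any OpenAI-compatible endpoint
# (Ollama at :11434/v1, vLLM serve at :8000/v1, llama-server at :8080/v1).
# Measures wall-clock time for N simultaneous requests; run it once at
# CONCURRENCY = 1 and once at 8+ to see where each runtime's curve bends.
import asyncio
import time

import httpx

BASE_URL = "http://localhost:11434/v1"  # swap for the runtime under test
MODEL = "llama3.1"                      # example name
CONCURRENCY = 8

async def one_request(client: httpx.AsyncClient) -> None:
    r = await client.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": "Count to twenty."}],
            "max_tokens": 128,
        },
        timeout=120.0,
    )
    r.raise_for_status()

async def main() -> None:
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        await asyncio.gather(*(one_request(client) for _ in range(CONCURRENCY)))
        print(f"{CONCURRENCY} concurrent requests in "
              f"{time.perf_counter() - start:.1f}s")

asyncio.run(main())
```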
The fourth option you might want: MLX-LM on Apple Silicon. None of the above are Apple-native the way MLX is. Use Ollama if you want the easy path (on macOS it runs llama.cpp's Metal backend under the hood), or mlx_lm directly if you want to tune.
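A minimal sketch of the direct route, assuming `pip install mlx-lm` on an Apple Silicon Mac; the model repo is an example from the mlx-community hub:

```python
# Minimal sketch using mlx_lm directly on Apple Silicon.
# The repo name is an example; any MLX-converted model works.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
text = generate(
    model,
    tokenizer,
    prompt="One sentence on why MLX exists.",
    max_tokens=64,
)
print(text)
```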
Where we got the numbers
Throughput claims: vLLM continuous batching paper + community benchmarks on r/LocalLLaMA. Ollama wrapping llama.cpp: Ollama docs + source code.
Also see
- Ollama: setup, model catalog, common gotchas.
- vLLM: production-serving configuration, tensor parallelism, prefix caching.
- llama.cpp: building from source, runtime flags, the offline audit case.