Ollama vs llama.cpp vs vLLM — which runtime should I use?
The answer
One paragraph. No hedging beyond what the data actually warrants.
The 30-second decision rule:
- Solo / small team → Ollama. The "just works" path.
- Production multi-user serving → vLLM. Continuous batching is non-optional.
- Privacy-strict + auditable → llama.cpp. Source-buildable, no telemetry.
The honest tradeoffs:
Ollama wraps llama.cpp underneath but ships as a single binary with one-command install, model catalog, automatic GPU detection, and an OpenAI-compatible HTTP API at :11434. Cost: ~30-40% lower throughput than vLLM under concurrent load, limited multi-user support. Pick this first. Switch only if you hit its limits.
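If you're wiring Ollama into existing code, the OpenAI-compatible endpoint means the stock `openai` client works unchanged. A minimal sketch, assuming the `openai` Python package and a model already pulled locally; the model name is an example, not a requirement:

```python
# Minimal sketch: Ollama's OpenAI-compatible endpoint at :11434.
# Assumes `pip install openai` and `ollama pull llama3.1` (or any model).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's default port
    api_key="ollama",  # the client requires a key; Ollama ignores it
)

resp = client.chat.completions.create(
    model="llama3.1",  # example name; use whatever you pulled
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp.choices[0].message.content)
```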
llama.cpp is the underlying C++ inference engine. Builds from source in ~2 minutes, no telemetry, no network calls, runs on every architecture under the sun. Cost: more flag-juggling than Ollama. Worth the trouble when (a) you're auditing the whole stack for compliance, (b) you're on an architecture Ollama doesn't ship binaries for, or (c) you need to customize the inference path.
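If the audit case extends to scripting the engine directly, without an HTTP layer in between, the llama-cpp-python bindings are one route; note they're a separate project wrapping the same engine, so a full-stack audit has to cover them too. A minimal sketch, assuming `pip install llama-cpp-python` and a GGUF file on disk; the path is a placeholder:

```python
# Minimal sketch using the llama-cpp-python bindings (a separate
# project wrapping the same C++ engine). The model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model-q4_k_m.gguf",  # any local GGUF file
    n_gpu_layers=-1,  # offload all layers to GPU when one is available
    n_ctx=4096,       # context window
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize GGUF in one sentence."}],
)
print(out["choices"][0]["message"]["content"])
```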
vLLM is the production-grade serving runtime: continuous batching + paged attention + prefix caching. CUDA-first hardware support: no Apple Silicon, and AMD GPUs only via the ROCm 6.4+ support that landed in late 2025. 3-5× higher throughput than Ollama under multi-user load. Cost: Python + CUDA setup, more moving parts, and it expects to run as a serving process, not a CLI.
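The concurrency story is visible in the API itself: vLLM's offline interface takes a whole list of prompts and lets continuous batching schedule them. A minimal sketch, assuming `pip install vllm` on a supported GPU; the model name is an example (the serving mode, `vllm serve <model>`, exposes an OpenAI-compatible endpoint instead):

```python
# Minimal sketch of vLLM's offline batch API. One call, many prompts;
# the engine's continuous batching handles scheduling internally.
# Model name is an example; any model vLLM supports works.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [f"Write a one-line haiku about the number {i}." for i in range(32)]
outputs = llm.generate(prompts, params)  # batched, not sequential

for out in outputs:
    print(out.outputs[0].text.strip())
```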
The misconception to avoid: "vLLM is faster than Ollama, so I should always use vLLM." False. On a single-user workload (1 request at a time), Ollama and vLLM are within 5-10% of each other. vLLM's win is concurrency. If you're not serving multiple users, vLLM's complexity is overhead with no upside.
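Don't take those percentages on faith for your hardware: all three runtimes speak the same OpenAI-compatible protocol, so a rough concurrency probe is a few lines. A sketch, not a rigorous benchmark, assuming the `httpx` package; the URL, model name, and concurrency level are placeholders to swap for your own stack:

```python
# Rough concurrency probe for any OpenAI-compatible endpoint
# (Ollama at :11434/v1, vLLM serve at :8000/v1, llama-server at :8080/v1).
# Measures wall-clock time for N simultaneous requests; run it once at
# CONCURRENCY = 1 and once at 8+ to see where each runtime's curve bends.
import asyncio
import time

import httpx

BASE_URL = "http://localhost:11434/v1"  # swap for the runtime under test
MODEL = "llama3.1"                      # example name
CONCURRENCY = 8

async def one_request(client: httpx.AsyncClient) -> None:
    r = await client.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": "Count to twenty."}],
            "max_tokens": 128,
        },
        timeout=120.0,
    )
    r.raise_for_status()

async def main() -> None:
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        await asyncio.gather(*(one_request(client) for _ in range(CONCURRENCY)))
        print(f"{CONCURRENCY} concurrent requests in "
              f"{time.perf_counter() - start:.1f}s")

asyncio.run(main())
```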
The fourth option you might want: MLX-LM on Apple Silicon. None of the above are Apple-native the way MLX is. Use Ollama if you want the easy path (on macOS it runs llama.cpp's Metal backend under the hood), or mlx_lm directly if you want to tune.
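A minimal sketch of the direct route, assuming `pip install mlx-lm` on an Apple Silicon Mac; the model repo is an example from the mlx-community hub:

```python
# Minimal sketch using mlx_lm directly on Apple Silicon.
# The repo name is an example; any MLX-converted model works.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
text = generate(
    model,
    tokenizer,
    prompt="One sentence on why MLX exists.",
    max_tokens=64,
)
print(text)
```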
Where we got the numbers
Throughput claims: vLLM continuous batching paper + community benchmarks on r/LocalLLaMA. Ollama wrapping llama.cpp: Ollama docs + source code.
Also see
- Ollama: setup, model catalog, common gotchas.
- vLLM: production-serving configuration, tensor parallelism, prefix caching.
- llama.cpp: building from source, runtime flags, the offline audit case.