Cross-engine comparison
Editorial

vLLM vs llama.cpp vs Ollama vs MLX vs LM Studio

Five runtimes that matter for local AI in 2026, scored along eleven operational dimensions. Each tier is editorial, grounded in the public benchmark corpus, and carries a one-line caveat naming the assumption. No runtime wins on every dimension; this matrix surfaces what each one trades.

Why we publish this and inference engine vendors don't: we have no engine to sell. The analytical layer above any single runtime is the only place a neutral comparison can live.

The five runtimes and the role each is built for:

- vLLM: production serving
- llama.cpp: cross-platform CPU+GPU
- Ollama: local-first wrapper
- MLX: Apple Silicon native
- LM Studio: desktop app

Raw throughput (decode tok/s)
Single-stream tok/s on equivalent hardware + model.
- vLLM: Excellent. Continuous batching + paged attention; consistently fastest at concurrent load.
- llama.cpp: Strong. GGUF + CUDA/Metal/Vulkan kernels; gap to vLLM closes on single-stream.
- Ollama: Strong. Wraps llama.cpp; throughput within a few % when configured well.
- MLX: Strong. Optimized for Apple Silicon unified memory; matches llama.cpp on M-series.
- LM Studio: Strong. Wraps llama.cpp; UI overhead is the only cost vs raw llama.cpp.
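
As a reference point for how single-stream decode numbers are typically collected, here is a minimal sketch using llama.cpp's bundled llama-bench tool; the model path and token budgets are placeholders, and comparisons only hold when every runtime loads the same weights at the same quantization on the same hardware.

```bash
# Single-stream benchmark: -p is the prompt size to process, -n the number
# of tokens to decode; llama-bench reports tok/s for both phases.
./llama-bench -m ./models/model-q4_k_m.gguf -p 512 -n 128
```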

Concurrent users / multi-tenant
How throughput holds up under N concurrent requests.
- vLLM: Excellent. Built for it. Linear scaling to dozens of users on a single GPU.
- llama.cpp: Limited. Sequential by default; add a frontend (LocalAI, llama-swap) to multiplex.
- Ollama: Limited. OLLAMA_NUM_PARALLEL helps; far from vLLM's continuous-batching tier.
- MLX: Acceptable. Single-user is the typical Apple Silicon workload; multi-user is a research corner.
- LM Studio: Limited. Single-user desktop app; not a concurrent-serving target.
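
The two knobs named above look roughly like this in practice; a minimal sketch assuming a recent vLLM build with the `vllm serve` entrypoint and a stock Ollama install, with the model name and values as placeholders.

```bash
# vLLM: continuous batching is on by default; --max-num-seqs caps how many
# requests are scheduled into a batch at once.
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-num-seqs 64

# Ollama: serve up to 4 requests per loaded model in parallel instead of
# queueing them one at a time.
OLLAMA_NUM_PARALLEL=4 ollama serve
```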

Maintenance burden
How much operator time the runtime costs in a year of operation.
- vLLM: Limited. CUDA + driver + Python + flash-attention version pinning. Production-grade ops or it breaks.
- llama.cpp: Strong. Self-contained binary or build; very few moving parts.
- Ollama: Excellent. Single binary, auto-update path, lowest operator overhead in the ecosystem.
- MLX: Strong. Apple-maintained framework; driver updates arrive with macOS updates.
- LM Studio: Excellent. GUI app; updates via the app itself.

Reproducibility
Can you stand up the same setup six months later?
- vLLM: Acceptable. Pin Python + CUDA + flash-attention + vLLM version. Multi-knob; reproducible if you write it down.
- llama.cpp: Strong. Pin commit hash + GGUF; that's it. Most reproducible runtime in the ecosystem.
- Ollama: Strong. Manifest + model digest pin. Auto-update can drift if you don't pin.
- MLX: Acceptable. Reproducible within macOS major version; Sonoma → Sequoia is a known break point.
- LM Studio: Limited. App version + model file; export config and pin yourself or accept drift.
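
A minimal sketch of what "write it down" can mean at the two ends of the spectrum; the version number, commit hash, and paths are placeholders rather than recommendations.

```bash
# vLLM: freeze the whole Python stack and record the driver that ran it.
python -m venv .venv && . .venv/bin/activate
pip install "vllm==X.Y.Z"                  # placeholder version
pip freeze > requirements.lock             # captures transitive dependencies
nvidia-smi --query-gpu=driver_version --format=csv,noheader > versions.txt

# llama.cpp: one commit hash plus one model checksum pins the whole setup.
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
git checkout <commit-hash>                 # placeholder commit
cmake -B build && cmake --build build --config Release
sha256sum ../models/model-q4_k_m.gguf > ../model.sha256
```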

OS support
Real-world stable on which platforms.
- vLLM: Limited. Linux first-class; Windows via WSL2; macOS not supported.
- llama.cpp: Excellent. Linux + macOS + Windows + iOS + Android. Most portable.
- Ollama: Strong. Linux + macOS + Windows; WSL backend for GPU on Windows.
- MLX: Limited. macOS only.
- LM Studio: Strong. Linux + macOS + Windows; not headless.

Lock-in risk
If the project went unmaintained tomorrow, what would you lose?
- vLLM: Acceptable. Open-source, but the ecosystem of optimizations is hard to replicate elsewhere.
- llama.cpp: Strong. GGUF format is portable; switching to KoboldCpp / LocalAI / Ollama keeps your weights working.
- Ollama: Strong. Wraps llama.cpp; underlying weights and digests stay portable.
- MLX: Limited. MLX-quantized weights are MLX-specific; converting back to GGUF/safetensors loses the quant.
- LM Studio: Acceptable. Uses GGUF; weights are portable, GUI is replaceable.
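
To make the portability argument concrete, the same GGUF file can be pointed at two different runtimes without conversion; a minimal sketch with placeholder paths and a placeholder model name.

```bash
# Serve the file directly with llama.cpp's built-in HTTP server.
./llama-server -m ./models/model-q4_k_m.gguf --port 8080

# Import the identical file into Ollama via a one-line Modelfile.
printf 'FROM ./models/model-q4_k_m.gguf\n' > Modelfile
ollama create my-model -f Modelfile
ollama run my-model
```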

Observability
Logs, metrics, traces — what you can see when something is wrong.
- vLLM: Strong. Prometheus metrics endpoint; structured logs; easy to integrate with Grafana.
- llama.cpp: Acceptable. Verbose-flag stderr; you write your own metrics scrape.
- Ollama: Acceptable. Server logs; OLLAMA_DEBUG; no native metrics endpoint.
- MLX: Limited. Library-level; you instrument your own wrapper.
- LM Studio: Limited. GUI logs window; not designed for ops.
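
A minimal sketch of the gap this dimension describes, assuming a vLLM OpenAI-compatible server on its default port 8000 and metric names carrying the vllm prefix; adjust host and port to your deployment.

```bash
# vLLM: Prometheus-format metrics are exposed on /metrics, ready to scrape.
curl -s http://localhost:8000/metrics | grep '^vllm:' | head

# Ollama: the nearest equivalent is verbose server logging, not metrics.
OLLAMA_DEBUG=1 ollama serve
```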

Multi-GPU support
Tensor / pipeline parallel across multiple cards.
- vLLM: Excellent. Tensor + pipeline parallel; first-class. The reason most multi-GPU rigs run vLLM.
- llama.cpp: Strong. Layer-split across GPUs; functional but not as fast as vLLM tensor parallel.
- Ollama: Acceptable. Inherits llama.cpp split; fine for inference, not for serving.
- MLX: Apple Silicon = one chip; multi-machine is a separate research path.
- LM Studio: Limited. Single-device app; multi-GPU is an edge case.
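
A minimal sketch of the two approaches named above on a hypothetical two-GPU box; the model names, layer count, and split ratio are placeholders.

```bash
# vLLM: shard the model across both GPUs with tensor parallelism.
vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 2

# llama.cpp: offload all layers, then split them across the two cards.
./llama-server -m ./models/model-q4_k_m.gguf -ngl 99 \
  --split-mode layer --tensor-split 1,1
```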

Speculative decoding
Draft model + verifier acceleration.
- vLLM: Strong. Production-grade EAGLE + Medusa support.
- llama.cpp: Strong. Built-in `--model-draft` flag; works on consumer hardware.
- Ollama: Limited. Inherits it when the bundled llama.cpp version supports it; not surfaced in the API.
- MLX: Acceptable. Available via mlx-lm; not as polished as llama.cpp.
- LM Studio: Limited. Not surfaced in the GUI at the time of writing.
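
A minimal sketch of the llama.cpp draft-model path named above; both model paths are placeholders, and the draft needs to share the target model's tokenizer (typically a much smaller model from the same family).

```bash
# The small draft model proposes tokens; the large target model verifies them.
./llama-server \
  -m  ./models/target-70b-q4_k_m.gguf \
  -md ./models/draft-1b-q8_0.gguf \
  --port 8080
```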

Mobile / edge
Phones, NPUs, embedded.
- vLLM: Server runtime; mobile is out of scope.
- llama.cpp: Excellent. Builds on iOS, Android, RPi. The reference mobile inference runtime.
- Ollama: Limited. Server-class; mobile is via llama.cpp directly, not Ollama.
- MLX: Strong. iPad + iPhone via mlx-swift.
- LM Studio: Desktop only.

Update cadence
How often the project ships releases.
- vLLM: Excellent. Active weekly cadence; large research org behind it.
- llama.cpp: Excellent. Multiple commits per day; the most active runtime project.
- Ollama: Strong. Frequent point releases; broadly stable.
- MLX: Strong. Active Apple-affiliated development; quarterly major + monthly point releases.
- LM Studio: Strong. Regular app releases; vendor-driven cadence.

How to read this matrix

Pick the dimensions that matter for YOUR deployment. A homelab operator running a single model on one GPU can ignore the multi-tenant dimension entirely; a small SaaS team running shared inference cannot.

When tiers are close, the deciding factor is usually maintenance burden + reproducibility. The fastest runtime on paper is not the fastest one to operate.

When tiers are far apart, the dimension is doing real work. vLLM at “limited” on maintenance burden is a real cost; Ollama at “limited” on multi-tenant is a real ceiling.

Next steps

FP16 vs Q8 vs Q5 vs Q4 — what each step down costs you in quality.