vLLM vs llama.cpp vs Ollama vs MLX vs LM Studio
Five runtimes that matter for local AI in 2026, scored across eleven operational dimensions. Each tier is editorial, grounded in the public benchmark corpus, and carries a one-line caveat naming the assumption behind it. No runtime wins on every dimension; this matrix surfaces what each one trades away.
Why we publish this and inference engine vendors don't: we have no engine to sell. The analytical layer above any single runtime is the only place a neutral comparison can live.
| Dimension | vLLM (production serving) | llama.cpp (cross-platform CPU+GPU) | Ollama (local-first wrapper) | MLX (Apple Silicon native) | LM Studio (desktop app) |
|---|---|---|---|---|---|
| Raw throughput (decode tok/s). Single-stream tok/s on equivalent hardware + model. | Excellent. Continuous batching + paged attention; consistently fastest at concurrent load. | Strong. GGUF + CUDA/Metal/Vulkan kernels; gap to vLLM closes on single-stream. | Strong. Wraps llama.cpp; throughput within a few % when configured well. | Strong. Optimized for Apple Silicon unified memory; matches llama.cpp on M-series. | Strong. Wraps llama.cpp; UI overhead is the only cost vs raw llama.cpp. |
| Concurrent users / multi-tenant. How throughput holds up under N concurrent requests (a minimal load probe follows the table). | Excellent. Built for it. Linear scaling to dozens of users on a single GPU. | Limited. Sequential by default; add a frontend (LocalAI, llama-swap) to multiplex. | Limited. OLLAMA_NUM_PARALLEL helps; far from vLLM's continuous-batching tier. | Acceptable. Single-user is the typical Apple Silicon workload; multi-user is a research corner. | Limited. Single-user desktop app; not a concurrent-serving target. |
| Maintenance burden. How much operator time the runtime costs in a year of operation. | Limited. CUDA + driver + Python + flash-attention version pinning. Production-grade ops or it breaks. | Strong. Self-contained binary or build; very few moving parts. | Excellent. Single binary, auto-update path, lowest operator overhead in the ecosystem. | Strong. Apple-managed framework; Apple Silicon driver = macOS update. | Excellent. GUI app; updates via the app itself. |
| Reproducibility. Can you stand up the same setup six months later? | Acceptable. Pin Python + CUDA + flash-attention + vLLM version. Multi-knob; reproducible if you write it down. | Strong. Pin commit hash + GGUF; that's it. Most reproducible runtime in the ecosystem. | Strong. Manifest + model digest pin. Auto-update can drift if you don't pin. | Acceptable. Reproducible within macOS major version; Sonoma → Sequoia is a known break point. | Limited. App version + model file; export config and pin yourself or accept drift. |
| OS support. Real-world stable on which platforms. | Limited. Linux first-class; Windows via WSL2; macOS not supported. | Excellent. Linux + macOS + Windows + iOS + Android. Most portable. | Strong. Linux + macOS + Windows; native Windows build with GPU support. | Limited. macOS only. | Strong. Linux + macOS + Windows; not headless. |
| Lock-in risk. If the project went unmaintained tomorrow, what would you lose? | Acceptable. Open-source, but the ecosystem of optimizations is hard to replicate elsewhere. | Strong. GGUF format is portable; switching to KoboldCpp / LocalAI / Ollama keeps your weights working. | Strong. Wraps llama.cpp; underlying weights and digests stay portable. | Limited. MLX-quantized weights are MLX-specific; converting back to GGUF/safetensors loses the quant. | Acceptable. Uses GGUF; weights are portable, GUI is replaceable. |
| Observability. Logs, metrics, traces: what you can see when something is wrong. | Strong. Prometheus metrics endpoint; structured logs; easy to integrate with Grafana. | Acceptable. Verbose-flag stderr; you write your own metrics scrape. | Acceptable. Server logs; OLLAMA_DEBUG; no native metrics endpoint. | Limited. Library-level; you instrument your own wrapper. | Limited. GUI logs window; not designed for ops. |
| Multi-GPU support. Tensor / pipeline parallel across multiple cards. | Excellent. Tensor + pipeline parallel; first-class. The reason most multi-GPU rigs run vLLM. | Strong. Layer-split across GPUs; functional but not as fast as vLLM tensor parallel. | Acceptable. Inherits llama.cpp split; fine for inference, not for serving. | — Apple Silicon = one chip; multi-machine is a separate research path. | Limited. Single-device app; multi-GPU is an edge case. |
| Speculative decoding. Draft model + verifier acceleration. | Strong. Production-grade EAGLE + Medusa support. | Strong. Built-in `--model-draft` flag; works on consumer hardware. | Limited. Inherits when llama.cpp version supports it; not surfaced in the API. | Acceptable. Available via mlx-lm; not as polished as llama.cpp. | Limited. Not surfaced in the GUI as of this writing. |
| Mobile / edge. Phones, NPUs, embedded. | — Server runtime; mobile is out of scope. | Excellent. Builds on iOS, Android, RPi. The reference mobile inference runtime. | Limited. Server-class; mobile is via llama.cpp directly, not Ollama. | Strong. iPad + iPhone via mlx-swift. | — Desktop only. |
| Update cadence. How often the project ships releases. | Excellent. Active weekly cadence; large research org behind it. | Excellent. Multiple commits per day; the most active runtime project. | Strong. Frequent point releases; broadly stable. | Strong. Active Apple-affiliated development; quarterly major + monthly point releases. | Strong. Regular app releases; vendor-driven cadence. |
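The concurrency row is the easiest one to verify on your own hardware. The sketch below is a minimal load probe, assuming an OpenAI-compatible /v1/chat/completions endpoint, which vLLM, llama.cpp's llama-server, Ollama (via its /v1 compatibility layer), and LM Studio's local server all expose; the URL, model name, and prompts are placeholders to adjust for your deployment.

```python
# Rough concurrency probe for any OpenAI-compatible endpoint.
# BASE_URL and MODEL are placeholders; point them at your own server.
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000/v1/chat/completions"  # placeholder endpoint
MODEL = "your-model-name"                                # placeholder model id
CONCURRENCY = 8                                          # N simultaneous requests

def one_request(prompt: str) -> int:
    """Send one non-streaming chat completion; return completion tokens (if the server reports usage)."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    req = urllib.request.Request(
        BASE_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data.get("usage", {}).get("completion_tokens", 0)

if __name__ == "__main__":
    prompts = [f"Explain topic #{i} in three sentences." for i in range(CONCURRENCY)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        tokens = sum(pool.map(one_request, prompts))
    elapsed = time.perf_counter() - start
    print(f"{tokens} tokens across {CONCURRENCY} requests in {elapsed:.1f}s "
          f"= {tokens / elapsed:.1f} aggregate tok/s")
```

Run it at CONCURRENCY = 1 and again at 8 or 16: a continuous-batching server holds aggregate tok/s roughly flat as N grows, while a sequential server's aggregate barely moves because requests queue behind each other.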
How to read this matrix
Pick the dimensions that matter for YOUR deployment. A homelab operator running a single model on one GPU can ignore the multi-tenant column entirely; a small SaaS team running shared inference cannot.
When tiers are close, the deciding factor is usually maintenance burden + reproducibility (see the pin-check sketch at the end of this section). The fastest runtime on paper is not the fastest one to operate.
When tiers are far apart, the dimension is doing real work. vLLM at “limited” on maintenance burden is a real cost; Ollama at “limited” on multi-tenant is a real ceiling.
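The "write it down" half of reproducibility can be made mechanical. The sketch below is an illustrative check, not any runtime's official tooling: the manifest contents (model path, digest, package version) are placeholders you would fill in from your own deployment. It verifies a GGUF file's sha256 and an installed package version against what you pinned.

```python
# Minimal reproducibility check: compare model digests and package versions
# against a pinned manifest. All paths, digests, and versions are placeholders.
import hashlib
import sys
from importlib.metadata import version, PackageNotFoundError

MANIFEST = {
    "models": {
        # path -> expected sha256 (placeholder digest)
        "models/llama-3.1-8b-instruct.Q5_K_M.gguf": "replace-with-pinned-digest",
    },
    "packages": {
        "vllm": "0.6.3",  # pin whatever version you actually deployed
    },
}

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def main() -> int:
    ok = True
    for path, expected in MANIFEST["models"].items():
        actual = sha256_of(path)
        if actual != expected:
            print(f"DRIFT: {path} digest {actual[:12]}... != pinned value")
            ok = False
    for pkg, expected in MANIFEST["packages"].items():
        try:
            actual = version(pkg)
        except PackageNotFoundError:
            actual = "missing"
        if actual != expected:
            print(f"DRIFT: {pkg} {actual} != pinned {expected}")
            ok = False
    print("environment matches manifest" if ok else "environment has drifted")
    return 0 if ok else 1

if __name__ == "__main__":
    sys.exit(main())
```

Run on a schedule or in CI, a check like this fails loudly when an auto-update drifts a model digest or a package version, which is exactly the failure mode the reproducibility row is warning about.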
Next steps
FP16 vs Q8 vs Q5 vs Q4 — what each step down costs you in quality.