vLLM vs llama.cpp — server vs portable inference
vLLM and llama.cpp solve different problems. vLLM is a production-grade LLM serving runtime; llama.cpp is a portable inference engine that runs anywhere. They overlap in single-stream tok/s on a single GPU but diverge on every other axis.
If you're serving multiple concurrent users, vLLM's continuous batching and paged attention will beat llama.cpp by an order of magnitude or more on aggregate throughput. If you're running one model on one machine for one user, or your machine isn't a Linux box with an NVIDIA GPU, llama.cpp wins on portability and simplicity.
Most operators end up using both: llama.cpp on the laptop / Mac / homelab, vLLM on the production rack.
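The concurrency claim above is easy to check on your own hardware. Here is a minimal probe, a sketch assuming an OpenAI-compatible server is already running locally (both `vllm serve` and llama.cpp's `llama-server` expose one); the base URL, model id, prompt, and the presence of a `usage` block in the response are assumptions to adjust for your deployment.

```python
# Sketch: fire N concurrent completion requests at an OpenAI-compatible
# endpoint and report aggregate wall-clock throughput. Placeholders:
# BASE_URL, MODEL, and the assumption that the server fills in `usage`.
import asyncio
import time

import httpx

BASE_URL = "http://localhost:8000/v1"  # vLLM default port; llama-server commonly uses 8080
MODEL = "my-model"                     # placeholder model id
CONCURRENCY = 16
MAX_TOKENS = 128


async def one_request(client: httpx.AsyncClient) -> int:
    resp = await client.post(
        f"{BASE_URL}/completions",
        json={
            "model": MODEL,
            "prompt": "Explain paged attention in one paragraph.",
            "max_tokens": MAX_TOKENS,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]


async def main() -> None:
    start = time.perf_counter()
    async with httpx.AsyncClient() as client:
        tokens = await asyncio.gather(*(one_request(client) for _ in range(CONCURRENCY)))
    elapsed = time.perf_counter() - start
    print(f"{sum(tokens)} tokens over {CONCURRENCY} streams in {elapsed:.1f}s "
          f"-> {sum(tokens) / elapsed:.1f} tok/s aggregate")


if __name__ == "__main__":
    asyncio.run(main())
```

Run it once with CONCURRENCY = 1 and once with 16; the gap between the two aggregate numbers is the batching advantage the matrix below is describing.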
Quick decision rules
- Ten or more concurrent users on one rig: vLLM.
- One user, one machine, or anything that isn't Linux + NVIDIA: llama.cpp.
- Phones, Raspberry Pi, Jetson, laptops: llama.cpp.
- Both kinds of workload in one shop: run both; they're different tools.
Operational matrix
| Dimension | vLLM (production serving runtime: continuous batching + paged attention) | llama.cpp (cross-platform CPU + GPU inference; the reference portable runtime) |
|---|---|---|
| Concurrent serving (10+ users on one rig) | Excellent. Built for it. | Limited. Sequential by default; use LocalAI or llama-swap to multiplex. |
| Single-stream tok/s (one user at a time) | Excellent. Fastest in the category. | Strong. Within a few percent on the same GPU. |
| OS portability (realistic stable platforms) | Limited. Linux first-class; Windows via WSL2; no macOS. | Excellent. Linux, macOS, Windows, iOS, Android. |
| Hardware portability (card types supported) | Strong. NVIDIA + AMD ROCm; CUDA-first. | Excellent. CUDA, Metal, Vulkan, OpenCL, CPU-only. |
| Reproducibility (stand up the same setup six months later) | Acceptable. Many knobs; pin Python + CUDA + flash-attention + vLLM. | Strong. Pin a commit + GGUF; that's it. |
| Multi-GPU (tensor-parallel across cards) | Excellent. Tensor + pipeline parallel; first-class. | Strong. Layer split; functional but slower than vLLM tensor parallelism. |
| Mobile / embedded (phones, RPi, Jetson) | N/A. Server runtime; out of scope. | Excellent. The reference mobile inference runtime. |
| Maintenance burden (operator hours per month) | Limited. 5-10 h/mo on driver, runtime, and pin updates. | Strong. Under 1 h/mo; self-contained binary. |
| Observability (logs + metrics) | Strong. Native Prometheus endpoint. | Acceptable. Verbose stderr; you write your own metrics. |
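On the observability row: the practical difference is that vLLM exposes Prometheus-format counters out of the box, so dashboards are a scrape job rather than a log parser. A minimal sketch, assuming the default `/metrics` path on the serving port; the metric names listed are taken from recent vLLM builds and may differ on yours.

```python
# Sketch: scrape vLLM's Prometheus endpoint and print the scheduler gauges
# most useful for capacity planning. METRICS_URL and the metric names are
# assumptions -- grep the raw /metrics output on your own build first.
import requests

METRICS_URL = "http://localhost:8000/metrics"  # placeholder host/port
INTERESTING = (
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:gpu_cache_usage_perc",
)

raw = requests.get(METRICS_URL, timeout=5).text
for line in raw.splitlines():
    if line.startswith(INTERESTING):
        print(line)
```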
Failure modes — what breaks first
vLLM
- Flash-attention pinning incompatibilities after a CUDA upgrade
- Pip dependency conflicts when the runtime ships a major release
- OOM on long contexts when the KV cache isn't pre-sized (see the sketch after this list)
- WSL2 GPU passthrough breaks on Windows kernel updates
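For the KV-cache item above, the fix is to size the cache at startup instead of letting a long request discover the limit for you. A minimal sketch against vLLM's offline Python API, assuming a placeholder model name; the same two knobs exist as `--max-model-len` and `--gpu-memory-utilization` flags on `vllm serve` in recent releases.

```python
# Sketch: pre-size the context window and VRAM budget so the KV cache is
# allocated at startup rather than OOMing mid-request. Model name is a
# placeholder; values depend on your card and workload.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    max_model_len=8192,            # hard cap on prompt + generation length
    gpu_memory_utilization=0.90,   # fraction of VRAM vLLM may claim for weights + KV cache
)

outputs = llm.generate(
    ["Summarize paged attention in two sentences."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```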
llama.cpp
- Outdated GGUF format after a major schema change (rare but happens)
- Metal kernel issues on macOS major-version transitions
- Vulkan support varies by driver — Intel/AMD inconsistent
- Older quants (Q4_0 / Q5_0) deprecated in favor of K-quants (see the sketch after this list)
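For the deprecated-quant item, it's worth auditing old GGUF files before a llama.cpp upgrade surprises you. A minimal sketch using the `gguf` package that ships in the llama.cpp repo (gguf-py), assuming a placeholder file path; the reader attributes shown reflect recent gguf-py releases and may shift between versions.

```python
# Sketch: list a GGUF file's tensor quantization types and flag the legacy
# Q4_0 / Q5_0 family. File path is a placeholder.
from collections import Counter

from gguf import GGUFReader

LEGACY = {"Q4_0", "Q4_1", "Q5_0", "Q5_1"}

reader = GGUFReader("model.Q4_0.gguf")  # placeholder path
quants = Counter(t.tensor_type.name for t in reader.tensors)
print("quant types:", dict(quants))
if LEGACY & quants.keys():
    print("legacy quants present; consider re-quantizing to a K-quant (e.g. Q4_K_M)")
```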
Editorial verdict
If your workload is single-user single-machine, llama.cpp is almost always the right answer. The maintenance burden is dramatically lower, the OS coverage is dramatically wider, and the throughput gap on single-stream is small enough not to matter day-to-day.
If you're serving anyone other than yourself (paying users, a small team, even a few colleagues), switch to vLLM the moment concurrent throughput matters. llama.cpp plus a multiplexer (LocalAI, llama-swap) gets you 80% of the way there, but vLLM's continuous batching is the structural answer.
Use both. Operators we trust run llama.cpp on every laptop/desktop they touch and vLLM only on the production rack. They're different tools.