ExLlamaV2 vs vLLM — single-stream specialist vs production server
ExLlamaV2 and vLLM are both NVIDIA-first inference engines but solve very different problems. ExLlamaV2 is a single-stream specialist — its EXL2 4-bit quants and tuned kernels often produce the highest single-user tok/s on consumer NVIDIA cards. vLLM is a production-tier serving runtime — its strength is concurrent throughput, not single-stream speed.
If you have one user and one card and you want every token, ExLlamaV2 frequently wins. If you have multiple users (or a single agent loop spawning many parallel completions), vLLM wins on aggregate throughput, often by an order of magnitude under load.
Both are Linux + NVIDIA first. ExLlamaV2 is NVIDIA-only; vLLM also ships ROCm builds for AMD (see the matrix below). Neither is a good fit for Apple Silicon or native Windows.
Quick decision rules
- One user, one NVIDIA GPU, maximum tok/s: ExLlamaV2.
- Any concurrency, a native OpenAI-compatible endpoint, or mature multi-GPU tensor parallelism: vLLM.
- AMD ROCm: vLLM only. Apple Silicon or native Windows: neither.
- Already committed to EXL2 quants: ExLlamaV2, since EXL2 doesn't port to other engines.
Operational matrix
| Dimension | ExLlamaV2 (fast 4-bit/EXL2 inference engine for NVIDIA GPUs) | vLLM (production serving runtime: continuous batching + paged attention) |
|---|---|---|
| Single-stream tok/s (one user at a time, one GPU) | Excellent. Often fastest on consumer NVIDIA at 4-bit. | Strong. Within 10-20% of ExLlamaV2; not the design point. |
| Concurrent serving (multiple users on one rig) | Limited. Sequential by design; not a serving runtime. | Excellent. Continuous batching; the reason most pick vLLM. |
| Quant quality at 4-bit (output quality at small quants) | Excellent. EXL2 quants at 4-4.5 bpw are widely regarded as top-tier. | Strong. AWQ-INT4 / GPTQ; competitive, but EXL2 often wins. |
| Hardware support (GPU types) | Limited. NVIDIA only; Linux + WSL. | Strong. NVIDIA + AMD ROCm. |
| Multi-GPU (splitting models across cards) | Acceptable. Layer split; less polished than vLLM's tensor parallelism. | Excellent. Mature tensor + pipeline parallelism. |
| OpenAI-compatible API (drop-in for existing tools) | Acceptable. Via ExUI or third-party wrappers. | Excellent. Native; the standard (example below the matrix). |
| Maintenance burden (operator hours) | Strong. Few moving parts on a single GPU. | Limited. More config knobs; CUDA + Python version pinning. |
| Community + docs (ecosystem maturity) | Acceptable. Smaller; turboderp-led. | Excellent. Largest LLM serving community. |
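On the API row: vLLM speaks the stock OpenAI surface natively, so existing clients only need a new base URL. A minimal sketch, assuming a server already launched with `vllm serve` on its default port 8000; the model name and prompt are placeholders, not recommendations:

```python
# Minimal client for a local vLLM OpenAI-compatible endpoint.
# Assumes the server was started with something like:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
# which serves http://localhost:8000/v1 by default.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's default bind
    api_key="EMPTY",  # ignored unless the server sets --api-key
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model name
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```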
Failure modes — what breaks first
ExLlamaV2
- Sequential design — concurrency tanks throughput
- Smaller community; sparse Stack Overflow and forum coverage
- Linux + NVIDIA only; no AMD or macOS support
- EXL2 quants don't port to other engines
vLLM
- FlashAttention version-pinning incompatibilities
- Pip dependency conflicts on major releases
- OOM on long contexts when the KV cache isn't pre-sized (see the sketch after this list)
- WSL2 GPU passthrough breakage
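The OOM failure mode usually comes from letting vLLM reserve KV-cache space for a model's full advertised context. A minimal sketch of pre-sizing it with the offline Python engine; the model name and numbers are illustrative, not tuned, and the same caps exist as the `--max-model-len` and `--gpu-memory-utilization` server flags:

```python
# Cap the KV cache up front so long contexts fail fast at startup
# instead of OOMing mid-request. Values here are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    max_model_len=16384,           # don't reserve KV space for the full 128k context
    gpu_memory_utilization=0.90,   # fraction of VRAM vLLM may claim
)

params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Summarize the trade-offs of paged attention."], params)
print(out[0].outputs[0].text)
```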
Editorial verdict
If you're a single user on a single NVIDIA card, ExLlamaV2 is often the fastest path to the most tok/s. The EXL2 quant quality at 4-bit is also widely respected.
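For scale, single-stream generation in ExLlamaV2's Python API looks roughly like the sketch below. It follows the library's basic-generator example, but class and argument names have shifted across releases, so treat it as a sketch rather than a pinned recipe; the model path is a placeholder:

```python
# Single-stream generation with ExLlamaV2's basic generator.
# Mirrors the upstream example; API details drift between versions.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/Llama-3.1-8B-exl2-4.5bpw"  # placeholder EXL2 quant dir
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # allocated as layers load
model.load_autosplit(cache)               # layer-split across available GPUs if needed
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

print(generator.generate_simple("The fastest path to tok/s is", settings, 128))
```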
If you're serving anyone other than yourself, switch to vLLM. ExLlamaV2 isn't a serving runtime — its sequential design means even two concurrent users tank throughput.
Don't pick ExLlamaV2 for an agent that spawns parallel tool calls; its sequential design can't exploit that parallelism. Don't pick vLLM if you only need single-stream speed and the multi-knob config tax isn't worth it.
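To make the agent point concrete: against a vLLM endpoint, parallel tool calls are just concurrent HTTP requests, and continuous batching schedules them together instead of serializing them. A hedged sketch using the same placeholder server and model as above:

```python
# Fan out N completions concurrently; vLLM batches them on the fly.
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def complete(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    return resp.choices[0].message.content

async def main() -> None:
    prompts = [f"Tool call {i}: summarize step {i}." for i in range(8)]
    results = await asyncio.gather(*(complete(p) for p in prompts))
    for r in results:
        print(r)

asyncio.run(main())
```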