Server · Open source · Free · 4.8/5

vLLM

High-throughput serving engine. PagedAttention, continuous batching, prefix caching. Production default for self-hosted LLM APIs at scale.

By Fredoline Eruo · Last verified May 6, 2026 · 50,000 GitHub stars

Overview

vLLM is a high-throughput serving engine for large language models. Its core techniques are PagedAttention for KV-cache memory management, continuous batching of incoming requests, and prefix caching for repeated prompt prefixes. It has become a production default for self-hosted LLM APIs at scale.
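
As a quick illustration, here is a minimal offline-inference sketch using vLLM's Python API. It assumes a Linux machine with a supported GPU and vLLM installed (pip install vllm); the model name and sampling settings below are just examples.

    from vllm import LLM, SamplingParams

    # Example model; any Hugging Face model vLLM supports can be swapped in.
    llm = LLM(model="facebook/opt-125m")
    sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    prompts = [
        "The capital of France is",
        "Continuous batching means",
    ]

    # generate() schedules all prompts together; the engine batches requests
    # continuously and manages KV-cache memory with PagedAttention.
    outputs = llm.generate(prompts, sampling)
    for out in outputs:
        print(out.prompt, "->", out.outputs[0].text)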

Pros

  • Best throughput in class
  • OpenAI-compatible API (see the client sketch after this list)
  • Tensor parallelism
  • Speculative decoding
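
Because the API is OpenAI-compatible, existing OpenAI client code can be pointed at a locally running vLLM server. A rough sketch, assuming a server started with something like vllm serve on the default port 8000; the base URL, API key, and model name below are placeholders.

    from openai import OpenAI

    # Point the standard OpenAI client at a local vLLM server.
    # vLLM does not require a real API key unless one is configured.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # whichever model the server loaded
        messages=[{"role": "user", "content": "Say hello in five words."}],
        max_tokens=32,
    )
    print(resp.choices[0].message.content)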

Cons

  • Linux-only
  • GPU-focused (CPU inference works but is slow)
  • Steeper learning curve than Ollama

Compatibility

Operating systems: Linux
GPU backends: NVIDIA CUDA, AMD ROCm, Intel Gaudi, TPU
License: Open source · free


Frequently asked

Is vLLM free?

Yes. vLLM is free to download and use, and it is open source under the permissive Apache 2.0 license.

What operating systems does vLLM support?

vLLM officially supports Linux only.

Which GPUs work with vLLM?

vLLM supports NVIDIA CUDA, AMD ROCm, Intel Gaudi, and TPU backends. CPU-only inference is also possible but slow.
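
Multi-GPU machines can also spread one model across devices using the tensor parallelism noted under Pros. A hedged sketch using the Python API follows; the model name and GPU count are placeholders, and the equivalent flag for the server CLI is --tensor-parallel-size.

    from vllm import LLM, SamplingParams

    # Shard a larger model across two local GPUs via tensor parallelism.
    # tensor_parallel_size and the model name are illustrative only.
    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",
        tensor_parallel_size=2,
    )
    out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
    print(out[0].outputs[0].text)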

Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.