Runner · Open source · Free · 4.4/5

ExLlamaV2

GPU-only inference library optimized for consumer NVIDIA cards. Fastest tokens-per-second on a single 24GB card for 30B models in EXL2 quant.

By Fredoline Eruo · Last verified May 6, 2026 · 4,500 GitHub stars

Overview

ExLlamaV2 is a GPU-only inference library optimized for consumer NVIDIA cards. Using its custom EXL2 quantization format, it delivers the fastest tokens-per-second for 30B-class models running on a single 24GB card.
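A rough back-of-the-envelope check shows why a 30B model fits on a 24GB card under EXL2 quantization. This sketch uses only the weights-only approximation (parameters × bits-per-weight ÷ 8); the bpw values shown are illustrative assumptions, and real usage adds KV-cache and activation overhead on top:

```python
# Weights-only VRAM estimate for a quantized model.
# EXL2 stores weights at a chosen average bits-per-weight (bpw),
# so weight memory is roughly params * bpw / 8 bytes.

def weight_vram_gb(n_params: float, bpw: float) -> float:
    """Approximate VRAM for model weights alone, in GB (1e9 bytes)."""
    return n_params * bpw / 8 / 1e9

if __name__ == "__main__":
    for bpw in (4.0, 5.0, 6.0):  # common EXL2 quant levels (illustrative)
        print(f"30B @ {bpw} bpw ~= {weight_vram_gb(30e9, bpw):.1f} GB")
```

At 4.0 bpw the weights alone take about 15 GB, leaving headroom on a 24GB card for the KV cache; at 6.0 bpw (about 22.5 GB) the fit becomes tight.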

Pros

  • Top single-card NVIDIA speed
  • Custom EXL2 quant format
  • Tight memory usage

Cons

  • NVIDIA only
  • EXL2 ecosystem narrower than GGUF

Compatibility

Operating systems
Linux
Windows
GPU backends
NVIDIA CUDA
License
Open source · free

Get ExLlamaV2

Frequently asked

Is ExLlamaV2 free?

Yes. ExLlamaV2 is free to download and use, and it is open source under a permissive license.

What operating systems does ExLlamaV2 support?

ExLlamaV2 supports Linux and Windows.

Which GPUs work with ExLlamaV2?

ExLlamaV2 supports NVIDIA GPUs via CUDA. It is a GPU-only library, so there is no CPU fallback for inference.

Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.