ExLlamaV2
Also known as: exllama-v2, exllamav2 runtime
ExLlamaV2 is a high-performance inference engine for Llama-family models, optimized for GPU execution. It achieves fast token generation through custom fused CUDA attention kernels and quantized weights, supporting its own EXL2 format (variable bit-widths of roughly 2 to 8 bits per weight) as well as 4-bit GPTQ models. Operators encounter ExLlamaV2 when they need maximum throughput on a single GPU, especially with quantized models, as it often outperforms llama.cpp and Hugging Face Transformers in tokens per second.
Deeper dive
ExLlamaV2 is a rewrite of the original ExLlama, focused on efficient inference for Llama-based architectures (including Llama 2, Llama 3, Mistral, and CodeLlama). Its key innovations are fused attention kernels that reduce memory-bandwidth overhead and the EXL2 quantization format, which packs weights at variable bit-widths (roughly 2 to 8 bits per weight) with minimal accuracy loss; it also runs standard 4-bit GPTQ models. The runtime supports dynamic batching, flash attention, and splitting a model across multiple GPUs. Unlike llama.cpp, which is CPU-first with GPU offload, ExLlamaV2 is GPU-first and requires the entire model to fit in VRAM. It is commonly used through text-generation-webui (Oobabooga) and can be integrated directly via the exllamav2 Python package. Operators choose ExLlamaV2 when they have a single high-VRAM GPU (e.g., an RTX 3090 with 24 GB) and want the fastest possible inference for 7B-30B parameter models at around 4-bit quantization.
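For direct integration, the sketch below shows the basic load-and-generate pattern modeled on the exllamav2 package's example scripts (installable with pip install exllamav2). The model directory is a hypothetical placeholder, and exact class and argument names can differ between releases, so treat this as an outline rather than a definitive API reference.

# Minimal ExLlamaV2 load-and-generate sketch (Python).
# Model path is a placeholder; API details may vary by exllamav2 version.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/Llama-3-8B-exl2-4.0bpw"  # hypothetical local path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)   # KV cache sized from config.max_seq_len
model.load_autosplit(cache)                # splits layers across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

print(generator.generate_simple("ExLlamaV2 is", settings, 64))  # 64 new tokens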
Practical example
On an RTX 3090 (24GB VRAM), running Llama 3 8B at 4-bit with ExLlamaV2 achieves roughly 100-120 tokens/second, compared to ~60-80 tok/s with llama.cpp using full GPU offload. For a 30B-class model at 4-bit (~16GB of weights), the model still fits entirely in VRAM and ExLlamaV2 runs it at ~40-50 tok/s, whereas llama.cpp would need to offload some layers to system RAM, dropping to ~10 tok/s.
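As a back-of-the-envelope check on those VRAM figures, weight memory is roughly parameter count times bits per weight. The small calculation below uses that arithmetic with illustrative numbers and deliberately ignores the KV cache and runtime overhead, which add a few more gigabytes on top.

# Rough weight-memory estimate for quantized models (illustrative only;
# excludes KV cache, activations, and framework overhead).
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_gb(8, 4.0))    # ~4 GB  -> Llama 3 8B at 4-bit fits easily in 24 GB
print(weight_gb(30, 4.0))   # ~15 GB -> a 30B model at 4-bit still fits on a 3090
print(weight_gb(30, 16.0))  # ~60 GB -> the same model unquantized would not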
Workflow example
In text-generation-webui, operators select the ExLlamaV2 loader on the Model tab and choose a GPTQ- or EXL2-quantized model (e.g., TheBloke/Llama-2-13B-GPTQ). The UI reports VRAM usage and tokens/second. Operators can lower max_seq_len to fit the context window within VRAM. If VRAM is insufficient, loading typically fails with an out-of-memory error; ExLlamaV2 does not fall back to CPU offload the way llama.cpp can.
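To see why lowering max_seq_len helps, note that the KV cache grows linearly with context length. The sketch below estimates its size using Llama-2-13B-like dimensions (40 layers, 40 KV heads of dimension 128, FP16 cache); these are assumed values for illustration, not figures reported by the loader.

# Approximate KV-cache size as a function of max_seq_len (illustrative dims).
def kv_cache_gb(max_seq_len: int, n_layers: int = 40, n_kv_heads: int = 40,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    # Factor of 2 covers both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * max_seq_len / 1e9

print(kv_cache_gb(4096))  # ~3.4 GB
print(kv_cache_gb(2048))  # ~1.7 GB -> halving max_seq_len halves the cache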
Related terms
llama.cpp, GPTQ, quantization, text-generation-webui, Hugging Face Transformers