
Frameworks & tools

ExLlamaV2

Also known as: exllama-v2, exllamav2 runtime

ExLlamaV2 is a high-performance inference engine for Llama-family models, optimized for GPU execution. It achieves faster token generation through custom CUDA attention kernels and low-bit quantization, using its own EXL2 format alongside support for GPTQ models. Operators encounter ExLlamaV2 when they need maximum throughput on a single GPU, especially with quantized models, as it often outperforms llama.cpp and Hugging Face Transformers in tokens per second.

Deeper dive

ExLlamaV2 is a rewrite of the original ExLlama, focused on efficiency for Llama-based architectures (including Llama 2, Llama 3, Mistral, and CodeLlama). Its key innovations are fused attention kernels that cut memory-bandwidth overhead and the EXL2 quantization format, which packs weights at mixed bit rates (roughly 2–8 bits per weight) with minimal accuracy loss; it also loads standard GPTQ checkpoints. The runtime supports dynamic batching, flash attention, and split-GPU inference for multi-GPU setups. Unlike llama.cpp, which is CPU-first with GPU offload, ExLlamaV2 is GPU-first and requires the whole model to fit in VRAM. It is commonly used through text-generation-webui (Oobabooga) and can be integrated directly via the exllamav2 Python package. Operators choose ExLlamaV2 when they have a single high-VRAM GPU (e.g., an RTX 3090 with 24 GB) and want the fastest possible inference for 7B–30B parameter models at 4-bit quantization.
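
A minimal Python sketch of driving ExLlamaV2 directly through the exllamav2 package, modeled on the project's upstream inference examples; the model path is hypothetical and class names such as ExLlamaV2BaseGenerator can differ between releases, so treat it as orientation rather than a drop-in recipe:

```python
# Single-GPU generation sketch with the exllamav2 package (API per the
# project's inference examples; verify names against your installed version).
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/Llama-3-8B-exl2-4.5bpw"  # hypothetical local path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)   # KV cache allocated as layers load
model.load_autosplit(cache)                # fills VRAM across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

print(generator.generate_simple("Explain KV caching in one sentence.", settings, 128))
```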

Practical example

On an RTX 3090 (24 GB VRAM), running Llama 3 8B at 4-bit with ExLlamaV2 achieves roughly 100–120 tok/s, compared to ~60–80 tok/s with llama.cpp GPU offload. A 30B-class model at 4-bit (~16 GB of weights) still fits entirely in VRAM and runs at ~40–50 tok/s, while llama.cpp would need to offload layers to system RAM, dropping to ~10 tok/s.
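
The arithmetic behind those figures is easy to sanity-check. The sketch below is a rough estimate only: packed weight size plus an FP16 KV cache, with illustrative layer counts and KV widths; real usage adds activation buffers and runtime overhead.

```python
# Back-of-the-envelope VRAM arithmetic for quantized models (illustrative only).

def quantized_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Packed weight size: 1e9 params * bits / 8 bits-per-byte = GB."""
    return params_billion * bits_per_weight / 8

def kv_cache_gb(n_layers: int, ctx_len: int, kv_dim: int, bytes_per_value: int = 2) -> float:
    """FP16 K and V tensors for every layer at the given context length."""
    return 2 * n_layers * ctx_len * kv_dim * bytes_per_value / 1e9

# Llama 3 8B (32 layers, GQA KV width ~1024) at ~4.5 bits per weight:
print(quantized_weights_gb(8, 4.5))   # ~4.5 GB of weights
print(kv_cache_gb(32, 8192, 1024))    # ~1.1 GB of cache at 8k context

# A 30B-class model at 4 bits per weight: ~15 GB of weights before any cache,
# which is why it still fits on a 24 GB card with room for context.
print(quantized_weights_gb(30, 4.0))  # ~15 GB
```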

Workflow example

In text-generation-webui, operators select the ExLlamaV2 loader from the Model tab, then choose a quantized model (e.g., TheBloke/Llama-2-13B-GPTQ). The UI shows VRAM usage and tokens per second, and operators can lower max_seq_len to fit the context window within VRAM. If VRAM is insufficient, ExLlamaV2 fails with a CUDA out-of-memory error rather than spilling to system RAM; the fix is a smaller quantization, a shorter context, or splitting the model across GPUs.
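
Outside the UI, the equivalent knob is a field on the config object in the Python API; a small sketch, assuming the max_seq_len attribute from the upstream examples and a hypothetical model path:

```python
from exllamav2 import ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/models/Llama-2-13B-GPTQ"  # hypothetical local path
config.prepare()

# Cap the context so the KV cache fits beside ~7 GB of 4-bit weights on a
# 24 GB card; exceeding VRAM surfaces as a CUDA out-of-memory error.
config.max_seq_len = 4096
```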

Related terms

  • Quantization
  • EXL2
  • text-generation-webui (Oobabooga)

Reviewed by Fredoline Eruo. See our editorial policy.
