Quantization quality loss — when the quant is the problem
A drop in output quality after quantization usually means the bits-per-weight (bpw) is too aggressive, the KV cache is quantized too low, or the calibration data didn't match your workload. Q4_K_M is the safe floor; anything below it needs care.
Diagnostic order — most likely first
Quantization tier too aggressive (Q3, Q2, IQ2)
Model produces incoherent text, repetitive loops, or off-topic responses. Worse than the FP16 baseline.
Bump up: Q3 → Q4_K_M is the cleanest jump. IQ2 → Q4_K_M nearly always improves. Q4_K_M is the modern standard floor — anything below that risks quality.
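If you build quants yourself rather than downloading them, the jump up a tier is a single re-quantization step. A minimal sketch using llama.cpp's `llama-quantize`; the filenames are placeholders, and you should always quantize from the original F16/F32 GGUF, never from an existing low-bit quant:

```
# Re-quantize from the full-precision GGUF (not from a Q2/Q3 file).
# "model-f16.gguf" is a placeholder for your unquantized conversion.
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M

# Quick sanity check that the new quant generates coherent text.
./llama-cli -m model-Q4_K_M.gguf -p "Explain KV caching in one paragraph." -n 128
```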
KV cache quantized too aggressively
Long-context coherence drops: output starts strong, then degrades past roughly 4K tokens. Quantizing the KV cache to Q4 hurts attention precision more than weight quantization does.
Use FP16 or Q8 KV cache, not Q4. In llama.cpp: `--cache-type-k q8_0 --cache-type-v q8_0` (Q8 is the comfortable cache floor). Don't quantize KV below Q8 for long context.
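A minimal launch sketch with an 8-bit KV cache, assuming a recent llama.cpp build and placeholder filenames:

```
# Serve with a Q8 KV cache instead of an over-aggressive Q4.
# Depending on the build, quantizing the V cache may also require
# flash attention to be enabled (see --flash-attn in your build's help).
./llama-server -m model-Q4_K_M.gguf \
  -c 16384 -ngl 99 \
  --cache-type-k q8_0 --cache-type-v q8_0
```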
Wrong quantization for the model architecture
Some models (Mixture-of-Experts architectures, models with large shared embedding layers) lose more quality at the same bpw than dense models do; output quality drops disproportionately.
Use a higher bpw for MoE and other non-standard architectures. Qwen3 235B-A22B (MoE) needs Q5+ for stability; dense Llama 70B is fine at Q4_K_M.
Calibration data mismatch (AWQ / GPTQ specific)
AWQ / GPTQ quants calibrated on English text underperform on code, multilingual, or specialized domains.
Find a quant calibrated on relevant data. Or fall back to GGUF Q4_K_M (calibration-free). For code workloads, prefer code-calibrated quants.
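If you stay in GGUF, the closest analogue to re-calibrating is an importance-matrix (imatrix) quant built from text that looks like your workload. A sketch assuming llama.cpp's `llama-imatrix` tool and a placeholder calibration file (a few MB of representative code, for a code workload):

```
# Build an importance matrix from domain-relevant text (calib-code.txt is a placeholder).
./llama-imatrix -m model-f16.gguf -f calib-code.txt -o imatrix.dat

# Quantize using that matrix; imatrix helps most at lower bpw tiers.
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-IQ4_XS.gguf IQ4_XS
```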
Comparing to a fine-tune you didn't actually quantize
You're running a quantized base model and expecting fine-tune behavior. The fine-tune is its own set of weights (or an adapter on top of the base); if the quant wasn't made from the fine-tune, its behavior simply isn't there.
Check the model card. Many GGUF / EXL2 repos quantize the base, not the fine-tune. Find a quant of the specific fine-tune you want, or quantize it yourself.
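If no quant of the fine-tune exists, making your own GGUF is usually two commands. A sketch assuming the fine-tune is a Hugging Face-format checkpoint and you have the llama.cpp repo checked out (the converter script's name and options vary somewhat across versions; the paths below are placeholders):

```
# Convert the fine-tuned checkpoint (not the base!) to a full-precision GGUF.
python convert_hf_to_gguf.py /path/to/my-finetune --outtype f16 --outfile finetune-f16.gguf

# Then quantize that GGUF to your target tier.
./llama-quantize finetune-f16.gguf finetune-Q4_K_M.gguf Q4_K_M
```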
Frequently asked questions
What's the safe minimum quantization for production?
Q4_K_M (GGUF) or 4.0 bpw (EXL2). Below this, quality degrades enough to be noticeable on adversarial prompts. Q5_K_M is the comfort zone for high-stakes work; Q8 is essentially lossless.
How do I measure quantization quality objectively?
Run perplexity on a held-out test set (`./llama-perplexity`) — lower is better. Compare your quant's PPL to the FP16 baseline; >1% increase is meaningful. For chat models, qualitative testing on diverse prompts is essential.
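A sketch of the comparison, assuming you have both the F16 and quantized GGUFs plus a held-out text file (wiki.test.raw is the conventional WikiText-2 split; any representative corpus works):

```
# Baseline perplexity on the unquantized model...
./llama-perplexity -m model-f16.gguf -f wiki.test.raw

# ...then the same file on the quant. Compare the final PPL numbers:
# a rise of more than ~1% over the F16 baseline is worth worrying about.
./llama-perplexity -m model-Q4_K_M.gguf -f wiki.test.raw
```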
GGUF vs EXL2 vs AWQ — which has best quality at the same bpw?
Roughly equivalent at 4.0+ bpw. Below 4.0, EXL2's calibration tends to outperform GGUF Q3. AWQ uses activation-aware calibration that helps on specific architectures. For most users, GGUF Q4_K_M is the practical default.
Related troubleshooting
Mid-inference crashes (segfault, illegal memory access, kernel panic) usually mean VRAM ECC, thermal throttling, PSU instability, or a bad model file. Here's the diagnostic order.
When llama.cpp / Ollama outputs garbled text or repeats tokens infinitely, the tokenizer baked into the GGUF doesn't match the runtime's expectations. Here's how to confirm and fix.
ExLlamaV2 load failures trace to wrong model format (needs EXL2 or EXL3, not GGUF), insufficient cache for context, or a driver/runtime version mismatch. The exl2 format is non-negotiable.
When the fix is hardware
A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: