Quantization quality loss — when the quant is the problem
A drop in output quality after quantization usually means the bits-per-weight (bpw) is too aggressive, the KV cache is quantized too low, or the calibration data didn't match your workload. Q4_K_M is the safe floor; anything below it needs care.
Diagnostic order — most likely first
Quantization tier too aggressive (Q3, Q2, IQ2)
Model produces incoherent text, repetitive loops, or off-topic responses. Worse than the FP16 baseline.
Bump up: Q3 → Q4_K_M is the cleanest jump. IQ2 → Q4_K_M nearly always improves. Q4_K_M is the modern standard floor — anything below that risks quality.
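If you build quants yourself rather than downloading them, the jump up a tier is a single re-quantization step. A minimal sketch using llama.cpp's `llama-quantize`; the filenames are placeholders, and you should always quantize from the original F16/F32 GGUF, never from an existing low-bit quant:

```
# Re-quantize from the full-precision GGUF (not from a Q2/Q3 file).
# "model-f16.gguf" is a placeholder for your unquantized conversion.
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M

# Quick sanity check that the new quant generates coherent text.
./llama-cli -m model-Q4_K_M.gguf -p "Explain KV caching in one paragraph." -n 128
```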
KV cache quantized too aggressively
Long-context coherence drops: output starts strong, then degrades past roughly 4K tokens. Quantizing the KV cache to Q4 hurts attention precision more than weight quantization does.
Use FP16 or Q8 KV cache, not Q4. In llama.cpp: `--cache-type-k q8_0 --cache-type-v q8_0` (Q8 is the comfortable cache floor). Don't quantize KV below Q8 for long context.
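A minimal launch sketch with an 8-bit KV cache, assuming a recent llama.cpp build and placeholder filenames:

```
# Serve with a Q8 KV cache instead of an over-aggressive Q4.
# Depending on the build, quantizing the V cache may also require
# flash attention to be enabled (see --flash-attn in your build's help).
./llama-server -m model-Q4_K_M.gguf \
  -c 16384 -ngl 99 \
  --cache-type-k q8_0 --cache-type-v q8_0
```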
Wrong quantization for the model architecture
Some models (Mixture-of-Experts architectures, models with large shared embedding layers) lose more quality at the same bpw than dense models do; output quality drops disproportionately.
Use a higher bpw for MoE and other non-standard architectures. Qwen3 235B-A22B (MoE) needs Q5+ for stability; dense Llama 70B is fine at Q4_K_M.
Calibration data mismatch (AWQ / GPTQ specific)
AWQ / GPTQ quants calibrated on English text underperform on code, multilingual, or specialized domains.
Find a quant calibrated on relevant data. Or fall back to GGUF Q4_K_M (calibration-free). For code workloads, prefer code-calibrated quants.
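If you stay in GGUF, the closest analogue to re-calibrating is an importance-matrix (imatrix) quant built from text that looks like your workload. A sketch assuming llama.cpp's `llama-imatrix` tool and a placeholder calibration file (a few MB of representative code, for a code workload):

```
# Build an importance matrix from domain-relevant text (calib-code.txt is a placeholder).
./llama-imatrix -m model-f16.gguf -f calib-code.txt -o imatrix.dat

# Quantize using that matrix; imatrix helps most at lower bpw tiers.
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-IQ4_XS.gguf IQ4_XS
```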
Comparing to a fine-tune you didn't actually quantize
You're running a quantized base model and expecting fine-tune behavior. The fine-tune is its own set of weights (or an adapter on top of the base); if the quant wasn't made from the fine-tune, its behavior simply isn't there.
Check the model card. Many GGUF / EXL2 repos quantize the base, not the fine-tune. Find a quant of the specific fine-tune you want, or quantize it yourself.
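If no quant of the fine-tune exists, making your own GGUF is usually two commands. A sketch assuming the fine-tune is a Hugging Face-format checkpoint and you have the llama.cpp repo checked out (the converter script's name and options vary somewhat across versions; the paths below are placeholders):

```
# Convert the fine-tuned checkpoint (not the base!) to a full-precision GGUF.
python convert_hf_to_gguf.py /path/to/my-finetune --outtype f16 --outfile finetune-f16.gguf

# Then quantize that GGUF to your target tier.
./llama-quantize finetune-f16.gguf finetune-Q4_K_M.gguf Q4_K_M
```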
Frequently asked questions
What's the safe minimum quantization for production?
Q4_K_M (GGUF) or 4.0 bpw (EXL2). Below this, quality degrades enough to be noticeable on adversarial prompts. Q5_K_M is the comfort zone for high-stakes work; Q8 is essentially lossless.
How do I measure quantization quality objectively?
Run perplexity on a held-out test set (`./llama-perplexity`) — lower is better. Compare your quant's PPL to the FP16 baseline; >1% increase is meaningful. For chat models, qualitative testing on diverse prompts is essential.
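A sketch of the comparison, assuming you have both the F16 and quantized GGUFs plus a held-out text file (wiki.test.raw is the conventional WikiText-2 split; any representative corpus works):

```
# Baseline perplexity on the unquantized model...
./llama-perplexity -m model-f16.gguf -f wiki.test.raw

# ...then the same file on the quant. Compare the final PPL numbers:
# a rise of more than ~1% over the F16 baseline is worth worrying about.
./llama-perplexity -m model-Q4_K_M.gguf -f wiki.test.raw
```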
GGUF vs EXL2 vs AWQ — which has best quality at the same bpw?
Roughly equivalent at 4.0+ bpw. Below 4.0, EXL2's calibration tends to outperform GGUF Q3. AWQ uses activation-aware calibration that helps on specific architectures. For most users, GGUF Q4_K_M is the practical default.
Related troubleshooting
Mid-inference crashes (segfault, illegal memory access, kernel panic) usually mean VRAM ECC, thermal throttling, PSU instability, or a bad model file. Here's the diagnostic order.
When llama.cpp / Ollama outputs garbled text or repeats tokens infinitely, the tokenizer baked into the GGUF doesn't match the runtime's expectations. Here's how to confirm and fix.
ExLlamaV2 load failures trace to wrong model format (needs EXL2 or EXL3, not GGUF), insufficient cache for context, or a driver/runtime version mismatch. The exl2 format is non-negotiable.
When the fix is hardware
A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: