Quantization tradeoffs
FP16 vs Q8 vs Q5 vs Q4 vs Q3 vs Q2
Quantization compresses model weights by reducing bits per parameter. Less memory, more speed, less quality. The tradeoff curve isn't linear — Q8 is nearly free; Q4 is the production sweet spot; Q3 starts breaking; Q2 is mostly a toy.
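To make the memory and speed claims concrete, here is a back-of-the-envelope sizing sketch in Python. The effective bits-per-weight figures and the 1 TB/s bandwidth number are assumptions chosen to roughly match the table below, not measurements; real GGUF files carry extra metadata and K-quants mix bit widths, so treat the output as a sanity check.

```python
# Back-of-the-envelope sizing: weight bytes per quant level and a rough
# upper bound on single-stream decode speed in the memory-bandwidth-bound regime.
# All constants below are illustrative assumptions, not measurements.

PARAMS = 70e9          # 70B-parameter model (assumption for illustration)
GPU_BW_GBS = 1000      # ~1 TB/s memory bandwidth, e.g. a high-end card (assumption)

# Approximate effective bits per weight, including scale/zero-point overhead.
BITS_PER_WEIGHT = {
    "FP16":   16.0,
    "Q8_0":    8.5,
    "Q5_K_M":  5.7,
    "Q4_K_M":  4.85,
    "Q3_K_M":  3.9,
    "Q2_K":    2.6,
}

for name, bits in BITS_PER_WEIGHT.items():
    size_gb = PARAMS * bits / 8 / 1e9
    # Every generated token streams all weights once when decode is memory-bound,
    # so bandwidth / model size gives an upper bound on tok/s for batch size 1.
    tok_s = GPU_BW_GBS / size_gb
    print(f"{name:>7}: ~{size_gb:5.0f} GB weights, <= ~{tok_s:4.0f} tok/s ceiling")
```

The tok/s column is only a ceiling: it assumes decode is purely weight-bandwidth-bound and ignores KV-cache traffic, batching, and kernel overhead.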
| Dimension | FP16/BF16 (reference) | Q8/INT8 (near-lossless) | Q5 (quality tilt) | Q4/INT4 (default deploy) | Q3 (aggressive) | Q2 (last resort) |
|---|---|---|---|---|---|---|
| Memory footprint (approx. weight size for a 70B-parameter model) | Limited: ≈140 GB. Multi-GPU only. | Acceptable: ≈70 GB. Two-card 48 GB territory. | Strong: ≈48 GB. Fits a single 48 GB card or a two-card split. | Excellent: ≈40 GB. The 70B-on-one-pro-card sweet spot. | Excellent: ≈32 GB. Squeezes 70B onto a consumer card with headroom. | Excellent: ≈22 GB. Fits, but quality is uneven. |
| Output quality (instruction-following; operator-perceivable quality on chat / coding tasks) | Excellent: Reference. Anything else is measured against this. | Excellent: Within 1-2% of FP16 on most benchmarks; effectively lossless for chat. | Strong: Quality holds. Some perceptible drift on complex coding. | Strong: Production-acceptable for chat/RAG. K-quants (Q4_K_M) are noticeably better than the older Q4_0. | Acceptable: Visible degradation; usable for casual chat, not for code. | Limited: Hallucinations and reasoning collapse are common; experimental tier. |
| Decode speed (tok/s; lower-bit weights decode faster in the memory-bound regime) | Acceptable: Slowest tier; memory-bandwidth bound even on compute-rich GPUs. | Strong: ≈1.5-1.8x FP16 on consumer hardware. | Strong: ≈2-2.4x FP16; the K-quant kernels are well optimized. | Excellent: ≈2.5-3x FP16; the dominant production tier. | Excellent: ≈3-3.5x FP16 where kernels exist; gains are often outweighed by quality loss. | Excellent: Fastest, but that only matters if quality is acceptable for the task. |
| Runtime support (which engines have stable kernels) | Excellent: Universal. | Excellent: All major runtimes. | Strong: GGUF and ExLlamaV2; AWQ is 4-bit only. | Excellent: GGUF Q4_K_M, AWQ-INT4, GPTQ, EXL2 4 bpw. | Acceptable: GGUF and EXL2; mainstream kernel coverage is incomplete. | Limited: GGUF only; experimental. |
| Multi-GPU friendliness (how well the quant splits across cards) | Strong: Tensor-parallel native. | Strong: Tensor parallelism works in vLLM. | Acceptable: Layer-split is common; tensor parallelism less universal. | Strong: AWQ-INT4 and GPTQ work with vLLM tensor parallelism. | Limited: Layer-split only on most engines. | Limited: Layer-split only. |
| Operator recommendation (what we'd default to for a given goal) | Limited: Avoid unless your hardware has the VRAM and you need reference quality. | Strong: Best for reproducing leaderboard numbers; preserves quality at 2x the memory cost. | Strong: Quality-tilted production deploy; worth it when output is read by humans. | Excellent: Default for 90% of operators. Production quality at sane VRAM (serving sketch below). | Acceptable: When you must squeeze in; expect coding regression. | Limited: Toy / experimental. Don't ship it. |
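For the Q4 "default deploy" column, a minimal serving sketch with vLLM and an AWQ-INT4 checkpoint looks roughly like this. The model path is a placeholder and the exact argument set varies by vLLM version, so treat it as a starting point rather than a recipe.

```python
# Minimal sketch: serving a 70B AWQ-INT4 checkpoint with vLLM tensor parallelism.
# The model path is a hypothetical placeholder; verify flags against your vLLM version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/llama-70b-awq",   # placeholder AWQ-INT4 checkpoint
    quantization="awq",               # use vLLM's AWQ INT4 kernels
    tensor_parallel_size=2,           # split across two 48 GB-class GPUs
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Explain K-quants in two sentences."], params)
print(outputs[0].outputs[0].text)
```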
Operator defaults
- If quality matters (coding, agents, knowledge work): start at Q5 or Q8. Drop to Q4 only after measuring against your task.
- If you're VRAM-bound: Q4_K_M is the universal default. Don't go below Q4 without testing on YOUR workload.
- If you're shipping a product: measure quality on representative inputs at Q8 vs Q5 vs Q4 and pick the smallest one your users won't notice (a minimal harness for this follows below). Don't take leaderboards at face value; perceived chat quality is workload-specific.
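Here is a minimal sketch of that measurement step, assuming llama-cpp-python as the runtime and GGUF builds of the same model at each quant level; the file names and prompts are placeholders to replace with your own.

```python
# Minimal sketch: run the same representative prompts through Q8 / Q5 / Q4 builds
# of one model and dump the outputs side by side for review.
# File names and prompts below are placeholders, not real artifacts.
import json
from llama_cpp import Llama

QUANTS = {
    "Q8_0":   "model-q8_0.gguf",
    "Q5_K_M": "model-q5_k_m.gguf",
    "Q4_K_M": "model-q4_k_m.gguf",
}

# Replace with prompts sampled from your real workload, not a leaderboard.
PROMPTS = [
    "Summarize the attached incident report in three bullet points.",
    "Write a Python function that merges two sorted lists.",
]

results = {}
for name, path in QUANTS.items():
    llm = Llama(model_path=path, n_ctx=4096, n_gpu_layers=-1, verbose=False)
    results[name] = [
        llm(p, max_tokens=256, temperature=0)["choices"][0]["text"]
        for p in PROMPTS
    ]
    del llm  # drop the handle so the previous model can be freed before the next load

# Review (or score) the outputs and ship the smallest quant that passes.
print(json.dumps(results, indent=2))
```

Greedy decoding (temperature 0) keeps the comparison deterministic, so any differences you see come from the quantization rather than sampling noise.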