Quantization formats for local AI — GGUF, AWQ, GPTQ, EXL2, FP8, MLX
The cross-runtime quant format reference. What GGUF / AWQ / GPTQ / EXL2 / FP8 / MLX-4bit actually do, when each wins, real quality-vs-VRAM tradeoffs, and how to pick the right one without trusting marketing.
Why quantization exists
A 70B-parameter model in FP16 takes roughly 140 GB of VRAM. No consumer GPU has 140 GB. Quantization is the bridge: it shrinks model weights from FP16 (16 bits per parameter) to 4-8 bits per parameter, with modest, well-characterized quality loss (typically 1-3% on reasoning benchmarks) for a 2-4× memory saving. A Llama 3.3 70B in FP16 needs a pair of H100 80GBs or a datacenter node; in Q4_K_M (~42 GB of weights) it fits across dual RTX 3090s with NVLink (48 GB combined) or on a single 48 GB workstation card, with room left over for KV cache.
The real cost of quantization isn't the quality drop — that's well-characterized at 4-bit and above. The real cost is format fragmentation: every runtime favors a different quant family, and the wrong choice means rebuilding your stack. This guide is the cross-runtime reference for picking the right format for your hardware + runtime + budget.
How to read a quant format name
Quant names look cryptic. The decoder ring:
- Bit-width prefix: Q4 = 4-bit, Q5 = 5-bit, Q8 = 8-bit. FP8 = 8-bit floating-point, INT4 = 4-bit integer.
- Quant family: K_M / K_S = K-quants in llama.cpp (M = medium quality, S = small/aggressive). 0 / 1 = legacy GGML quants. AWQ, GPTQ, EXL2 = format-specific names.
- Group size (where applicable): g32, g64, g128, the granularity at which scale/zero-point parameters are stored. Smaller group = higher quality, larger file.
Common mappings: Q4_K_M ≈ 4.83 bits/parameter average (mixed K-quant), Q5_K_M ≈ 5.69 bits/param, Q8_0 ≈ 8.5 bits/param, AWQ-INT4 ≈ 4.25 bits/param, FP8 = exactly 8 bits/param. The naive "params × 4 bits ÷ 8 = bytes on disk" estimate under-predicts actual file size by ~10-20%, because scales, zero-points, higher-precision tensors, and metadata push the effective bits/param above the nominal figure.
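To turn those bits/param figures into size estimates, a back-of-the-envelope Python sketch (the constants are the approximate averages quoted above; real files add the ~10-20% overhead just mentioned):

```python
# Rough weight-size estimate from parameter count and bits/param.
# Figures are the approximate averages quoted above; real files run ~10-20% larger.
BITS_PER_PARAM = {
    "FP16": 16.0, "FP8": 8.0, "Q8_0": 8.5, "Q6_K": 6.5,
    "Q5_K_M": 5.69, "Q4_K_M": 4.83, "AWQ-INT4": 4.25,
}

def weight_gigabytes(params_billions: float, fmt: str) -> float:
    """Estimated size of the quantized weights in GB (decimal)."""
    bits = BITS_PER_PARAM[fmt]
    return params_billions * 1e9 * bits / 8 / 1e9

for fmt in BITS_PER_PARAM:
    print(f"70B @ {fmt:>8}: ~{weight_gigabytes(70, fmt):.0f} GB")
# 70B @ Q4_K_M lands around ~42 GB of weights -- before KV cache and overhead.
```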
GGUF — the portable default
GGUF (GPT-Generated Unified Format) is the llama.cpp file format and the most-portable quant ecosystem. Single file contains the weights + tokenizer + chat template + metadata. K-quants (Q4_K_M, Q5_K_M, Q8_0) are the modern default; legacy quants (Q4_0, Q4_1) exist for backward compatibility but are operationally inferior.
Strengths: runs on every backend — CPU, NVIDIA CUDA, AMD ROCm, Apple Metal, Vulkan, Intel SYCL, ARM. The same file works across llama.cpp, Ollama, LM Studio, KoboldCPP, llamafile. K-quant kernel coverage in 2026 is excellent across all platforms.
Weaknesses: throughput on production-serving runtimes (vLLM, SGLang) trails AWQ by 15-30% because the K-quant kernel isn't as optimized for continuous batching as the AWQ-Marlin kernel pair. GGUF is operationally the right choice for portability + single-user inference, not for production multi-tenant serving.
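A minimal single-user GGUF sketch via llama-cpp-python, assuming the package is installed and using an illustrative model path (the chat template is read straight from the GGUF metadata):

```python
from llama_cpp import Llama

# Path is illustrative -- any Q4_K_M / Q5_K_M GGUF works the same way.
llm = Llama(
    model_path="./models/llama-3.3-70b-instruct.Q4_K_M.gguf",
    n_gpu_layers=-1,   # offload all layers to the GPU; reduce if VRAM is tight
    n_ctx=8192,        # context window to allocate KV cache for
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "One sentence: what is GGUF?"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```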
AWQ — the production-NVIDIA serving leader
AWQ (Activation-aware Weight Quantization) identifies “salient” weight channels via activation analysis and protects them at higher precision while aggressively quantizing the rest. 4-bit only. NVIDIA-only.
Why AWQ wins for vLLM/SGLang: the AWQ → Marlin kernel is the most-optimized 4-bit serving kernel on H100/Ada/Ampere. With PagedAttention and continuous batching, AWQ-INT4 hits 95-98% of FP16 throughput on Hopper at 25% of the VRAM. That's why every production NVIDIA serving stack defaults to AWQ: it's the throughput-per-VRAM-dollar leader.
Operational caveats: AWQ requires a calibration dataset (default ones in AutoAWQ are usually fine). Some quant variants (e.g. early Llama 4 AWQ checkpoints) have tokenizer-alignment issues with tool calling — verify clean JSON output before committing. AWQ-INT3 exists but the quality drop is meaningful (~5-7% additional vs INT4); stay at INT4 unless VRAM is the absolute constraint.
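A minimal offline-serving sketch with vLLM for an AWQ-INT4 checkpoint; the Hugging Face repo name is a placeholder, and recent vLLM builds pick the Marlin kernel automatically when the quant's group size is compatible:

```python
from vllm import LLM, SamplingParams

# Repo name is a placeholder; any AWQ-INT4 checkpoint follows the same pattern.
llm = LLM(
    model="TheOrg/Llama-3.3-70B-Instruct-AWQ",  # hypothetical HF repo
    quantization="awq",        # explicit; vLLM can also auto-detect from the config
    tensor_parallel_size=2,    # split across two GPUs if one is not enough
    max_model_len=8192,
)

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Return a JSON object with a single key 'ok'."], params)
print(outputs[0].outputs[0].text)  # eyeball JSON cleanliness before committing
```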
GPTQ — the predecessor
GPTQ (Generative Pre-trained Transformer Quantization) was the dominant 4-bit format from 2023-2024. Uses approximate second-order Hessian information to choose quant values. NVIDIA-only.
When GPTQ matters in 2026: the model only ships with GPTQ quants and AWQ isn't available; you're running ExLlamaV2 (which uses GPTQ-derived EXL2 quants); you need 3-bit quantization, below the 4-bit level where AWQ is strongest. For new vLLM deployments, AWQ is generally the better choice — the GPTQ kernel optimization gets less attention from the vLLM team.
EXL2 — the consumer-NVIDIA single-stream leader
EXL2 is the ExLlamaV2 format. Allows fractional bit-rates (e.g. 4.65 bits/weight) by mixing higher-precision weights for important channels with lower-precision elsewhere. NVIDIA-only.
The EXL2 case operationally: fastest single-stream tok/s on consumer NVIDIA. Beats vLLM AWQ by 10-25% per-stream throughput at the same VRAM target. The dual-3090 NVLink + ExLlamaV2 combo is one of the highest single-stream-throughput setups under $2,000 in 2026.
Compatibility cost: EXL2 doesn't support continuous batching the way vLLM does, so multi-tenant concurrency is weaker. EXL2 doesn't run on AMD or Apple. For solo-user or small-team deployments where peak per-stream tok/s matters: EXL2. For multi-user production serving: AWQ via vLLM.
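A minimal client-side sketch against TabbyAPI, which fronts ExLlamaV2 with an OpenAI-compatible API; the port, key, and model name here are assumptions, so check your own TabbyAPI config:

```python
from openai import OpenAI

# TabbyAPI serves an OpenAI-compatible API; host, port, and key are assumptions here.
client = OpenAI(base_url="http://localhost:5000/v1", api_key="tabby-local-key")

resp = client.chat.completions.create(
    model="llama-3.3-70b-exl2-4.65bpw",  # whatever name your server reports
    messages=[{"role": "user", "content": "Summarize EXL2 in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```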
FP8 — the H100 / Ada transformer-engine path
FP8 is 8-bit floating-point quantization that uses the native FP8 tensor cores in NVIDIA's Hopper and Ada Transformer Engine path for hardware-accelerated math (no dequantize-to-FP16 step). Two variants: E4M3 (4 exponent bits, 3 mantissa) favors precision and is the usual choice for weights and activations at inference; E5M2 (5 exponent, 2 mantissa) favors range and is mostly reserved for gradients during training.
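The precision-vs-range split between the two variants shows up directly in their numeric limits; a small sketch, assuming a PyTorch build recent enough to expose the float8 dtypes:

```python
import torch

# E4M3: more mantissa (precision), less range. E5M2: more range, less precision.
# Requires a PyTorch version that ships the float8 dtypes (roughly >= 2.1).
for name, dtype in [("E4M3", torch.float8_e4m3fn), ("E5M2", torch.float8_e5m2)]:
    info = torch.finfo(dtype)
    print(f"{name}: max={info.max:>8.0f}  smallest normal={info.tiny:.2e}  eps={info.eps}")
# E4M3 tops out near 448, E5M2 near 57344: the wider-range variant suits values
# with large dynamic range (e.g. gradients), the narrower one suits weights.
```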
When FP8 wins: H100 / H200 / B100 production deployments where you need quality close to FP16 (FP8 typically hits ~1% quality loss vs FP16, vs ~2% for AWQ-INT4) AND can pay the 2× VRAM cost vs INT4. RTX 4090 and 5090 also support FP8 on the Ada/Blackwell transformer engine but the throughput uplift is smaller than on Hopper.
FP8 doesn't help on AMD or Apple. Some FP8 quants require precise vLLM/TensorRT-LLM versions — verify the quant's manifest before committing.
MLX-4bit / MLX-8bit — the Apple Silicon path
MLX quants are Apple-Silicon-native formats tuned for the Apple unified memory architecture. Different file format (no GGUF compatibility). Used by MLX-LM and MLX Swift.
Operational reality on Apple Silicon: MLX-4bit is the fastest path on M-series Macs for most models. llama.cpp Metal works (and uses GGUF) but trails MLX by 10-30% on tok/s for the same model. The trade-off: MLX-format models don't cross-deploy to Linux/Windows NVIDIA — you maintain dual quant variants if you ship to both Apple and CUDA targets.
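A minimal MLX-LM sketch for an M-series Mac; the mlx-community repo name is illustrative and keyword arguments shift slightly between mlx-lm releases:

```python
from mlx_lm import load, generate

# Repo name is illustrative; mlx-community hosts pre-converted 4-bit variants.
model, tokenizer = load("mlx-community/Llama-3.3-70B-Instruct-4bit")

text = generate(
    model,
    tokenizer,
    prompt="One sentence: why does unified memory help local LLMs?",
    max_tokens=64,
)
print(text)
```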
Apple Foundation Models APIs (iOS 18+) ship pre-quantized variants for Apple Intelligence. These are system-bundled and not directly runnable through MLX-LM — they're Apple's closed-quant pipeline.
Marlin — the high-throughput INT4 kernel
Marlin is technically a CUDA kernel, not a quant format — but it's the kernel that makes AWQ-INT4 fast on vLLM. Mixed-precision (INT4 weights, FP16 activations) with hand-tuned CUDA tensor-core throughput.
You don't pick “Marlin” as a format; you pick AWQ-INT4 + vLLM 0.6+, and Marlin runs under the hood. Knowing it exists matters because: (1) it's why AWQ beats GPTQ on vLLM throughput, (2) Marlin-compatible quants have specific calibration constraints — non-standard group-size AWQ quants may fall back to slower kernels.
Quality vs VRAM tradeoff matrix
Approximate quality loss vs FP16 across formats (editorial estimate, varies by model):
| Format | Bits/param | Quality loss vs FP16 | VRAM savings vs FP16 |
|---|---|---|---|
| FP16 | 16 | baseline | baseline |
| FP8 (E4M3) | 8 | ~1% | ~2× |
| Q8_0 (GGUF) | 8.5 | ~1% | ~1.9× |
| Q6_K (GGUF) | 6.5 | ~1.5% | ~2.5× |
| Q5_K_M (GGUF) | 5.7 | ~2% | ~2.8× |
| AWQ-INT4 | 4.25 | ~2% | ~3.7× |
| Q4_K_M (GGUF) | 4.83 | ~2.5% | ~3.3× |
| EXL2 4.65bpw | 4.65 | ~2.5% | ~3.4× |
| GPTQ-INT4 | 4.25 | ~3-5% | ~3.7× |
| Q3_K_M (GGUF) | 3.9 | ~5-8% | ~4.1× |
| AWQ-INT3 | 3.25 | ~7-10% | ~5× |
| Q2_K (GGUF) | 3.0 | ~15%+ | ~5.3× |
Operator rule of thumb: Q4_K_M / AWQ-INT4 is the knee of the curve — most VRAM savings for acceptable quality loss. Going below Q4 is rarely worth it on coding / reasoning workloads. Going above Q5 is only worth it when you have VRAM to burn and need every point of MMLU / GPQA quality.
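Weights aren't the whole budget: KV cache grows with context length. A rough feasibility check, using illustrative Llama-3-70B-style config values (80 layers, 8 KV heads via GQA, head dim 128, FP16 cache):

```python
def fits(params_b: float, bits_per_param: float, vram_gb: float,
         ctx: int = 8192, n_layers: int = 80, n_kv_heads: int = 8,
         head_dim: int = 128, kv_bytes: int = 2) -> bool:
    """Very rough: quantized weights + FP16 KV cache vs. a VRAM budget."""
    weights_gb = params_b * 1e9 * bits_per_param / 8 / 1e9
    # K and V, per layer, per KV head, per head dim, per token in the context.
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * ctx / 1e9
    return weights_gb + kv_gb <= vram_gb * 0.95  # leave ~5% for overhead

print(fits(70, 4.83, vram_gb=48))   # Q4_K_M 70B on dual 3090s: just fits
print(fits(70, 4.83, vram_gb=24))   # same model on a single 24 GB card: no
```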
Compatibility matrix by runtime
See the full runtime compatibility matrix for cross-axes. Format coverage:
| Runtime | GGUF | AWQ | GPTQ | EXL2 | FP8 | MLX |
|---|---|---|---|---|---|---|
| llama.cpp / Ollama / LM Studio | ✓ | · | · | · | · | · |
| vLLM | · | ✓ | ✓ | · | ✓ | · |
| SGLang | · | ✓ | ✓ | · | ✓ | · |
| ExLlamaV2 / TabbyAPI | · | · | ✓ | ✓ | · | · |
| TensorRT-LLM | · | ✓ | · | · | ✓ | · |
| MLX-LM / MLX Swift | · | · | · | · | · | ✓ |
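If you script deployment checks, the matrix is small enough to encode directly; a sketch that mirrors the table above (update it as runtime support evolves):

```python
# Format support per runtime, mirroring the table above.
SUPPORTED = {
    "llama.cpp":    {"GGUF"},
    "Ollama":       {"GGUF"},
    "LM Studio":    {"GGUF"},
    "vLLM":         {"AWQ", "GPTQ", "FP8"},
    "SGLang":       {"AWQ", "GPTQ", "FP8"},
    "ExLlamaV2":    {"GPTQ", "EXL2"},
    "TabbyAPI":     {"GPTQ", "EXL2"},
    "TensorRT-LLM": {"AWQ", "FP8"},
    "MLX-LM":       {"MLX"},
}

def can_run(runtime: str, fmt: str) -> bool:
    return fmt in SUPPORTED.get(runtime, set())

assert can_run("vLLM", "AWQ") and not can_run("vLLM", "GGUF")
```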
How to pick: decision tree
- What hardware? Apple Silicon → MLX-4bit. AMD GPU → GGUF Q4_K_M (ROCm path). Intel Arc / NPU → ONNX Runtime quants. NVIDIA → continue.
- Multi-user serving (vLLM/SGLang)? → AWQ-INT4. The throughput-per-VRAM-dollar leader. FP8 only if H100 + 1% quality matters more than 2× VRAM cost.
- Solo user, max single-stream tok/s on consumer NVIDIA? → EXL2 via ExLlamaV2 + TabbyAPI.
- Cross-platform portability matters? → GGUF Q4_K_M. Runs on every backend.
- VRAM is brutally tight? → drop to Q3_K_M or AWQ-INT3 with a quality hit. Don't go lower without measuring quality on your specific workload.
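The same decision tree as a Python sketch for provisioning scripts; the labels are illustrative and only the branches above are encoded:

```python
def pick_format(hardware: str, multi_user: bool = False,
                max_single_stream: bool = False, portable: bool = False,
                vram_tight: bool = False) -> str:
    """Encodes the decision tree above. Labels are illustrative, not exhaustive."""
    if hardware == "apple":
        return "MLX-4bit"
    if hardware == "amd":
        return "GGUF Q4_K_M (ROCm)"
    if hardware == "intel":
        return "ONNX Runtime quants"
    # NVIDIA from here on.
    if multi_user:
        return "AWQ-INT4 (vLLM / SGLang)"
    if max_single_stream:
        return "EXL2 (ExLlamaV2 + TabbyAPI)"
    if vram_tight:
        return "Q3_K_M or AWQ-INT3 (measure quality first)"
    if portable:
        return "GGUF Q4_K_M"
    return "GGUF Q4_K_M"

print(pick_format("nvidia", multi_user=True))   # AWQ-INT4 (vLLM / SGLang)
```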
Quant-related failure modes
- Tokenizer alignment broken on new quants. Some early-cycle AWQ checkpoints emit malformed JSON for tool calls. Test with a tool-calling prompt before committing to a quant variant.
- FP8 quant pipeline mismatches. Some FP8 checkpoints require specific vLLM / TensorRT-LLM versions. Verify the quant's manifest.
- MTP head silently dropped during quantization. DeepSeek V4's multi-token-prediction head can be lost during AWQ conversion if the pipeline doesn't handle it. Test throughput post-quant — losing MTP costs ~1.8× decode speed.
- Quant + flash-attention version conflict. FA2 vs FA3 vs FlashInfer have subtly different kernel requirements; some quants compile only with specific FA versions. Pin all three.
- K-quant kernel coverage gaps on AMD/Vulkan. Some K-quants (especially Q3 variants) lack optimized kernels on AMD ROCm or Vulkan. Stick to Q4_K_M / Q5_K_M / Q8_0 for non-CUDA backends until verified.
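For the first failure mode above, the cheapest pre-deployment check is to request structured output and parse it strictly; a sketch against any OpenAI-compatible local endpoint (host, port, and model name are assumptions):

```python
import json
from openai import OpenAI

# Endpoint and model name are assumptions -- point this at your own server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

resp = client.chat.completions.create(
    model="local-quant-under-test",
    messages=[{"role": "user", "content":
               'Reply with only a JSON object: {"city": "...", "country": "..."} for Paris.'}],
    temperature=0,
    max_tokens=64,
)

raw = resp.choices[0].message.content
try:
    json.loads(raw)          # strict parse; malformed JSON fails loudly here
    print("quant passed the JSON sanity check")
except json.JSONDecodeError:
    print("malformed JSON -- re-check the quant before committing:", raw)
```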
Going deeper
- Quantization glossary entry — the broader concept
- GGUF glossary entry
- AWQ glossary entry
- EXL2 glossary entry
- Runtime compatibility matrix — full cross-axis
- Setup path-finder — runtime + quant by your hardware
- Multi-GPU buying guide