Q4_0 Quantization
Q4_0 is the original llama.cpp 4-bit quantization format: INT4 weights with one FP16 scale per 32-element block, no zero-point, no importance matrix. Each block stores 16 bytes of packed 4-bit codes plus a 2-byte scale, i.e. 18 bytes per 32 weights, or 4.5 bits per parameter.
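The block layout above can be sketched in a few lines. This is a simplified NumPy model of Q4_0, not llama.cpp's actual C code: the scale is chosen so the largest-magnitude weight maps to code -8 (hence no zero-point), codes are shifted to 0..15 and packed two per byte. Rounding details differ slightly from the reference implementation (llama.cpp truncates after adding 8.5 rather than rounding to nearest even).

```python
import numpy as np

QK4_0 = 32  # Q4_0 block size (weights per block)

def quantize_q4_0_block(x):
    """Quantize one 32-float block to Q4_0: (FP16 scale, 16 packed bytes)."""
    assert x.shape == (QK4_0,)
    # Pick the value with the largest magnitude; scale so it maps to -8.
    m = x[np.argmax(np.abs(x))]
    d = m / -8.0 if m != 0 else 0.0
    inv = 1.0 / d if d != 0 else 0.0
    # Shift signed codes [-8, 7] to unsigned [0, 15]; no zero-point is stored.
    q = np.clip(np.round(x * inv) + 8, 0, 15).astype(np.uint8)
    packed = q[:16] | (q[16:] << 4)  # two 4-bit codes per byte
    return np.float16(d), packed

def dequantize_q4_0_block(d, packed):
    """Reverse: unpack nibbles, undo the +8 shift, multiply by the scale."""
    lo = packed & 0x0F
    hi = packed >> 4
    q = np.concatenate([lo, hi]).astype(np.float32)
    return (q - 8.0) * np.float32(d)
```

Storage per block is 2 bytes (scale) + 16 bytes (codes) = 18 bytes for 32 weights, which is where the 4.5 bits/weight figure comes from.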
Q4_0 has been superseded by Q4_K_M for almost all use cases. It can be faster than K-quants on some older hardware because its flat block layout is simpler to dequantize (K-quants use superblocks with extra sub-scales; the importance matrix, where used, is applied at quantization time, not at inference). Quality is worse, though: perplexity is typically 0.3–0.5 points above FP16, versus 0.1–0.2 for Q4_K_M.
Still seen in the wild because some early GGUF model releases shipped only Q4_0 and Q8_0. If you have a choice, prefer Q4_K_M.
Reviewed by Fredoline Eruo.