Q4_0 Quantization
Q4_0 is the original llama.cpp 4-bit quantization format: INT4 weights with one FP16 scale per 32-element block, no zero-point, no importance matrix. Each block stores 16 bytes of packed 4-bit codes plus a 2-byte scale, i.e. 18 bytes per 32 weights, or 4.5 bits per parameter.
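The block layout above can be sketched in a few lines. This is a simplified NumPy model of Q4_0, not llama.cpp's actual C code: the scale is chosen so the largest-magnitude weight maps to code -8 (hence no zero-point), codes are shifted to 0..15 and packed two per byte. Rounding details differ slightly from the reference implementation (llama.cpp truncates after adding 8.5 rather than rounding to nearest even).

```python
import numpy as np

QK4_0 = 32  # Q4_0 block size (weights per block)

def quantize_q4_0_block(x):
    """Quantize one 32-float block to Q4_0: (FP16 scale, 16 packed bytes)."""
    assert x.shape == (QK4_0,)
    # Pick the value with the largest magnitude; scale so it maps to -8.
    m = x[np.argmax(np.abs(x))]
    d = m / -8.0 if m != 0 else 0.0
    inv = 1.0 / d if d != 0 else 0.0
    # Shift signed codes [-8, 7] to unsigned [0, 15]; no zero-point is stored.
    q = np.clip(np.round(x * inv) + 8, 0, 15).astype(np.uint8)
    packed = q[:16] | (q[16:] << 4)  # two 4-bit codes per byte
    return np.float16(d), packed

def dequantize_q4_0_block(d, packed):
    """Reverse: unpack nibbles, undo the +8 shift, multiply by the scale."""
    lo = packed & 0x0F
    hi = packed >> 4
    q = np.concatenate([lo, hi]).astype(np.float32)
    return (q - 8.0) * np.float32(d)
```

Storage per block is 2 bytes (scale) + 16 bytes (codes) = 18 bytes for 32 weights, which is where the 4.5 bits/weight figure comes from.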
Q4_0 has been superseded by Q4_K_M for almost all use cases. It can be faster than K-quants on some older hardware because its flat block layout is simpler to dequantize (K-quants use superblocks with extra sub-scales; the importance matrix, where used, is applied at quantization time, not at inference). Quality is worse, though: perplexity is typically 0.3–0.5 points above FP16, versus 0.1–0.2 for Q4_K_M.
Still seen in the wild because some early GGUF model releases shipped only Q4_0 and Q8_0. If you have a choice, prefer Q4_K_M.
Reviewed by Fredoline Eruo.