Q8_0 Quantization
Q8_0 is llama.cpp's simplest 8-bit GGUF quantization: weights stored as INT8 with one FP16 scale per 32-element block and no zero-point. Each block is 34 bytes (32 INT8 values plus a 2-byte scale), which works out to 8.5 bits per parameter.
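The per-block scheme can be sketched in a few lines. This is a minimal illustration of symmetric absmax quantization as Q8_0 uses it, not llama.cpp's actual implementation: the real code packs blocks into a fixed byte layout and stores the scale as FP16, which this sketch ignores. The function names are made up for the example.

```python
BLOCK_SIZE = 32  # Q8_0 quantizes weights in blocks of 32

def quantize_q8_0_block(block):
    """Quantize one block of 32 floats: scale = absmax/127, no zero-point."""
    amax = max(abs(v) for v in block)
    d = amax / 127.0 if amax else 0.0       # FP16 scale in the real format
    inv = 1.0 / d if d else 0.0
    q = [max(-127, min(127, round(v * inv))) for v in block]  # INT8 values
    return d, q

def dequantize_q8_0_block(d, q):
    """Reconstruct approximate floats: x' = d * q."""
    return [d * v for v in q]

# 32 int8 bytes + 2-byte scale = 34 bytes per block -> 8.5 bits/weight
bits_per_weight = (BLOCK_SIZE + 2) * 8 / BLOCK_SIZE
```

Because the scale is chosen per 32-value block rather than per tensor, a single outlier weight only degrades the precision of its own block, which is a large part of why Q8_0 tracks FP16 so closely.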
Q8_0 is the "near-lossless" tier — perplexity is typically within 0.01 of FP16 on standard benchmarks. The cost is size: a 7B model is ~7.6 GB and a 70B is ~75 GB, only ~46% smaller than FP16. For most local-AI hardware, Q8_0 is overkill; Q5_K_M or Q4_K_M deliver 95%+ of the quality at half the memory.
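The size figures above follow directly from the 8.5 bits/weight rate. A rough estimator (my own helper, not a llama.cpp function); real GGUF files run slightly larger because some tensors, such as embeddings, may be kept at higher precision:

```python
def q8_0_size_gb(n_params):
    """Approximate Q8_0 file size in GB from parameter count."""
    # 8.5 bits per weight: 32 INT8 values + one FP16 scale per block
    return n_params * 8.5 / 8 / 1e9

def fp16_size_gb(n_params):
    """FP16 baseline: 16 bits per weight."""
    return n_params * 2 / 1e9

# 7B: ~7.4 GB at Q8_0 vs ~14 GB at FP16, roughly a 47% reduction
savings = 1 - q8_0_size_gb(7e9) / fp16_size_gb(7e9)
```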
When to actually pick Q8_0: when you're benchmarking quantization impact and need a tight upper bound on quality, or when the model still fits under your VRAM ceiling at Q8_0 and you want maximum fidelity.