HQQ (Half-Quadratic Quantization)
Also known as: half-quadratic quantization
HQQ (Half-Quadratic Quantization) is a calibration-free quantization method that produces 2-, 3-, 4-, and 8-bit weight quantizations for transformer models without needing a calibration dataset, unlike GPTQ or AWQ. HQQ formulates quantization as a half-quadratic optimization problem and solves directly for the zero-points and scales of each weight group. Because no calibration data is involved, it is fast to apply: quantizing a 7B model takes a few minutes, versus the hours that GPTQ-style methods can need.
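As a point of notation, the sketch below (plain PyTorch, not the hqq library's code) shows the per-group affine quantize/dequantize mapping whose zero-point and scale HQQ solves for; the matrix shape and group size of 64 are illustrative choices.

    import torch

    def quant_dequant(W, scale, zero, nbits=4):
        # Affine quantization: W_q = round(W / scale + zero), clamped to the n-bit range,
        # then mapped back to floating point as W_r = scale * (W_q - zero).
        qmax = 2**nbits - 1
        W_q = torch.clamp(torch.round(W / scale + zero), 0, qmax)
        return scale * (W_q - zero)

    # One scale and zero-point per group of 64 weights, initialized from the group min/max.
    W = torch.randn(4096, 4096).reshape(-1, 64)
    scale = (W.max(dim=1, keepdim=True).values - W.min(dim=1, keepdim=True).values) / 15
    zero = -W.min(dim=1, keepdim=True).values / scale
    error = (W - quant_dequant(W, scale, zero)).abs().mean()  # the reconstruction error HQQ minimizes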
Deeper dive
Most post-training quantization methods (GPTQ, AWQ) use a small calibration dataset to estimate which weights are sensitive and need higher-precision treatment. This works well but adds a tuning step, biases the quantization toward the calibration distribution, and is slow. HQQ skips calibration entirely: it minimizes the weight reconstruction error directly, using a half-quadratic solver with an outlier-tolerant (sparsity-promoting) loss whose subproblems have closed-form updates; a rough sketch of the alternating scheme appears below. The result is competitive perplexity at 4-bit, notably good results at 3- and 2-bit where calibration-based methods often degrade sharply, and a much faster quantization workflow. The tradeoff is that HQQ tends to be slightly worse than calibration-tuned methods at 4-bit on instruction-following tasks where the calibration data closely matches the evaluation distribution, but the gap shrinks at lower bit widths, where overfitting to the calibration set becomes the dominant failure mode.
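The sketch below conveys the flavor of that alternating scheme in plain PyTorch: the residual between the weights and their dequantized version gets a shrinkage (proximal) step, then the zero-point is re-fit in closed form. The specific shrinkage formula, the loss exponent p, and the beta/kappa schedule are illustrative assumptions rather than the hqq package's internals; consult its source for the real implementation.

    import torch

    def shrink_lp(x, beta, p=0.7):
        # Generalized soft-thresholding: an assumed proximal step for an l_p (p < 1) loss,
        # which keeps large residuals (outliers) and pushes small ones toward zero.
        return torch.sign(x) * torch.relu(x.abs() - (p / beta) * x.abs().pow(p - 1))

    def hqq_style_quantize(W, nbits=4, group_size=64, iters=20, beta=10.0, kappa=1.01):
        # Per-group affine quantization with an HQQ-style alternating refinement of the zero-point.
        Wg = W.float().reshape(-1, group_size)
        qmax = 2**nbits - 1
        wmin = Wg.min(dim=1, keepdim=True).values
        wmax = Wg.max(dim=1, keepdim=True).values
        scale = (wmax - wmin).clamp(min=1e-8) / qmax                     # min-max initialization
        zero = -wmin / scale
        for _ in range(iters):
            Wq = torch.clamp(torch.round(Wg / scale + zero), 0, qmax)    # quantize
            Wr = scale * (Wq - zero)                                     # dequantize
            We = shrink_lp(Wg - Wr, beta)                                # prox step on the residual
            zero = (Wq - (Wg - We) / scale).mean(dim=1, keepdim=True)    # closed-form zero-point re-fit
            beta *= kappa                                                # anneal the penalty weight
        return Wq, scale, zero

Nothing in the loop touches activations or calibration batches; the whole procedure runs on the weight tensor alone, which is where the speed and the calibration-free property come from.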
Practical example
An operator wanting to run Llama 3.1 70B on a 24 GB card has to pick a quantization. Q4_K_M GGUF (~40 GB) needs CPU offload. Q3_K_M GGUF (~30 GB) still needs offload. HQQ 2-bit quantization can compress the 70B weights to roughly 20 GB and keep them entirely in VRAM; the quality regression versus 4-bit is measurable but often acceptable for chat-grade workloads. The standalone hqq package implements the method, and Hugging Face Transformers integrates it through its quantization-config mechanism (HqqConfig), so the quantized weights load through the standard AutoModelForCausalLM path.
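A minimal sketch of that loading path, assuming a recent Transformers release with HQQ support and the hqq package installed; the model id and settings are illustrative:

    # On-the-fly HQQ quantization while loading through Transformers.
    # Assumes: pip install transformers hqq accelerate, and enough VRAM for the quantized weights.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

    quant_config = HqqConfig(nbits=2, group_size=64)   # 2-bit weights, per-group parameters

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-70B-Instruct",
        torch_dtype=torch.float16,
        device_map="cuda",
        quantization_config=quant_config,
    )
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")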
Workflow example
Installation is pip install hqq. Quantization is a short Python script: load the checkpoint with the usual Transformers AutoModelForCausalLM, build a quantization config, and pass the loaded model to AutoHQQHFModel.quantize_model (a hedged sketch follows below). Saving and loading the quantized weights go through the package's Hugging Face-style helpers. To serve the quantized model, point an inference path that understands HQQ weights (Hugging Face Transformers with the hqq package installed, or a custom loop) at the saved directory. llama.cpp and Ollama do not load HQQ checkpoints; for those stacks, produce a GGUF quantization with llama.cpp's own tooling instead.
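A sketch of that workflow, assuming the current hqq package layout (BaseQuantizeConfig, AutoHQQHFModel) and an 8B model id picked for illustration; check the package's README for the exact, current API:

    # pip install hqq transformers accelerate
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from hqq.models.hf.base import AutoHQQHFModel
    from hqq.core.quantize import BaseQuantizeConfig

    model_id = "meta-llama/Llama-3.1-8B-Instruct"

    # Load the full-precision checkpoint the normal Transformers way.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # 4-bit weights, one scale/zero-point per group of 64 values.
    quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

    # Quantize in place (no calibration data needed), then save/reload with the HQQ helpers.
    AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                                  compute_dtype=torch.float16, device="cuda")
    AutoHQQHFModel.save_quantized(model, "llama-3.1-8b-hqq-4bit")
    model = AutoHQQHFModel.from_quantized("llama-3.1-8b-hqq-4bit")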
Related terms
GPTQ, AWQ, GGUF, post-training quantization