RUNLOCALAI · v38

Operator-grade instrument for local-AI hardware intelligence. Hand-written verdicts. Real benchmarks. Reproducible commands.

OP·Fredoline Eruo
Training & optimization

HQQ (Half-Quadratic Quantization)

Also known as: half-quadratic quantization

HQQ (Half-Quadratic Quantization) is a calibration-free quantization method that produces 2-, 3-, 4-, and 8-bit weight quantizations for transformer models without needing a calibration dataset, unlike GPTQ or AWQ. HQQ formulates quantization as a half-quadratic optimization: the per-group scale is fixed from the weights' min/max range, and the solver optimizes the zero-point to minimize a sparsity-promoting norm of the dequantization error. Because no calibration data is involved, HQQ is fast to apply — quantizing a 7B model takes a few minutes versus the hours that GPTQ-style methods can need.

Deeper dive

Most post-training quantization methods (GPTQ, AWQ) use a small calibration dataset to estimate which weights are sensitive and need higher-precision treatment. This works well but adds a tuning step, biases the quantization toward the calibration distribution, and is slow. HQQ skips calibration entirely: it solves an analytic optimization that minimizes the quantization error directly on the weights themselves, using a half-quadratic splitting algorithm. The result is competitive perplexity at 4-bit, especially strong results at 3- and 2-bit where calibration-based methods often degrade sharply, and a much faster quantization workflow. The tradeoff: at 4-bit, HQQ tends to trail calibration-tuned methods slightly on instruction-following tasks whose eval distribution closely matches the calibration data. That gap shrinks at lower bit widths, where overfitting to the calibration set becomes the dominant failure mode.
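The alternating solver described above can be sketched in a few lines of NumPy. This is an illustrative single-group toy, not the hqq package's implementation: the hyperparameters (p, beta, kappa) and the exact shrinkage form are my own simplified assumptions, and the real library operates on grouped GPU tensors across every linear layer.

```python
import numpy as np

def hqq_quantize_group(w, nbits=4, iters=20, p=0.7, beta=1.0, kappa=1.01):
    """Calibration-free quantization of one 1-D weight group via a
    half-quadratic-style alternating solver (illustrative sketch)."""
    qmax = 2 ** nbits - 1
    s = (w.max() - w.min()) / qmax     # scale: fixed from the group's range
    z = -w.min() / s                   # zero-point: the variable being optimized
    for _ in range(iters):
        wq = np.clip(np.round(w / s + z), 0, qmax)   # quantize with current z
        # half-quadratic split: shrink the dequantization error toward a
        # sparsity-promoting l_p (p < 1) prior via generalized soft-thresholding
        e = w - s * (wq - z)
        e = np.sign(e) * np.maximum(
            np.abs(e) - (p / beta) * (np.abs(e) + 1e-8) ** (p - 1), 0.0)
        z = np.mean(wq - (w - e) / s)  # closed-form zero-point update
        beta *= kappa                  # anneal the penalty weight upward
    wq = np.clip(np.round(w / s + z), 0, qmax)
    return wq.astype(np.uint8), s, z

def dequantize(wq, s, z):
    """Reconstruct approximate weights from quantized codes."""
    return s * (wq.astype(np.float32) - z)
```

Round-tripping a random weight group through this loop reconstructs the weights to within quantization noise; no data ever flows through the model, which is the whole point of the calibration-free approach.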

Practical example

An operator wanting to run Llama 3.1 70B on a 24 GB card has to pick a quantization. Q4_K_M GGUF (40 GB) needs CPU offload. Q3_K_M GGUF (30 GB) still needs offload. HQQ 2-bit quantization can compress 70B to ~20 GB and stay entirely in VRAM — the quality regression vs. Q4 is measurable but often acceptable for chat-grade workloads. The standalone hqq package implements the method, and Hugging Face Transformers integrates it (via HqqConfig), so quantized weights load through the standard AutoModelForCausalLM path.
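The VRAM arithmetic behind that example is simple enough to sanity-check. A minimal sketch — the 1.1 overhead factor for group scales, zero-points, and non-quantized layers (embeddings, norms) is my own rough assumption, not a measured figure:

```python
def quantized_weight_gb(params_billions, nbits, overhead=1.1):
    """Approximate in-VRAM weight footprint of a quantized model.

    overhead ~1.1 loosely accounts for per-group scales/zero-points and
    layers kept in higher precision -- a rough assumption, not measured.
    """
    return params_billions * nbits / 8 * overhead

# 70B at 2-bit fits a 24 GB card; at 4-bit it does not
print(round(quantized_weight_gb(70, 2), 1))   # ~19.2 GB
print(round(quantized_weight_gb(70, 4), 1))   # ~38.5 GB
```

Note this counts weights only; KV cache and activation buffers eat further into the 24 GB budget, which is why the ~20 GB figure is close to the practical ceiling.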

Workflow example

Installation is pip install hqq. Quantizing a checkpoint takes a few lines of Python: load the model with AutoHQQHFModel.from_pretrained('meta-llama/Llama-3.1-8B-Instruct') (from hqq.models.hf.base), build a config with BaseQuantizeConfig(nbits=4, group_size=64) (from hqq.core.quantize), and pass it to AutoHQQHFModel.quantize_model(). Saving and loading use the matching save_quantized / from_quantized helpers. To use the quantized model in production, point any HF-compatible inference engine (Hugging Face Transformers, TGI, or a custom loop) at the saved directory; for llama.cpp / Ollama-style workflows, convert to GGUF after quantization rather than expecting native HQQ-format support.

Related terms

  • Quantization
  • AWQ
  • GPTQ

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides
  • Best GPU for local AI →
When it doesn't work
  • Quantization quality loss →
  • GGUF tokenizer mismatch →