Q4_K_M vs Q5_K_M vs Q8_0 vs FP16 — settled with math, not forum folklore. Pick a model + hardware + use case + context length; we compute the memory budget, score every quant against your quality tolerance, and show you the curve.
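For orientation, a minimal sketch of the memory-budget arithmetic, assuming approximate GGUF bits-per-weight figures and a flat runtime overhead; the tool uses catalog values and per-runtime overheads instead.

```python
# Sketch of the memory-budget check, not the tool's actual implementation.
# Bits-per-weight figures are approximate GGUF averages; overhead is a guess.

GIB = 1024**3

# Approximate effective bits per weight for common GGUF quants (illustrative).
BITS_PER_WEIGHT = {"Q2_K": 3.35, "Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.5, "FP16": 16.0}

def weights_gib(params_b: float, quant: str) -> float:
    """Model weights in GiB: parameter count (billions) x bits per weight / 8."""
    return params_b * 1e9 * BITS_PER_WEIGHT[quant] / 8 / GIB

def fits(params_b: float, quant: str, vram_gib: float,
         kv_cache_gib: float, overhead_gib: float = 1.0) -> bool:
    """Weights + KV cache + runtime overhead must fit inside VRAM."""
    return weights_gib(params_b, quant) + kv_cache_gib + overhead_gib <= vram_gib

# Example: 8B model, 4 GiB KV-cache budget, 24 GiB card.
for q in BITS_PER_WEIGHT:
    print(q, "fits" if fits(8, q, 24, 4) else "does not fit")
```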
Quality numbers are community-reported PPL deltas vs FP16 across Llama 3 / Qwen 3 / Mistral families on WikiText-2. Approximations, hedged — see methodology.
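The scoring idea in miniature, with placeholder PPL deltas standing in for the community-reported catalog figures (the numbers below are illustrative, not measurements):

```python
# Illustration of scoring quants against a quality tolerance.
# PPL deltas are placeholders, not the catalog's community-reported values.

PPL_DELTA_VS_FP16 = {  # hypothetical relative perplexity increase vs FP16
    "Q2_K": 0.080, "Q4_K_M": 0.015, "Q5_K_M": 0.006, "Q8_0": 0.001, "FP16": 0.0,
}

def acceptable(quant: str, tolerance: float) -> bool:
    """A quant passes if its PPL delta stays within the user's tolerance."""
    return PPL_DELTA_VS_FP16[quant] <= tolerance

def best_fitting_quant(candidates: list[str], tolerance: float) -> str | None:
    """Among quants that fit in memory, prefer the smallest one within tolerance."""
    passing = [q for q in candidates if acceptable(q, tolerance)]
    # candidates assumed ordered smallest -> largest; first pass wins.
    return passing[0] if passing else None

print(best_fitting_quant(["Q2_K", "Q4_K_M", "Q5_K_M", "Q8_0"], tolerance=0.01))
# -> "Q5_K_M" with these placeholder numbers
```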
URL updates as you change fields — share the result by copying the URL.
Pick a model and hardware to see the recommendation.
We have 183 models and 103 hardware entries in the catalog.
One step back: pick the whole rig + runtime + models. The quant advisor drills into the model picks that the stack builder makes.
If even Q2 doesn't fit, the answer isn't a quant — it's a different GPU.
Paste a real prompt — see what running it on this quant + hardware would cost vs Claude / GPT-5 / Together / Groq. Break-even analysis included.
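A back-of-envelope version of the break-even math, with placeholder electricity, hardware, and API prices; the live comparison uses current provider pricing and the estimated throughput for your quant + hardware.

```python
# Break-even sketch. All prices and rates below are placeholders.

def local_cost_per_mtok(tok_per_s: float, gpu_watts: float,
                        usd_per_kwh: float = 0.30) -> float:
    """Electricity-only cost of generating one million tokens locally."""
    hours = 1e6 / tok_per_s / 3600
    return hours * (gpu_watts / 1000) * usd_per_kwh

def breakeven_tokens(hardware_usd: float, api_usd_per_mtok: float,
                     local_usd_per_mtok: float) -> float:
    """Tokens needed before the hardware cost is recovered vs paying API rates."""
    saving_per_mtok = api_usd_per_mtok - local_usd_per_mtok
    return float("inf") if saving_per_mtok <= 0 else hardware_usd / saving_per_mtok * 1e6

local = local_cost_per_mtok(tok_per_s=60, gpu_watts=300)   # ~$0.42 per million tokens
print(breakeven_tokens(hardware_usd=1600, api_usd_per_mtok=15, local_usd_per_mtok=local))
# ~110 million tokens before the card pays for itself, under these assumptions
```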
Validate the recommendation against your actual full build, CPU + RAM + interconnect included.
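A rough sketch of why the interconnect matters: once weights spill out of VRAM, decode speed is dominated by the slowest link. Bandwidth figures below are placeholders, not measurements.

```python
# Bandwidth-bound decode estimate when a fraction of weights sits in system RAM.
# Per token, every weight is read once; time is split between the VRAM-resident
# portion and the RAM/PCIe-resident portion.

def offload_tok_per_s(model_gib: float, vram_gib: float,
                      vram_bw_gibs: float, sys_bw_gibs: float) -> float:
    on_gpu = min(model_gib, vram_gib)
    off_gpu = max(0.0, model_gib - vram_gib)
    seconds_per_token = on_gpu / vram_bw_gibs + off_gpu / sys_bw_gibs
    return 1 / seconds_per_token

# 5 GiB of a 21 GiB model spilling over a ~25 GiB/s effective link vs fully resident
print(offload_tok_per_s(21, 16, vram_bw_gibs=900, sys_bw_gibs=25))   # ~4.5 tok/s
print(offload_tok_per_s(16, 16, vram_bw_gibs=900, sys_bw_gibs=25))   # ~56 tok/s
```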
The math behind the bars: how PPL deltas + KV cache formulas + tok/s estimation work, with sources for every number.
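In sketch form, the two formulas the bars lean on, assuming an FP16 KV cache and a fixed bandwidth-efficiency factor; the methodology page documents the real constants and sources.

```python
# KV cache size and bandwidth-bound decode speed, back-of-envelope versions.

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache: 2 (K and V) x layers x kv_heads x head_dim x context x dtype bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1024**3

def decode_tok_per_s(weights_gb: float, mem_bw_gbs: float, efficiency: float = 0.7) -> float:
    """Decode is roughly memory-bandwidth bound: every weight is read once per token."""
    return mem_bw_gbs * efficiency / weights_gb

# Llama-3-8B-ish shape: 32 layers, 8 KV heads, head_dim 128, 8k context
print(kv_cache_gib(32, 8, 128, 8192))   # ~1.0 GiB
print(decode_tok_per_s(4.6, 936))       # Q4-ish 8B on a ~936 GB/s card: ~140 tok/s
```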