Q4_K_M vs Q5_K_M vs Q8_0 vs FP16 — settled with math, not forum folklore. Pick a model + hardware + use case + context length; we compute the memory budget, score every quant against your quality tolerance, and show you the curve.
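For orientation, a minimal sketch of the memory-budget arithmetic, assuming approximate GGUF bits-per-weight figures and a flat runtime overhead; the tool uses catalog values and per-runtime overheads instead.

```python
# Sketch of the memory-budget check, not the tool's actual implementation.
# Bits-per-weight figures are approximate GGUF averages; overhead is a guess.

GIB = 1024**3

# Approximate effective bits per weight for common GGUF quants (illustrative).
BITS_PER_WEIGHT = {"Q2_K": 3.35, "Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.5, "FP16": 16.0}

def weights_gib(params_b: float, quant: str) -> float:
    """Model weights in GiB: parameter count (billions) x bits per weight / 8."""
    return params_b * 1e9 * BITS_PER_WEIGHT[quant] / 8 / GIB

def fits(params_b: float, quant: str, vram_gib: float,
         kv_cache_gib: float, overhead_gib: float = 1.0) -> bool:
    """Weights + KV cache + runtime overhead must fit inside VRAM."""
    return weights_gib(params_b, quant) + kv_cache_gib + overhead_gib <= vram_gib

# Example: 8B model, 4 GiB KV-cache budget, 24 GiB card.
for q in BITS_PER_WEIGHT:
    print(q, "fits" if fits(8, q, 24, 4) else "does not fit")
```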
Quality numbers are community-reported PPL deltas vs FP16 across Llama 3 / Qwen 3 / Mistral families on WikiText-2. Approximations, hedged — see methodology.
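The scoring idea in miniature, with placeholder PPL deltas standing in for the community-reported catalog figures (the numbers below are illustrative, not measurements):

```python
# Illustration of scoring quants against a quality tolerance.
# PPL deltas are placeholders, not the catalog's community-reported values.

PPL_DELTA_VS_FP16 = {  # hypothetical relative perplexity increase vs FP16
    "Q2_K": 0.080, "Q4_K_M": 0.015, "Q5_K_M": 0.006, "Q8_0": 0.001, "FP16": 0.0,
}

def acceptable(quant: str, tolerance: float) -> bool:
    """A quant passes if its PPL delta stays within the user's tolerance."""
    return PPL_DELTA_VS_FP16[quant] <= tolerance

def best_fitting_quant(candidates: list[str], tolerance: float) -> str | None:
    """Among quants that fit in memory, prefer the smallest one within tolerance."""
    passing = [q for q in candidates if acceptable(q, tolerance)]
    # candidates assumed ordered smallest -> largest; first pass wins.
    return passing[0] if passing else None

print(best_fitting_quant(["Q2_K", "Q4_K_M", "Q5_K_M", "Q8_0"], tolerance=0.01))
# -> "Q5_K_M" with these placeholder numbers
```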
URL updates as you change fields — share the result by copying the URL.
Pick a model and hardware to see the recommendation.
We have 183 models and 103 hardware entries in the catalog.
One step back: pick the whole rig + runtime + models. The quant advisor drills into the model picks that the stack builder makes.
If even Q2 doesn't fit, the answer isn't a quant — it's a different GPU.
Paste a real prompt — see what running it on this quant + hardware would cost vs Claude / GPT-5 / Together / Groq. Break-even analysis included.
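A back-of-envelope version of the break-even math, with placeholder electricity, hardware, and API prices; the live comparison uses current provider pricing and the estimated throughput for your quant + hardware.

```python
# Break-even sketch. All prices and rates below are placeholders.

def local_cost_per_mtok(tok_per_s: float, gpu_watts: float,
                        usd_per_kwh: float = 0.30) -> float:
    """Electricity-only cost of generating one million tokens locally."""
    hours = 1e6 / tok_per_s / 3600
    return hours * (gpu_watts / 1000) * usd_per_kwh

def breakeven_tokens(hardware_usd: float, api_usd_per_mtok: float,
                     local_usd_per_mtok: float) -> float:
    """Tokens needed before the hardware cost is recovered vs paying API rates."""
    saving_per_mtok = api_usd_per_mtok - local_usd_per_mtok
    return float("inf") if saving_per_mtok <= 0 else hardware_usd / saving_per_mtok * 1e6

local = local_cost_per_mtok(tok_per_s=60, gpu_watts=300)   # ~$0.42 per million tokens
print(breakeven_tokens(hardware_usd=1600, api_usd_per_mtok=15, local_usd_per_mtok=local))
# ~110 million tokens before the card pays for itself, under these assumptions
```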
Validate the recommendation against your actual full build, CPU + RAM + interconnect included.
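A rough sketch of why the interconnect matters: once weights spill out of VRAM, decode speed is dominated by the slowest link. Bandwidth figures below are placeholders, not measurements.

```python
# Bandwidth-bound decode estimate when a fraction of weights sits in system RAM.
# Per token, every weight is read once; time is split between the VRAM-resident
# portion and the RAM/PCIe-resident portion.

def offload_tok_per_s(model_gib: float, vram_gib: float,
                      vram_bw_gibs: float, sys_bw_gibs: float) -> float:
    on_gpu = min(model_gib, vram_gib)
    off_gpu = max(0.0, model_gib - vram_gib)
    seconds_per_token = on_gpu / vram_bw_gibs + off_gpu / sys_bw_gibs
    return 1 / seconds_per_token

# 5 GiB of a 21 GiB model spilling over a ~25 GiB/s effective link vs fully resident
print(offload_tok_per_s(21, 16, vram_bw_gibs=900, sys_bw_gibs=25))   # ~4.5 tok/s
print(offload_tok_per_s(16, 16, vram_bw_gibs=900, sys_bw_gibs=25))   # ~56 tok/s
```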
The math behind the bars: how PPL deltas + KV cache formulas + tok/s estimation work, with sources for every number.
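In sketch form, the two formulas the bars lean on, assuming an FP16 KV cache and a fixed bandwidth-efficiency factor; the methodology page documents the real constants and sources.

```python
# KV cache size and bandwidth-bound decode speed, back-of-envelope versions.

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache: 2 (K and V) x layers x kv_heads x head_dim x context x dtype bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1024**3

def decode_tok_per_s(weights_gb: float, mem_bw_gbs: float, efficiency: float = 0.7) -> float:
    """Decode is roughly memory-bandwidth bound: every weight is read once per token."""
    return mem_bw_gbs * efficiency / weights_gb

# Llama-3-8B-ish shape: 32 layers, 8 KV heads, head_dim 128, 8k context
print(kv_cache_gib(32, 8, 128, 8192))   # ~1.0 GiB
print(decode_tok_per_s(4.6, 936))       # Q4-ish 8B on a ~936 GB/s card: ~140 tok/s
```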