Q2_K or Q3 quantized model produces nonsense
Cause
Q2_K is too aggressive for most models below ~30B parameters. 2-bit quantization degrades quality severely enough that the model turns incoherent: it reads as fluent but says nothing meaningful, botches arithmetic, and contradicts itself.
For 7B-13B models, Q4_K_M is the practical floor and Q3_K_M is borderline. Q2_K is only usable on 70B+ models, where there is enough redundancy in the weights to absorb the quality loss.
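A quick way to confirm which quantization you are actually running (the tag below is an example; substitute whatever ollama list shows on your machine):
# Recent Ollama versions print a "quantization" line in the model details
ollama show llama3.1:8b-instruct-q2_K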
Solution
Drop to Q4_K_M minimum for any model under 30B:
ollama pull llama3.1:8b-instruct-q4_K_M
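To sanity-check the upgrade, run the same prompt against both quants; a Q2_K 8B will often fumble even simple arithmetic like this (the prompt is an arbitrary example):
# Compare answers across quantization levels (17 * 24 = 408)
ollama run llama3.1:8b-instruct-q2_K "What is 17 * 24? Answer with the number only."
ollama run llama3.1:8b-instruct-q4_K_M "What is 17 * 24? Answer with the number only."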
For 70B-class models where you legitimately need Q2_K to fit on consumer hardware, expect a noticeable quality drop in the following areas (a spot-check sketch follows the list):
- Multi-step reasoning (math, planning)
- Code generation correctness
- Strict instruction following
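One way to probe the damage is a reasoning prompt against the local Ollama REST API. This is a minimal sketch, assuming the default port 11434; the model tag and prompt are placeholders, so adjust them to whatever you have pulled:
# Multi-step reasoning probe; read the reply and check the arithmetic yourself
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.3:70b-instruct-q2_K",
  "prompt": "A train leaves at 3:40 pm and the trip takes 2 h 35 min. When does it arrive? Think step by step.",
  "stream": false
}'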
Better alternative for tight VRAM: an MoE model. Qwen 3 30B-A3B at Q4_K_M (18 GB) outperforms Llama 70B at Q2_K (24 GB) on most tasks, because the MoE's weights stay at 4-bit precision instead of being crushed to 2-bit.
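Those sizes follow from rough per-weight arithmetic: Q4_K_M averages about 4.85 bits per weight and Q2_K roughly 2.7, so 30B × 4.85 / 8 ≈ 18 GB and 70B × 2.7 / 8 ≈ 24 GB. To try the MoE route (the exact tag is an assumption; check the Ollama library for the current Qwen 3 listings):
# Tag is a guess; verify against https://ollama.com/library/qwen3
ollama pull qwen3:30b-a3b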
Or use CPU offload instead of aggressive quantization:
# Llama 3.3 70B at Q4_K_M: 30 of 80 layers on the GPU, the remaining 50 on CPU
./main -m llama-3.3-70b.Q4_K_M.gguf --n-gpu-layers 30
Slower (~12 tok/s instead of 35) but coherent.
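Note that newer llama.cpp builds ship the binary as llama-cli instead of main; the flag is unchanged:
./llama-cli -m llama-3.3-70b.Q4_K_M.gguf --n-gpu-layers 30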
Did this fix it?
If your case was different, email hello@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.