Q2_K or Q3 quantized model produces nonsense
Cause
Q2_K is too aggressive for most models below ~30B parameters. 2-bit quantization degrades quality severely enough that the model turns incoherent: it reads as fluent but says nothing meaningful, botches arithmetic, and contradicts itself.
For 7B-13B models, Q4_K_M is the practical floor and Q3_K_M is borderline. Q2_K is only usable on 70B+ models, where there is enough redundancy in the weights to absorb the quality loss.
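A quick way to confirm which quantization you are actually running (the tag below is an example; substitute whatever ollama list shows on your machine):
# Recent Ollama versions print a "quantization" line in the model details
ollama show llama3.1:8b-instruct-q2_K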
Solution
Drop to Q4_K_M minimum for any model under 30B:
ollama pull llama3.1:8b-instruct-q4_K_M
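To sanity-check the upgrade, run the same prompt against both quants; a Q2_K 8B will often fumble even simple arithmetic like this (the prompt is an arbitrary example):
# Compare answers across quantization levels (17 * 24 = 408)
ollama run llama3.1:8b-instruct-q2_K "What is 17 * 24? Answer with the number only."
ollama run llama3.1:8b-instruct-q4_K_M "What is 17 * 24? Answer with the number only."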
For 70B-class models where you legitimately need Q2_K to fit on consumer hardware, expect a noticeable quality drop in the following areas (a spot-check sketch follows the list):
- Multi-step reasoning (math, planning)
- Code generation correctness
- Strict instruction following
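One way to probe the damage is a reasoning prompt against the local Ollama REST API. This is a minimal sketch, assuming the default port 11434; the model tag and prompt are placeholders, so adjust them to whatever you have pulled:
# Multi-step reasoning probe; read the reply and check the arithmetic yourself
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.3:70b-instruct-q2_K",
  "prompt": "A train leaves at 3:40 pm and the trip takes 2 h 35 min. When does it arrive? Think step by step.",
  "stream": false
}'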
Better alternative for tight VRAM: an MoE model. Qwen 3 30B-A3B at Q4_K_M (18 GB) outperforms Llama 70B at Q2_K (24 GB) on most tasks, because the MoE's weights stay at 4-bit precision instead of being crushed to 2-bit.
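Those sizes follow from rough per-weight arithmetic: Q4_K_M averages about 4.85 bits per weight and Q2_K roughly 2.7, so 30B × 4.85 / 8 ≈ 18 GB and 70B × 2.7 / 8 ≈ 24 GB. To try the MoE route (the exact tag is an assumption; check the Ollama library for the current Qwen 3 listings):
# Tag is a guess; verify against https://ollama.com/library/qwen3
ollama pull qwen3:30b-a3b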
Or use CPU offload instead of aggressive quantization:
# Llama 3.3 70B at Q4_K_M: 30 of 80 layers on the GPU, the remaining 50 on CPU
./main -m llama-3.3-70b.Q4_K_M.gguf --n-gpu-layers 30
Slower (~12 tok/s instead of 35) but coherent.
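Note that newer llama.cpp builds ship the binary as llama-cli instead of main; the flag is unchanged:
./llama-cli -m llama-3.3-70b.Q4_K_M.gguf --n-gpu-layers 30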
Did this fix it?
If your case was different, email hello@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.