Quantized model produces garbage / never stops generating
Cause
The model is fine; the chat template (or stop tokens) doesn't match what it was trained on. Symptoms: assistant responses run on past the natural stopping point, get into role-play loops, or come out as gibberish from token one.
Two distinct causes:
- Wrong chat template. Llama 3 uses <|start_header_id|>...<|end_header_id|>, ChatML uses <|im_start|>...<|im_end|>, and Mistral uses [INST]...[/INST]. Mix them up and the prompt still looks well-formed to the runner, but the model receives role boundaries it was never trained on.
- EOS token not being respected. Many GGUFs ship with a generic <|endoftext|> while the model was trained to emit <|im_end|> or <|eot_id|>, so the runner never sees the real stop signal and generation runs on.
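To see why this breaks, compare how the same single user turn ("Hello") is rendered under the two formats. This is a simplified sketch of the canonical templates; real templates also handle system prompts and BOS details that vary by model:

Llama 3:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>

ChatML:
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant

A model trained on one format has no reliable way to find the turn boundaries in the other, and the stop token it emits (<|eot_id|> vs <|im_end|>) won't be the one the runner is watching for.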
Solution
1. Verify the chat template in the GGUF (or in tokenizer_config.json for HF models):
# Read the embedded chat template from the GGUF metadata
# (gguf-dump ships with llama.cpp's gguf-py scripts and the gguf pip package)
gguf-dump model.gguf | grep chat_template
# Common formats: chatml, llama2, llama3, mistral, gemma, vicuna, deepseek (these names also work with --chat-template below)
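For HF-format models, the template is stored under the chat_template key in tokenizer_config.json. A quick way to print it, assuming the file is in the current directory:
python3 -c "import json; print(json.load(open('tokenizer_config.json')).get('chat_template', 'MISSING'))"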
If the embedded template is wrong/missing, override at runtime:
./llama-server -m model.gguf --chat-template llama3
2. Check the configured stop tokens. llama.cpp and Ollama derive these from the GGUF, but custom quants may have stripped them. In an Ollama Modelfile, add the real stop tokens explicitly:
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|eot_id|>"
3. Re-pull from a reputable uploader. Check bartowski/<model> or lmstudio-community/<model> on Hugging Face — they ship correct templates and stop tokens. Avoid uploads that only have Q2_K and no Q4 (often hasty conversions).
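For example, to pull only a Q4_K_M file from one of those accounts with huggingface-cli (the repo name and filename pattern are placeholders; check the actual listing first):
huggingface-cli download bartowski/SomeModel-GGUF --include "*Q4_K_M*.gguf" --local-dir ./models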
4. Confirm you're on the right chat format in your client. LM Studio and Open WebUI both offer a per-model "chat template" override; pick the one that matches the model card's documented format.
Did this fix it?
If your case was different, email support@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.