Quantized model produces garbage / never stops generating
Cause
The model is fine; the chat template (or stop tokens) doesn't match what it was trained on. Symptoms: assistant responses run on past the natural stopping point, get into role-play loops, or come out as gibberish from token one.
Two distinct causes:
- Wrong chat template. Llama 3 uses <|start_header_id|>...<|end_header_id|>, ChatML uses <|im_start|>...<|im_end|>, and Mistral uses [INST]...[/INST]. Mix them up and the prompt still looks well-formed to the runner, but the model receives role boundaries it was never trained on.
- EOS token not being respected. Many GGUFs ship with a generic <|endoftext|> while the model was trained to emit <|im_end|> or <|eot_id|>, so the runner never sees the real stop signal and generation runs on.
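To see why this breaks, compare how the same single user turn ("Hello") is rendered under the two formats. This is a simplified sketch of the canonical templates; real templates also handle system prompts and BOS details that vary by model:

Llama 3:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>

ChatML:
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant

A model trained on one format has no reliable way to find the turn boundaries in the other, and the stop token it emits (<|eot_id|> vs <|im_end|>) won't be the one the runner is watching for.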
Solution
1. Verify the chat template in the GGUF (or in tokenizer_config.json for HF models):
# Read the embedded chat template from the GGUF metadata
# (gguf-dump ships with llama.cpp's gguf-py scripts and the gguf pip package)
gguf-dump model.gguf | grep chat_template
# Common formats: chatml, llama2, llama3, mistral, gemma, vicuna, deepseek (these names also work with --chat-template below)
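For HF-format models, the template is stored under the chat_template key in tokenizer_config.json. A quick way to print it, assuming the file is in the current directory:
python3 -c "import json; print(json.load(open('tokenizer_config.json')).get('chat_template', 'MISSING'))"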
If the embedded template is wrong/missing, override at runtime:
./llama-server -m model.gguf --chat-template llama3
2. Check the configured stop tokens. llama.cpp and Ollama derive these from the GGUF, but custom quants may have stripped them. In an Ollama Modelfile, add the real stop tokens explicitly:
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|eot_id|>"
3. Re-pull from a reputable uploader. Check bartowski/<model> or lmstudio-community/<model> on Hugging Face — they ship correct templates and stop tokens. Avoid uploads that only have Q2_K and no Q4 (often hasty conversions).
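For example, to pull only a Q4_K_M file from one of those accounts with huggingface-cli (the repo name and filename pattern are placeholders; check the actual listing first):
huggingface-cli download bartowski/SomeModel-GGUF --include "*Q4_K_M*.gguf" --local-dir ./models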
4. Confirm you're on the right chat format in your client. LM Studio and Open WebUI both offer a per-model "chat template" override; pick the one that matches the model card's documented format.
Did this fix it?
If your case was different, email support@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.