Quantized model produces garbage / never stops generating

(no error — output is incoherent, repeats, or generates until max tokens)
By Fredoline Eruo · Last verified May 8, 2026

Cause

The model is fine; the chat template (or stop tokens) doesn't match what it was trained on. Symptoms: assistant responses run on past the natural stopping point, get into role-play loops, or come out as gibberish from token one.

Two distinct causes:

  • Wrong chat template. Llama 3 uses <|start_header_id|>...<|end_header_id|>; ChatML uses <|im_start|>...<|im_end|>; Mistral uses [INST]...[/INST]. Mix them up and the runner still looks like it's doing instruction-following, but the model receives garbled turn boundaries (see the rendered examples after this list).
  • EOS token not being respected. Many GGUFs ship with a generic <|endoftext|> while the model was trained to emit <|im_end|> or <|eot_id|>. The runner doesn't see the real stop signal.
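
To make the mismatch concrete, here is the same one-turn exchange rendered under each of the three formats (illustrative; token spellings follow the respective model cards):

# Llama 3
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>

# ChatML
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant

# Mistral
<s>[INST] Hello [/INST]

Feed the Llama 3 rendering to a ChatML-trained model and every special token arrives as ordinary text, which is exactly the gibberish-from-token-one failure mode.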

Solution

1. Verify the chat template in the GGUF (or in tokenizer_config.json for HF models):

# llama.cpp: the embedded template is printed in the model-load log
./llama-cli -m model.gguf --prompt "test" 2>&1 | grep -i chat_template
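
Alternatively, the gguf Python package from the llama.cpp repo ships a metadata dump tool; a minimal sketch, assuming pip install gguf works in your environment:

pip install gguf
gguf-dump model.gguf | grep -i chat_template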

If the embedded template is wrong/missing, override at runtime:

# Named templates include: chatml, llama2, llama3, mistral, gemma, vicuna, deepseek
./llama-server -m model.gguf --chat-template llama3
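
To confirm the override took, hit the server's OpenAI-compatible endpoint and check that the reply ends on its own (this assumes llama-server's default port 8080; adjust if you pass --port):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hi and stop."}]}'

A finish_reason of "length" instead of "stop" in the response means the stop token still isn't being honored.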

2. Check the configured stop tokens. llama.cpp and Ollama derive these from the GGUF, but custom quants may have stripped them. In an Ollama Modelfile:

PARAMETER stop "<|im_end|>"
PARAMETER stop "<|eot_id|>"
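
Those PARAMETER lines live in a Modelfile alongside the FROM line; a minimal sketch (file and model names here are placeholders):

# Modelfile
FROM ./model.gguf
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|eot_id|>"

Then rebuild and test:

ollama create my-model -f Modelfile
ollama run my-model "Say hi and stop."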

3. Re-pull from a reputable uploader. Check bartowski/<model> or lmstudio-community/<model> on Hugging Face — they ship correct templates and stop tokens. Avoid uploads that only have Q2_K and no Q4 (often hasty conversions).
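
To fetch a specific quant from the command line, huggingface-cli can filter by filename pattern; a sketch with placeholder repo and file names:

# <model> is a placeholder; substitute the real repo name
huggingface-cli download bartowski/<model>-GGUF --include "*Q4_K_M.gguf" --local-dir ./models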

4. Confirm you're on the right chat format in your client. LM Studio, Open WebUI, and most other front ends have a per-model "chat template" override; pick the one that matches the format documented on the model card.

Related errors

  • Model loaded but tokenizer vocab size mismatch
  • TypeError: 'NoneType' object is not subscriptable in tokenizer
  • Model produces gibberish or repeats one token forever
  • OSError: Can't load tokenizer for ... / no file named tokenizer.json

Did this fix it?

If your case was different, email support@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.