Token
A token is the smallest unit of text a language model processes. Most modern models use subword tokenization, where common words map to single tokens and rare words split into multiple subtokens. As a rough rule, 1 token ≈ 0.75 English words ≈ 4 characters.
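To make the word-to-token mapping concrete, here is a minimal sketch using OpenAI's tiktoken library; the encoding name and the sample sentence are illustrative choices, not tied to any specific model.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4-era models (illustrative choice)
enc = tiktoken.get_encoding("cl100k_base")

text = "Subword tokenizers split uncommon words like anthropomorphization apart."
ids = enc.encode(text)

print(f"{len(text.split())} words -> {len(ids)} tokens")
# Decode each id individually to see where the subword boundaries fall
print([enc.decode([i]) for i in ids])
```

Common words come back as single pieces, while the rare word is split into several subtokens.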
Tokens matter because models are billed (in cloud APIs) and constrained (by context-window size) per token, not per word. A 100K-token context fits roughly 75,000 English words. Code typically consumes more tokens per character than prose, because whitespace and operators often split into their own tokens.
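The rule of thumb above is enough for back-of-the-envelope context budgeting. The constant and helper function below are assumptions for illustration, not part of any real API.

```python
# Rough context budgeting from the ~0.75 words/token heuristic above;
# WORDS_PER_TOKEN and fits_in_context are illustrative names.
WORDS_PER_TOKEN = 0.75

def fits_in_context(word_count: int, context_tokens: int = 100_000) -> bool:
    """Estimate whether a document of word_count words fits the window."""
    estimated_tokens = word_count / WORDS_PER_TOKEN
    return estimated_tokens <= context_tokens

print(fits_in_context(75_000))  # True:  75,000 words ~= 100,000 tokens
print(fits_in_context(90_000))  # False: 90,000 words ~= 120,000 tokens
```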
Different model families use different tokenizers: GPT models use byte-level BPE (via OpenAI's tiktoken), Llama 1 and 2 use SentencePiece, and Mistral NeMo introduced the Tekken tokenizer. Swapping tokenizers between training and inference produces gibberish, which is why tools like llama.cpp ship the model's tokenizer baked into the model file.
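A quick way to see the mismatch is to encode a string with one tokenizer and decode the resulting ids with another. The sketch below assumes Hugging Face transformers, and the two model IDs are arbitrary examples of different tokenizer families (WordPiece vs. byte-level BPE).

```python
# pip install transformers
from transformers import AutoTokenizer

bert = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece
gpt2 = AutoTokenizer.from_pretrained("gpt2")               # byte-level BPE

text = "Tokenizers are not interchangeable."
ids = bert.encode(text, add_special_tokens=False)

print(bert.decode(ids))  # round-trips (lowercased) through its own vocabulary
print(gpt2.decode(ids))  # same ids through the wrong vocabulary: gibberish
```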