Speculative Decoding
Speculative decoding speeds up LLM inference by using a small, fast "draft" model to propose the next several tokens, then verifying them all in parallel with the large "target" model. When the draft is right (which it often is for routine tokens), several tokens get committed per target pass; when it's wrong, the accepted prefix is kept and the target's own prediction is used at the first mismatch. The output is therefore identical to what the target model would have generated on its own, just produced faster.
The key insight: verifying N tokens with the target model takes only one forward pass, because a transformer computes logits for every position in parallel and each position depends only on the tokens before it. Generating those same N tokens autoregressively would take N passes. The draft model burns extra compute, but it saves more by cutting the number of target-model passes.
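To make the loop concrete, here is a minimal, self-contained Python sketch. The "models" are toy deterministic functions over integer tokens, not real LLMs, and acceptance is by exact match, which is the greedy-decoding special case; the full algorithm uses a modified rejection-sampling step on the two models' probabilities so that sampled output also matches the target distribution.

```python
import random

random.seed(0)

# Toy stand-ins for real models: each maps a token sequence (list of
# ints) to a "next token". The draft agrees with the target ~80% of
# the time, mimicking a small model that is usually right on easy tokens.
def target_model(tokens):
    return (sum(tokens) * 2654435761) % 100

def draft_model(tokens):
    guess = target_model(tokens)
    return guess if random.random() < 0.8 else (guess + 1) % 100

def speculative_decode(prompt, n_new, k=4):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) < len(prompt) + n_new:
        # 1. Draft k tokens autoregressively with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2. Verify: the target scores every drafted position. Written
        #    as k calls here for clarity; on a GPU all k positions fit
        #    in ONE forward pass, which is the whole trick.
        target_passes += 1
        verified = [target_model(tokens + draft[:i]) for i in range(k)]
        # 3. Accept the longest prefix where draft == target, then take
        #    the target's own token at the first mismatch, so output is
        #    identical to running the target alone. (Real implementations
        #    also grab one "bonus" target token when all k are accepted.)
        n_ok = 0
        while n_ok < k and draft[n_ok] == verified[n_ok]:
            n_ok += 1
        tokens += draft[:n_ok]
        if n_ok < k:
            tokens.append(verified[n_ok])
    return tokens[: len(prompt) + n_new], target_passes

out, passes = speculative_decode([1, 2, 3], n_new=32)
print(f"32 tokens in {passes} target passes")  # well under 32 when drafts hit
```

With an 80% per-token hit rate and k=4, each target pass commits several tokens on average, which is exactly where the speedup comes from.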
For local AI: pair a ~1B draft model with a 7B-70B target from the same family (same tokenizer, similar training data). llama.cpp supports this via --model-draft (-md); vLLM via --speculative-model in older releases, since folded into --speculative-config. Real speedups vary from roughly 1.5× to 3× depending on workload: code completion, with its predictable boilerplate, benefits most; high-temperature creative writing benefits least because fewer draft tokens get accepted.
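As a rough illustration, here is what such a pairing looks like with vLLM's offline Python API. The model IDs are just one plausible same-family pairing, and the keyword names (speculative_model, num_speculative_tokens) follow older vLLM releases; newer versions group them into a single speculative_config dict, so check the docs for your installed version.

```python
from vllm import LLM, SamplingParams

# Illustrative same-family pairing: a 1B Llama draft for an 8B Llama
# target (they share a tokenizer, which speculative decoding requires).
# NOTE: kwarg names follow older vLLM releases; newer versions take a
# single speculative_config dict instead. Consult your version's docs.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",              # target
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",  # draft
    num_speculative_tokens=5,  # tokens the draft proposes per step
)

# Code completion is the workload where acceptance rates run highest.
prompts = ["def binary_search(arr, x):"]
for out in llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=128)):
    print(out.outputs[0].text)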