Which local AI runtimes support Multi-Token Prediction (MTP)?

Reviewed May 15, 2026 · 1 min read
mtp · vllm · llama-cpp · ollama · runtimes

The answer

The short answer, with no hedging beyond what the data actually warrants.

Multi-Token Prediction lets a model emit multiple tokens per forward pass, materially boosting throughput when the runtime knows how to handle the multi-head output. Support varies sharply by runtime:

| Runtime | MTP support | Notes |
|---|---|---|
| vLLM (current builds) | ✅ Full | The reference implementation. Real throughput gains visible. Check release notes for the exact version that landed it. |
| llama.cpp (recent builds, post-MTP merge) | ✅ Full | CPU and GPU paths both work. Pin to a build dated after the MTP PR landed. |
| Ollama | ⏳ Partial | Wraps llama.cpp, but the multi-head decode path historically lags upstream. Check Ollama release notes for explicit MTP mentions before assuming a throughput gain. |
| MLX-LM | ⏳ Planned | Work in progress as of this writing; targeted for an upcoming release. |
| TensorRT-LLM | ✅ Full | NVIDIA's enterprise runtime; MTP is a first-class feature. |
| llama-cpp-python | ✅ Full (once upstream rolls forward) | Tracks llama.cpp; you get MTP after the wheels are rebuilt. |
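
To make the throughput claim above concrete, here is a toy back-of-the-envelope calculation (not any runtime's real API): even if an MTP forward pass costs somewhat more than a standard one, yielding two tokens per pass roughly halves the number of passes. The per-pass costs and the two-tokens-per-pass figure below are illustrative assumptions only, and real gains also depend on how many drafted tokens actually get accepted.

```python
# Toy illustration of the MTP throughput math -- no real runtime API involved.
# The per-pass costs are made-up numbers; verification/acceptance is ignored.

def decode_time_ms(new_tokens: int, tokens_per_pass: int, pass_cost_ms: float) -> float:
    """Total decode latency if every forward pass yields `tokens_per_pass` tokens."""
    passes = -(-new_tokens // tokens_per_pass)  # ceiling division
    return passes * pass_cost_ms

baseline = decode_time_ms(512, tokens_per_pass=1, pass_cost_ms=30.0)  # classic decode
mtp = decode_time_ms(512, tokens_per_pass=2, pass_cost_ms=33.0)       # MTP pass costs a bit more

print(f"single-token: {baseline:.0f} ms, MTP (2 tokens/pass): {mtp:.0f} ms")
print(f"speed-up: {baseline / mtp:.2f}x")
```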

What this means for operators:

  • If you're on vLLM, Qwen 3.6 / DeepSeek V3 / any MTP-trained model is a free throughput win.
  • If you're on Ollama, you can run MTP-trained models, but you're getting standard single-token decoding; no win until Ollama catches up with upstream llama.cpp.
  • For production serving, vLLM is the right pick for MTP workloads.

Sanity check: when you upgrade a runtime and want to confirm MTP is actually active, compare tokens-per-second over the same 5-second window with the same hardware, model, and prompt before and after the upgrade. A material speed-up is the visible signature; if the numbers match the MTP-off run, the runtime hasn't switched decode paths. We deliberately don't quote a specific multiplier: the gain depends on model architecture (MoE vs. dense), batch size, and how aggressively MTP is configured.
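
If you want that check as a repeatable script rather than eyeballing server logs, a minimal sketch follows. It assumes an OpenAI-compatible /v1/completions endpoint on localhost:8000 (vLLM, llama.cpp's server, and Ollama can all expose one) and that the server reports generated-token counts in the usage field; the URL, model id, and prompt are placeholders to swap for your own setup. It times one full completion rather than a rolling 5-second window, which is enough to see a material shift.

```python
# Minimal before/after tokens-per-second probe against an OpenAI-compatible
# completions endpoint. Run once before and once after the runtime upgrade,
# keeping hardware, model, and prompt identical.
import time

import requests

URL = "http://localhost:8000/v1/completions"  # placeholder: point at your server
PAYLOAD = {
    "model": "your-mtp-trained-model",         # placeholder model id
    "prompt": "Explain multi-token prediction in one paragraph.",
    "max_tokens": 256,
    "temperature": 0.0,                        # keep sampling fixed for a fair comparison
}

def tokens_per_second() -> float:
    start = time.perf_counter()
    resp = requests.post(URL, json=PAYLOAD, timeout=120)
    resp.raise_for_status()
    elapsed = time.perf_counter() - start
    # OpenAI-compatible servers usually report generated-token counts here.
    completion_tokens = resp.json()["usage"]["completion_tokens"]
    return completion_tokens / elapsed

if __name__ == "__main__":
    print(f"{tokens_per_second():.1f} tok/s")
```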

Where we got the numbers

  • vLLM v0.20.0 / v0.20.1 release notes (github.com/vllm-project/vllm/releases)
  • llama.cpp PR #5742 thread and the b9148 release
  • Ollama issue tracker, MTP-related discussions

Found this via a forum search? Bookmark the URL — we update these pages as new data lands. Have a question that should live here? Open a GitHub issue.