Which local AI runtimes support Multi-Token Prediction (MTP)?
The answer
One paragraph. No hedging beyond what the data actually warrants.
Multi-Token Prediction lets a model emit multiple tokens per forward pass via extra prediction heads, materially boosting throughput when the runtime knows how to consume the multi-head output (the toy sketch after the table shows why the decode loop matters). Support varies sharply by runtime:
| Runtime | MTP support | Notes |
|---|---|---|
| vLLM (current builds) | ✅ Full | The reference implementation; measurable throughput gains in practice. Check the release notes for the exact version that landed it. |
| llama.cpp (recent builds, post-MTP merge) | ✅ Full | CPU + GPU paths both work. Pin to a build dated after the MTP PR landed. |
| Ollama | ⏳ Partial | Wraps llama.cpp but the multi-head decode path historically lags upstream. Check Ollama release notes for explicit MTP mentions before assuming throughput gain. |
| MLX-LM | ⏳ Planned | Work-in-progress as of this writing; targeted for an upcoming release. |
| TensorRT-LLM | ✅ Full | NVIDIA's enterprise runtime — MTP is a first-class feature. |
| llama-cpp-python | ✅ Full (tracks llama.cpp) | Bindings follow upstream llama.cpp; you get MTP once the wheels are rebuilt against a post-merge build. |
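To make the table concrete, here is a toy Python sketch of why the runtime matters. This is not any runtime's real API; every name in it is hypothetical. The point it illustrates: an MTP model hands back K draft tokens per forward pass, but only an MTP-aware decode loop consumes more than the first.

```python
# Toy illustration (no real runtime's API; all names hypothetical):
# an MTP model's forward pass returns predictions from K heads, where
# head 0 predicts position t+1, head 1 predicts t+2, and so on.

K = 3  # number of prediction heads the model was trained with

def forward(ctx):
    """Stand-in for one forward pass; returns K draft tokens.
    A real MTP model returns logits per head; we fake it deterministically."""
    return [(sum(ctx) + i) % 100 for i in range(1, K + 1)]

def decode_single_head(ctx, n):
    """MTP-unaware runtime: uses only head 0, so one token per forward pass."""
    passes = 0
    while len(ctx) < n:
        ctx = ctx + [forward(ctx)[0]]
        passes += 1
    return ctx, passes

def decode_mtp(ctx, n):
    """MTP-aware runtime: drafts K tokens per pass. This toy accepts every
    draft token; real runtimes verify the draft so output stays identical
    to single-token decoding."""
    passes = 0
    while len(ctx) < n:
        draft = forward(ctx)
        ctx = ctx + draft[: n - len(ctx)]
        passes += 1
    return ctx, passes

_, passes_a = decode_single_head([1, 2, 3], 12)
_, passes_b = decode_mtp([1, 2, 3], 12)
print(passes_a, passes_b)  # 9 vs 3 forward passes for the same 9 new tokens
```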
What this means for operators:
- If you're on vLLM, Qwen 3.6 / DeepSeek V3 / any MTP-trained model is a free throughput win.
- If you're on Ollama, you can run MTP-trained models, but you're getting standard single-token output: no win until Ollama pulls in the upstream multi-head decode path.
- For production serving, vLLM is the right pick for MTP workloads; a minimal config sketch follows this list.
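The shape of a vLLM MTP config looks like the sketch below. Treat it as a hedged example: `speculative_config` and the `deepseek_mtp` method name match recent vLLM documentation, but vLLM's speculative-decoding interface has shifted between releases, so confirm the exact keys against the docs for the version you actually run.

```python
# Hedged sketch: enabling MTP-based speculative decoding via vLLM's Python API.
# The speculative_config keys shown here match recent vLLM docs but are an
# assumption for your specific version; verify before relying on them.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",   # an MTP-trained model
    tensor_parallel_size=8,            # size to your hardware
    speculative_config={
        "method": "deepseek_mtp",      # use the model's own MTP heads
        "num_speculative_tokens": 1,   # tokens drafted per step
    },
)

params = SamplingParams(temperature=0.0, max_tokens=256)
out = llm.generate(["Explain MTP in one sentence."], params)
print(out[0].outputs[0].text)
```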
Sanity check: when you upgrade a runtime and want to confirm MTP is actually active, watch tokens-per-second over a 5-second window with the same hardware, model, and prompt before and after the upgrade (a minimal measurement script follows). A material speed-up is the visible signature; if the numbers are identical to MTP-off, the runtime hasn't switched paths. We deliberately don't quote a specific multiplier: the gain depends on model architecture (MoE vs dense), batch size, and how aggressively MTP is configured.
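One way to run that check without touching runtime internals is to time completions through the OpenAI-compatible HTTP API that both vLLM and Ollama expose. The sketch below is a coarse probe, not a benchmark harness: the endpoint URL and model name are placeholders you must set, it times whole completions rather than a fixed window, and the elapsed time includes prompt processing. On the same prompt, that's still enough to see whether tokens/sec moved after an upgrade.

```python
# Coarse before/after throughput probe against an OpenAI-compatible endpoint.
# URL, model name, and prompt are placeholders (assumptions); run with
# identical hardware, model, and prompt pre- and post-upgrade.
import time
import requests

URL = "http://localhost:8000/v1/completions"   # your server's endpoint
PAYLOAD = {
    "model": "deepseek-v3",                    # your served model name
    "prompt": "Write a 300-word summary of speculative decoding.",
    "max_tokens": 512,
    "temperature": 0.0,                        # deterministic-ish for a fair A/B
}

def tokens_per_second(runs=3):
    """Average completion tokens/sec over a few runs; includes prefill time."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        resp = requests.post(URL, json=PAYLOAD, timeout=300)
        resp.raise_for_status()
        elapsed = time.perf_counter() - start
        completion = resp.json()["usage"]["completion_tokens"]
        rates.append(completion / elapsed)
    return sum(rates) / len(rates)

print(f"{tokens_per_second():.1f} tok/s")  # record this pre- and post-upgrade
```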
Where we got the numbers
vLLM v0.20.0 / v0.20.1 release notes (github.com/vllm-project/vllm/releases). llama.cpp PR #5742 thread + b9148 release. Ollama issue tracker for MTP-related discussions.
Also see
The runtime-support question, applied to the specific upgrade decision.
The reference MTP implementation. Editorial verdict + setup guidance.
The model that's making MTP a hot topic in May 2026.
What multi-token prediction actually is at the model-architecture level.
Other questions in this thread
Other /q/ landings on the same topic — same editorial discipline.
Found this via a forum search? Bookmark the URL — we update these pages as new data lands. Have a question that should live here? Open a GitHub issue.