← Back to Red Deer Investments  ·  AI Library Home

Multi-Token Prediction: What If the Main Model Did the Drafting?

Speculative decoding requires a second model — either a separate one or attached draft heads. But there's a more elegant approach: what if you simply trained the main model to predict multiple future tokens as part of its normal operation?

This is Multi-Token Prediction (MTP). Meta FAIR published the foundational paper in 2024, showing you could modify a standard transformer to predict the next 2nd, 3rd, and 4th tokens simultaneously with the main prediction. You add extra output heads — tiny extra layers — but they all share the same underlying model parameters and vocabulary embedding, so the overhead is minimal.

During training, the model learns not just "what comes next" but "what comes after that too." This forces the model to develop representations that are more forward-looking, richer, more structured. A chess player who considers three moves ahead develops a different — and better — understanding of the board than one who only considers the immediate next move. The same logic applies to language models.

Meta's results were striking: 3.0x speedup on code, 2.7x on text, with the extra benefit that the model's representations improved even for standard next-token prediction. MTP-trained models did better on benchmarks that had nothing to do with speed.

DeepSeek adopted MTP in their V3 and R1 models, but with a twist: instead of parallel independent heads, they chained the predictions serially. The model predicts token t+1. Then an MTP module takes that prediction and the internal state to predict t+2. Then t+3. Each prediction builds on the one before it, preserving the causal chain. This is more accurate than parallel prediction (since later tokens benefit from knowing earlier ones) but adds a small sequential dependency.

DeepSeek's results: 85--90% acceptance rate for the first draft token, yielding a 1.8x throughput improvement on their 671B-parameter MoE model. The acceptance rates were high enough that the system could run MTP as a production optimization, not just a research curiosity.

← Previous Next →