
Chapter 5: The Recovery Training

Teaching a Compressed Model to Perform Again


Chapter Goals

By the end of this chapter, you will understand:

  • Why a compressed model needs retraining after pruning
  • The different training objectives evaluated in the paper
  • Why combining distillation with language modeling loss works better than either alone
  • What multi-token prediction is and why it's exciting

After surgery comes rehabilitation. The SlimQwen paper's researchers were emphatic about this point: you cannot prune a model and expect it to perform well out of the gate. The compression process inevitably degrades performance. The question is how to train the model back to health.


The Recovery Problem

Imagine a surgeon removing 75% of a marathon runner's muscle tissue. Even if the surgeon only removed the "least important" tissue, the runner isn't going to run their personal best the next day. They need a structured recovery program.

Similarly, when you prune a language model — deleting layers, shrinking dimensions, merging experts — you've removed capacity that was contributing to the model's predictions. The model must recover lost capability through continued training.


Applied Definitions

Language Modeling (LM) Loss — The standard training objective for language models. The model predicts the next word in a sequence. If it gets it right, great. If wrong, the error teaches it to adjust. This is how the original model was trained and how most models learn.

Knowledge Distillation (KD) — Instead of the model learning from the training data directly, it learns by mimicking what a larger teacher model would output. The student model gets not just the correct answer but the teacher's "soft" probability distribution — its sense of how likely each possible word was. This is richer information than a simple right/wrong label.
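To make the contrast between the two objectives concrete, here is a minimal sketch in plain Python. It assumes the distillation term is a forward KL divergence from teacher to student, which is one common choice; the chapter does not say which divergence the paper uses. The logits and the 4-token vocabulary are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lm_loss(student_logits, correct_index):
    """Cross-entropy against a hard label: only the correct token counts."""
    return -math.log(softmax(student_logits)[correct_index])

def kd_loss(student_logits, teacher_logits):
    """KL(teacher || student): the teacher's probability for every token counts."""
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    return sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)

# Toy 4-token vocabulary (illustrative numbers); token 2 is the correct next token.
teacher_logits = [1.0, 0.5, 3.0, -1.0]
student_logits = [0.2, 0.1, 1.5, 0.0]

print(f"LM loss: {lm_loss(student_logits, 2):.4f}")
print(f"KD loss: {kd_loss(student_logits, teacher_logits):.4f}")
```

The LM loss looks only at the probability the student gave the correct token. The KD loss compares the whole distribution, so it also penalizes the student for ranking the wrong tokens differently from the teacher.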

Next-Token Prediction (NTP) — The standard approach: given all words so far, predict the very next word. Every language model does this by default.

Multi-Token Prediction (MTP) — An extension: don't just predict the next word — predict the next two words, three words, four words simultaneously. It forces the model to think further ahead and improves its understanding of how sequences unfold.
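As a rough illustration of what the training targets look like, the sketch below builds them from a token list. The function name `multi_token_targets` and the `horizon` parameter are hypothetical, chosen for this example; with `horizon=1` it reduces to ordinary next-token prediction.

```python
def multi_token_targets(tokens, horizon):
    """Pair each context with the next `horizon` tokens the model should predict.

    Positions too close to the end, which lack a full set of future tokens,
    are skipped.
    """
    examples = []
    for i in range(len(tokens) - horizon):
        context = tokens[: i + 1]
        targets = tokens[i + 1 : i + 1 + horizon]
        examples.append((context, targets))
    return examples

sentence = ["the", "model", "predicts", "several", "tokens"]
for context, targets in multi_token_targets(sentence, horizon=2):
    print(context, "->", targets)
```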

Speculative Decoding — An inference technique where a small "draft" model generates multiple candidate tokens, and the main model quickly verifies them. It's like a writer drafting several sentences at once and an editor approving the good ones in a single pass. Much faster than generating one word at a time.
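The verification step can be sketched in a few lines. This is the simplest exact-match (greedy) variant, shown as an assumption for illustration; production systems often use a probabilistic acceptance rule instead. The function name `count_accepted` is hypothetical.

```python
def count_accepted(draft_tokens, verifier_tokens):
    """Return how many leading draft tokens the main model agrees with.

    verifier_tokens[i] is the token the main model would pick at step i,
    given the context plus draft_tokens[:i]. Counting stops at the first
    disagreement, because later drafts were conditioned on a rejected token.
    """
    accepted = 0
    for drafted, verified in zip(draft_tokens, verifier_tokens):
        if drafted != verified:
            break
        accepted += 1
    return accepted

draft = ["the", "cat", "sat", "on"]
main = ["the", "cat", "lay", "on"]
print(count_accepted(draft, main))  # 2: the match on "on" comes after a rejection
```

The more leading tokens the draft gets right, the more tokens are produced per verification pass, which is why draft quality determines the speedup.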


Training Objective Shootout

The paper tested five different training configurations after compression:

| Training Configuration | MMLU | MMLU-Pro | GSM-8K | Notes |
|---|---|---|---|---|
| NTP KD alone | 74.16 | 50.97 | 84.27 | Pure distillation baseline; highest GSM-8K |
| NTP KD + LM Loss | 74.93 | 51.44 | 82.98 | +0.77 on MMLU; LM loss reinforces factual knowledge |
| NTP KD + MTP KD | 75.13 | 51.94 | 82.34 | Multi-token teacher guidance lifts MMLU and MMLU-Pro (highest MMLU-Pro); GSM-8K drops |
| NTP KD + LM + MTP Loss | 75.29 | 51.16 | 83.02 | Self-supervision on multiple horizons |
| Full: all four terms combined | 75.67 | 51.19 | 83.17 | Highest MMLU; GSM-8K still below the baseline |
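Because the rows differ by fractions of a point, it helps to compute the gaps explicitly rather than eyeball them. The short script below uses only the scores from the table; the helper names are made up for this example.

```python
scores = {
    "NTP KD alone": (74.16, 50.97, 84.27),
    "NTP KD + LM Loss": (74.93, 51.44, 82.98),
    "NTP KD + MTP KD": (75.13, 51.94, 82.34),
    "NTP KD + LM + MTP Loss": (75.29, 51.16, 83.02),
    "Full (all four terms)": (75.67, 51.19, 83.17),
}
benchmarks = ("MMLU", "MMLU-Pro", "GSM-8K")
baseline = scores["NTP KD alone"]

def deltas_vs_baseline(name):
    """Score change on each benchmark relative to pure distillation."""
    return tuple(round(s - b, 2) for s, b in zip(scores[name], baseline))

def best_config(benchmark):
    """Name of the configuration with the highest score on one benchmark."""
    col = benchmarks.index(benchmark)
    return max(scores, key=lambda name: scores[name][col])

for name in scores:
    print(f"{name:24s}", deltas_vs_baseline(name))
for bench in benchmarks:
    print(f"best on {bench}: {best_config(bench)}")
```

Every added term raises MMLU and MMLU-Pro over the baseline, while GSM-8K falls in every case, and a different configuration leads on each benchmark.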

Key takeaways:

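  1. Adding LM loss to distillation improves knowledge-intensive benchmarks. MMLU rises from 74.16 to 74.93 and MMLU-Pro from 50.97 to 51.44, so pure distillation (the model mimicking a teacher) isn't enough on its own. The model also needs to practice predicting the actual next token, which reinforces factual knowledge. The trade-off is GSM-8K, which drops from 84.27 to 82.98.
  2. Multi-token prediction knowledge distillation helps on the knowledge benchmarks. When the teacher model teaches the student not just "what's the next word" but "what are the next 2, 3, 4 words," the student develops a richer understanding of context and sequence patterns. Added to the baseline, MTP KD raises MMLU and MMLU-Pro while GSM-8K falls to 82.34; added on top of NTP KD + LM + MTP Loss, it raises all three scores, though only slightly.
  3. The combination of all techniques gives the highest MMLU, but not a clean sweep. The full objective (NTP KD + LM + MTP Loss + MTP KD) reaches 75.67 on MMLU, the best of any configuration. NTP KD + MTP KD still leads on MMLU-Pro (51.94), and the pure distillation baseline still leads on GSM-8K (84.27).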
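One way to write the full objective is as a weighted sum of the four terms. This form is an assumption for illustration: the chapter names the terms but does not give the paper's exact formula, and the weight symbols below are hypothetical.

```latex
\mathcal{L}_{\text{full}}
  = \lambda_{\text{KD}}\,\mathcal{L}_{\text{NTP-KD}}
  + \lambda_{\text{LM}}\,\mathcal{L}_{\text{LM}}
  + \lambda_{\text{MTP}}\,\mathcal{L}_{\text{MTP}}
  + \lambda_{\text{MTP-KD}}\,\mathcal{L}_{\text{MTP-KD}}
```

Under this reading, each row of the table corresponds to setting some of the weights to zero. The pure distillation baseline keeps only the first term.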

The Multi-Token Prediction Breakdown: Why Prediction Gets Harder

Here's the elegant part about MTP KD that might not be obvious: predicting multiple tokens forward forces the model to develop better internal representations. It's the difference between planning your next move in chess versus planning your next three moves. The latter requires a fundamentally deeper understanding of the board.

MTP KD specifically uses the teacher model's probability distributions for future tokens — not just the correct tokens — so the student learns the teacher's uncertainty and confidence about upcoming content.
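The difference between learning from correct tokens and learning from the teacher's distributions can be sketched as follows. This assumes a forward KL divergence averaged over future offsets, which is an illustrative choice rather than the paper's confirmed formulation; all probabilities are made up.

```python
import math

def kl_divergence(p_teacher, p_student):
    """KL(teacher || student) between two distributions over the vocabulary."""
    return sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)

def mtp_kd_loss(teacher_future, student_future):
    """Average the KL term over every future offset.

    Each argument holds one probability distribution per future offset.
    """
    terms = [kl_divergence(t, s) for t, s in zip(teacher_future, student_future)]
    return sum(terms) / len(terms)

# Toy 3-token vocabulary and two future offsets (illustrative numbers).
teacher_future = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2]]
student_future = [[0.5, 0.3, 0.2], [0.3, 0.3, 0.4]]
hard_labels = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]  # correct tokens only

print(f"MTP KD (soft targets):   {mtp_kd_loss(teacher_future, student_future):.4f}")
print(f"MTP loss (hard targets): {mtp_kd_loss(hard_labels, student_future):.4f}")
```

With hard labels the KL term collapses to ordinary cross-entropy on the correct token. The soft targets additionally tell the student that, at the second offset, the teacher considered two tokens equally likely.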

The paper demonstrates this is not just a training nicety: it has practical benefits for inference speed.


Better Speculative Decoding

Speculative decoding uses a "draft" model to generate multiple candidate tokens at once, then checks them against the main model. The catch: the draft model needs to be good at guessing multiple tokens correctly for this to be worthwhile.

MTP KD makes the draft model significantly better at this. The paper's results show higher acceptance rates at both training stages:

| Stage | Approach | 1-Token Acceptance | 4-Token Acceptance |
|---|---|---|---|
| Pretrain | MTP Loss | 56.31% | 4.09% |
| Pretrain | MTP KD | 68.60% | 8.24% |
| SFT | MTP Loss | 64.68% | 3.91% |
| SFT | MTP KD | 69.49% | 6.59% |

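The 4-token acceptance rate roughly doubled for pretraining (4.09% to 8.24%) and rose by about two thirds after SFT (3.91% to 6.59%). For pretraining, that means about twice as many four-token drafts are accepted in full, which speeds up generation without sacrificing quality, because the main model still verifies every token.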
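The size of the improvement at each stage can be checked with a few lines of arithmetic on the table's figures. The helper name `improvement` is made up for this example.

```python
acceptance = {
    ("Pretrain", "MTP Loss"): (56.31, 4.09),
    ("Pretrain", "MTP KD"): (68.60, 8.24),
    ("SFT", "MTP Loss"): (64.68, 3.91),
    ("SFT", "MTP KD"): (69.49, 6.59),
}

def improvement(stage):
    """Return (1-token gain in percentage points, 4-token ratio) for MTP KD vs MTP Loss."""
    loss_1, loss_4 = acceptance[(stage, "MTP Loss")]
    kd_1, kd_4 = acceptance[(stage, "MTP KD")]
    return round(kd_1 - loss_1, 2), round(kd_4 / loss_4, 2)

for stage in ("Pretrain", "SFT"):
    gain, ratio = improvement(stage)
    print(f"{stage}: 1-token +{gain} points, 4-token x{ratio}")
```

The relative gain is much larger for four-token drafts than for single tokens, which is where teacher guidance on future tokens would be expected to matter most.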

The takeaway: MTP KD isn't just improving the model's knowledge — it's improving the model's mechanics of how it generates text.


Training Hyperparameters: The Practical Details

For completeness, the paper trained models with:

  • 120B or 400B tokens (two different training budgets)
  • Learning rates starting at 0.0004, decaying to 0.00003
  • KD loss weight decaying linearly from 1.0 to 0.75 — meaning the teacher's influence gradually decreases as the student becomes more confident
  • MTP distillation weight decaying via a cosine schedule from 0.3 to 0.1

The gradual decay of distillation makes sense: early in training, the student needs heavy guidance from the teacher. As it develops competence, it should rely more on its own predictions.
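The two weight schedules can be sketched as simple functions of training progress. The endpoints come from the list above; the step count is a hypothetical placeholder, since the chapter gives token budgets rather than step counts, and `mtp_kd_weight` assumes that "MTP distillation weight" refers to the MTP KD term.

```python
import math

def linear_decay(step, total_steps, start, end):
    """Move from `start` to `end` in a straight line over training."""
    progress = min(step / total_steps, 1.0)
    return start + (end - start) * progress

def cosine_decay(step, total_steps, start, end):
    """Move from `start` to `end` along half a cosine wave."""
    progress = min(step / total_steps, 1.0)
    return end + 0.5 * (start - end) * (1 + math.cos(math.pi * progress))

TOTAL_STEPS = 10_000  # hypothetical value, not taken from the paper

for step in (0, 2_500, 5_000, 7_500, 10_000):
    kd_weight = linear_decay(step, TOTAL_STEPS, 1.0, 0.75)
    mtp_kd_weight = cosine_decay(step, TOTAL_STEPS, 0.3, 0.1)
    print(f"step {step:>6}: KD weight {kd_weight:.3f}, MTP KD weight {mtp_kd_weight:.3f}")
```

The cosine schedule changes slowly at the start and end of training and fastest in the middle, while the linear schedule changes at a constant rate throughout.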


The Timing Question

The training recipe is solid. But there's one more lever to pull: timing. Should you compress and then train all at once, or gradually? Chapter 6 shows why the gradual approach wins decisively — and by how much.


Next: Chapter 6 — The Slow Squeeze
