Chapter 6: The Slow Squeeze

Why Progressive Pruning Beats One-Shot

Chapter Goals

By the end of this chapter, you will understand:

Why compressing a model all at once is suboptimal
The three progressive pruning strategies compared in the paper
How gradual structural changes lead to better optimization
Which schedule won — and by how much

Here's a question that cuts to the heart of how learning works:

If a company is cutting its budget from $10 million to $1 million, should it gut 90 percent in one quarter or phase the reductions over time, learning to adapt at each stage?

The SlimQwen researchers asked the same question of model compression. And the data was unambiguous.

One-Shot vs. Progressive

One-shot compression is exactly what it sounds like: take the giant model, remove everything you don't want in a single step, then train the result from that starting point.

Progressive compression breaks the transformation into stages. In the first stage, you compress partially and train a bit. In the second stage, you compress the rest and train some more.

The analogy to personal development is direct: if you're going from an active social life to a focused period of solitude, going cold turkey is more jarring than gradually reducing your commitments over months.

Applied Definitions

Progressive Pruning — Compressing a model in stages rather than all at once. After each pruning stage, the model is trained on additional data to recover before the next compression step.

Depth-First Schedule — A strategy that removes layers (depth) first, then shrinks the model's width afterward. Think of it as reducing the number of steps in the processing pipeline before making each step narrower.

Width-First Schedule — Shrinks the hidden dimensions first, then removes layers. Make each step smaller, then take some steps away entirely.

Joint Schedule — Reduces both depth and width simultaneously at each stage. Half the layer reduction and half the width reduction in stage one; the rest in stage two.

Token Budget — The total amount of training data measured in tokens. The paper compared one-shot (400B tokens) against progressive (40B + 360B = 400B total tokens). Same total budget, different allocation.

The Three Strategies

The researchers designed three two-stage schedules, each using 400B total tokens (40B in stage one, 360B in stage two):

| Strategy | Stage 1 | Stage 2 |
|----------|---------|---------|
| Depth-First | Remove half the target layers, keep width, train 40B tokens | Remove remaining layers + all width, train 360B tokens |
| Width-First | Halve the width, keep depth, train 40B tokens | Remove all target layers + remaining width, train 360B tokens |
| Joint | Halve both depth and width targets, train 40B tokens | Complete the remaining reductions, train 360B tokens |

One-shot comparison: remove everything at once, train on all 400B tokens.
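The schedules above can be written down as plain data, which makes the "same budget, different allocation" point easy to check. This is a minimal sketch, not the paper's configuration format: the `Stage` class, its field names, and the `SCHEDULES` dictionary are hypothetical, and the width-first entry assumes "halve the width" means completing half of the target width reduction.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    # Hypothetical fields: cumulative fraction of each target reduction
    # that has been applied once this stage's pruning step is done.
    depth_done: float
    width_done: float
    tokens_b: int  # training tokens for this stage, in billions


# Stage values transcribed from the table; one-shot is a single stage.
SCHEDULES = {
    "one_shot": [Stage(1.0, 1.0, 400)],
    "depth_first": [Stage(0.5, 0.0, 40), Stage(1.0, 1.0, 360)],
    "width_first": [Stage(0.0, 0.5, 40), Stage(1.0, 1.0, 360)],  # assumed reading
    "joint": [Stage(0.5, 0.5, 40), Stage(1.0, 1.0, 360)],
}


def total_tokens(stages: List[Stage]) -> int:
    """Sum the token budget across all stages (in billions)."""
    return sum(s.tokens_b for s in stages)


def validate(stages: List[Stage]) -> bool:
    """Fractions must never decrease, and the last stage must reach the full target."""
    prev_d = prev_w = 0.0
    for s in stages:
        if s.depth_done < prev_d or s.width_done < prev_w:
            return False
        prev_d, prev_w = s.depth_done, s.width_done
    return prev_d == 1.0 and prev_w == 1.0


for name, stages in SCHEDULES.items():
    print(name, total_tokens(stages), validate(stages))
```

Every schedule sums to the same 400B tokens, so any difference in results comes from how the pruning is staged rather than from extra training.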

The Results

Every progressive strategy beat one-shot compression on MMLU and MMLU-Redux. Every single one.

| Method | MMLU | MMLU-Redux | (unlabeled) | Gain vs. one-shot |
|--------|------|-----------|--------|-----------------|
| One-shot | 75.86 | 75.41 | 85.22 | Baseline |
| Joint | 76.30 | 76.93 | 86.05 | +0.44 on MMLU |
| Width-First | 77.14 | 77.07 | 84.00 | +1.28 on MMLU |
| Depth-First (SlimQwen) | 77.39 | 78.01 | 85.82 | +1.53 on MMLU |

The improvements aren't marginal noise. On MMLU-Redux (a challenging knowledge benchmark), depth-first gained 2.60 points over one-shot (78.01 vs. 75.41). On MMLU-Pro, the gain was 0.25 points. The one exception in the table is width-first, which trails one-shot in the third score column (84.00 vs. 85.22); everywhere else, the progressive approaches outperformed the one-shot baseline.
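The quoted gains can be recomputed directly from the table. The sketch below does only that arithmetic. The scores are copied from the table in its printed column order; the names `mmlu` and `mmlu_redux` are inferred from the gains quoted in the text, and `col3` is a placeholder because the third score column carries no label here.

```python
# Scores transcribed from the results table, in printed column order.
COLUMNS = ("mmlu", "mmlu_redux", "col3")  # col3: placeholder for the unlabeled column
SCORES = {
    "One-shot": (75.86, 75.41, 85.22),
    "Joint": (76.30, 76.93, 86.05),
    "Width-First": (77.14, 77.07, 84.00),
    "Depth-First": (77.39, 78.01, 85.82),
}


def gains_over(baseline: str = "One-shot") -> dict:
    """Return per-column score differences relative to the baseline row."""
    base = SCORES[baseline]
    return {
        name: {col: round(score - ref, 2) for col, score, ref in zip(COLUMNS, row, base)}
        for name, row in SCORES.items()
        if name != baseline
    }


for name, diff in gains_over().items():
    print(f"{name:12s}", diff)
```

Depth-first comes out at +1.53 and +2.60 on the first two columns, matching the figures in the text, while width-first is the only row with a negative difference (in the third column).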

Depth-first was the official winner, and the model it produced became the published "SlimQwen" release.

Why Gradual Pruning Works

The paper offers an insightful explanation: staged capacity reduction provides a smoother optimization trajectory for knowledge transfer.

Think of it this way: when you compress a model all at once, you create a large gap between what the model can currently do and what its new architecture requires. The training process has to bridge this gap from a cold start, which makes optimization difficult — the model is essentially learning to cope with a new body all at once.

Progressive pruning, by contrast, lets the model adapt gradually. After the first pruning stage, it trains on 40B tokens and recovers much of its capability. By the time the second stage arrives, the model is already partially adapted to a smaller architecture, so the remaining "gap" is smaller and easier for training to close.
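The control flow of "prune a little, recover, prune the rest" can be sketched as a short loop. This is an illustration of the staging idea only, not the paper's implementation: the `Shape` class, the `interpolate` and `run_schedule` helpers, and the 48-layer and 24-layer sizes are all hypothetical, and the `train` callback is a stub that records what it was asked to do.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Shape:
    layers: int
    width: int


def interpolate(start: Shape, target: Shape, depth_done: float, width_done: float) -> Shape:
    """Architecture after applying the given cumulative fractions of each reduction."""
    layers = round(start.layers - depth_done * (start.layers - target.layers))
    width = round(start.width - width_done * (start.width - target.width))
    return Shape(layers, width)


def run_schedule(
    start: Shape,
    target: Shape,
    stages: Sequence[Tuple[float, float, int]],
    train: Callable[[Shape, int], None],
) -> List[Shape]:
    """Prune to each stage's shape, then train on that stage's token budget."""
    shapes = []
    for depth_done, width_done, tokens_b in stages:
        shape = interpolate(start, target, depth_done, width_done)
        train(shape, tokens_b)  # recovery training before the next pruning step
        shapes.append(shape)
    return shapes


# Hypothetical sizes, chosen only to make the stages visible.
start, target = Shape(layers=48, width=4096), Shape(layers=24, width=2048)
depth_first = [(0.5, 0.0, 40), (1.0, 1.0, 360)]  # (depth_done, width_done, tokens in B)

log: List[Tuple[Shape, int]] = []
shapes = run_schedule(start, target, depth_first, lambda s, t: log.append((s, t)))
print(shapes)
```

With these made-up sizes, the model first trains at 36 layers and full width before dropping to the final 24-layer, 2048-wide shape, so the second pruning step starts from a model that has already adjusted to a smaller architecture.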

The Diminishing Returns of More Stages

The researchers also tested three-stage schedules (20B + 20B + 360B) to see if finer-grained compression would help even more. The answer: not really.

The three-stage results were comparable to the two-stage setup. Some benchmarks showed slight improvements; others showed slight regressions. Overall, the two-stage approach was sufficient.

This is a useful practical boundary. It suggests that some gradualness is valuable, but excessive granularity doesn't buy you much.

A Broader Intuition

Progressive compression taps into a general principle that appears everywhere in learning and adaptation: intermediate stabilization matters. When a system — biological, mechanical, or neural — undergoes a transformation, giving it time to consolidate before the next change improves outcomes. Language models, despite being mathematical constructs, follow this pattern too.

Where We Go From Here

We've seen why gradual beats sudden. Now it's time to look at the scoreboard. Chapter 7 brings all the numbers together and shows what SlimQwen actually achieved — in capability, in speed, and in deployment cost.