
Chapter 6: The Slow Squeeze

Why Progressive Pruning Beats One-Shot


Chapter Goals

By the end of this chapter, you will understand:

  • Why compressing a model all at once is suboptimal
  • The three progressive pruning strategies compared in the paper
  • How gradual structural changes lead to better optimization
  • Which schedule won — and by how much

Here's a question that cuts to the heart of how learning works:

If a company is cutting its budget from $10 million to $1 million, should it gut 90 percent in one quarter or phase the reductions over time, learning to adapt at each stage?

The SlimQwen researchers asked the same question of model compression. And the data was unambiguous.


One-Shot vs. Progressive

One-shot compression is exactly what it sounds like: take the giant model, remove everything you don't want in a single step, then train the result from that starting point.

Progressive compression breaks the transformation into stages. In the first stage, you compress partially and train a bit. In the second stage, you compress the rest and train some more.
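The two approaches differ only in how the prune-and-train steps are sequenced. Here is a minimal sketch of that control flow. The `ToyModel` class, the `prune` and `train` placeholders, and every shape number are hypothetical illustrations, not the paper's code or configuration:

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class ToyModel:
    """Hypothetical stand-in for a real network: only its shape and tokens seen."""
    layers: int
    width: int
    tokens_seen: int = 0  # billions of tokens trained on since pruning began


# One stage = (layers to keep, width to keep, training tokens in billions).
Stage = Tuple[int, int, int]


def prune(model: ToyModel, layers: int, width: int) -> ToyModel:
    """Placeholder: a real implementation would choose which layers/channels to drop."""
    return replace(model, layers=layers, width=width)


def train(model: ToyModel, tokens_b: int) -> ToyModel:
    """Placeholder: a real implementation would run recovery training here."""
    return replace(model, tokens_seen=model.tokens_seen + tokens_b)


def compress(model: ToyModel, stages: List[Stage]) -> ToyModel:
    """Apply each stage in order: prune to the stage's shape, then train."""
    for layers, width, tokens_b in stages:
        model = train(prune(model, layers, width), tokens_b)
    return model


# Made-up shapes, for illustration only.
big = ToyModel(layers=48, width=4096)
one_shot = compress(big, [(24, 2048, 400)])
progressive = compress(big, [(36, 3072, 40), (24, 2048, 360)])

# Same final shape and same total tokens; only the path differs.
print(one_shot == progressive)
```

In this toy version both paths end in an identical object, which is the point: the comparison holds the destination and the training budget fixed and varies only the route.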

The analogy to personal development is direct: if you're going from an active social life to a focused period of solitude, going cold turkey is more jarring than gradually reducing your commitments over months.


Applied Definitions

Progressive Pruning — Compressing a model in stages rather than all at once. After each pruning stage, the model is trained on additional data to recover before the next compression step.

Depth-First Schedule — A strategy that removes layers (depth) first, then shrinks the model's width afterward. Think of it as reducing the number of steps in the processing pipeline before making each step narrower.

Width-First Schedule — Shrinks the hidden dimensions first, then removes layers. Make each step smaller, then take some steps away entirely.

Joint Schedule — Reduces both depth and width simultaneously at each stage. Half the layer reduction and half the width reduction in stage one; the rest in stage two.

Token Budget — The total amount of training data measured in tokens. The paper compared one-shot (400B tokens) against progressive (40B + 360B = 400B total tokens). Same total budget, different allocation.


The Three Strategies

The researchers designed three two-stage schedules, each using 400B total tokens:

Strategy | Stage 1 | Stage 2
Depth-First | Remove half the target layers, keep width, train 40B tokens | Remove remaining layers + all width, train 360B tokens
Width-First | Halve the width, keep depth, train 40B tokens | Remove all target layers + remaining width, train 360B tokens
Joint | Halve both depth and width targets, train 40B tokens | Complete the remaining reductions, train 360B tokens

One-shot comparison: remove everything at once, train on all 400B tokens.
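The schedules can be written down as data and sanity-checked. The sketch below is an illustration, not the paper's configuration: it encodes each stage as the cumulative fraction of the target depth and width reduction applied so far, and it assumes that "halve the width" means applying half of the targeted width reduction. The names `SCHEDULES` and `check` are hypothetical.

```python
# Each stage: (fraction of target depth reduction applied so far,
#              fraction of target width reduction applied so far,
#              training tokens in billions).
# Assumption: "halve the width" is read as half of the *target* width reduction.
SCHEDULES = {
    "one-shot":    [(1.0, 1.0, 400)],
    "depth-first": [(0.5, 0.0, 40), (1.0, 1.0, 360)],
    "width-first": [(0.0, 0.5, 40), (1.0, 1.0, 360)],
    "joint":       [(0.5, 0.5, 40), (1.0, 1.0, 360)],
}


def check(stages):
    """Return total tokens; raise if the schedule stops short of the full target."""
    depth, width, _ = stages[-1]
    if (depth, width) != (1.0, 1.0):
        raise ValueError("schedule does not reach the target architecture")
    return sum(tokens for _, _, tokens in stages)


totals = {name: check(stages) for name, stages in SCHEDULES.items()}
print(totals)
```

Every schedule ends at the same target architecture and sums to 400B tokens, so any difference in results comes from the ordering of the reductions rather than from extra training.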


The Results

Every progressive strategy beat one-shot compression on both MMLU and MMLU-Redux.

Method | MMLU | MMLU-Redux | GSM-8K | MMLU gain vs. one-shot
One-shot | 75.86 | 75.41 | 85.22 | Baseline
Joint | 76.30 | 76.93 | 86.05 | +0.44
Width-First | 77.14 | 77.07 | 84.00 | +1.28
Depth-First (SlimQwen) | 77.39 | 78.01 | 85.82 | +1.53

The improvements are modest but consistent on the knowledge benchmarks. On MMLU-Redux (a challenging knowledge benchmark), depth-first gained 2.60 points over one-shot. On MMLU-Pro, the gain was 0.25 points. The one regression in the table is width-first on GSM-8K (84.00 versus 85.22 for one-shot); everywhere else, the progressive approaches outperformed the one-shot baseline.
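The per-benchmark gains can be recomputed directly from the table. The scores below are copied from it; the `gains` helper is a hypothetical name used only for this illustration.

```python
# Scores copied from the results table above.
SCORES = {
    "One-shot":    {"MMLU": 75.86, "MMLU-Redux": 75.41, "GSM-8K": 85.22},
    "Joint":       {"MMLU": 76.30, "MMLU-Redux": 76.93, "GSM-8K": 86.05},
    "Width-First": {"MMLU": 77.14, "MMLU-Redux": 77.07, "GSM-8K": 84.00},
    "Depth-First": {"MMLU": 77.39, "MMLU-Redux": 78.01, "GSM-8K": 85.82},
}


def gains(scores, baseline="One-shot"):
    """Return each method's score minus the baseline, per benchmark, to 2 decimals."""
    base = scores[baseline]
    return {
        method: {bench: round(vals[bench] - base[bench], 2) for bench in base}
        for method, vals in scores.items()
        if method != baseline
    }


deltas = gains(SCORES)
for method, row in deltas.items():
    print(method, row)
```

Depth-first comes out at +1.53 on MMLU, +2.60 on MMLU-Redux, and +0.60 on GSM-8K. The only negative entry is width-first on GSM-8K, at -1.22.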

Depth-first was the official winner, and the model it produced became the published "SlimQwen" release.


Why Gradual Pruning Works

The paper offers an insightful explanation: staged capacity reduction provides a smoother optimization trajectory for knowledge transfer.

Think of it this way: when you compress a model all at once, you create a large gap between what the model can currently do and what its new architecture requires. The training process has to bridge this gap from a cold start, which makes optimization difficult — the model is essentially learning to cope with a new body all at once.

Progressive pruning, by contrast, lets the model adapt gradually. After the first pruning stage, it has trained on 40B tokens and recovered much of its capability. When the second stage hits, the model is already partially adapted to a smaller architecture. The "gap" is smaller, making it easier for the training to fill in.


The Diminishing Returns of More Stages

The researchers also tested three-stage schedules (20B + 20B + 360B) to see if finer-grained compression would help even more. The answer: not really.

The three-stage results were comparable to the two-stage setup. Some benchmarks showed slight improvements; others showed slight regressions. Overall, the two-stage approach was sufficient.

This is a useful practical boundary. It suggests that some gradualness is valuable, but excessive granularity doesn't buy you much.


A Broader Intuition

Progressive compression taps into a general principle that appears everywhere in learning and adaptation: intermediate stabilization matters. When a system — biological, mechanical, or neural — undergoes a transformation, giving it time to consolidate before the next change improves outcomes. Language models, despite being mathematical constructs, follow this pattern too.


Where We Go From Here

We've seen why gradual beats sudden. Now it's time to look at the scoreboard. Chapter 7 brings all the numbers together and shows what SlimQwen actually achieved — in capability, in speed, and in deployment cost.
