Chapter 3: The Art of the Cut

Pruning and How to Decide What Stays

Chapter Goals

By the end of this chapter, you will understand:

The three types of structural pruning applied to MoE models
How researchers determine which parts of a model are "important" and which are expendable
Why removing the last layers works better than removing the middle ones
The surprising finding about pruning initialization — and why it matters

Pruning a neural network sounds violent — and in a way, it is. You're deliberately removing parts of something that took enormous resources to build. But done correctly, pruning is surgical, not destructive. The key question is always: what do you cut, and how do you decide?

Depth Pruning: Removing Layers

Think of a language model's layers as stations on an assembly line. Each station takes in information, does some processing, and passes it along. The early stations handle basic pattern recognition, the middle stations work on semantic understanding, and the later stations refine the output.

The paper compared two approaches on a smaller test model (15A2B, 24 layers — results in the appendix):

Activation similarity — find layers that produce nearly identical outputs and remove the redundant ones (these tend to be in the middle of the model)
Last-layer pruning — simply remove the tail end of the model

The result was decisive. Removing the last layers caused only minor performance drops, but removing middle layers based on activation similarity led to catastrophic failures: on MMLU (a general knowledge test), for example, the score fell from 75.62 to 41.95. The team then confirmed that the same pattern held at scale on the main 80B model, where last-layer pruning remained the clear winner.

The lesson: the later layers are genuinely more expendable than the early and middle ones. This aligns with an emerging understanding that deeper layers in language models contribute less unique value than previously assumed.
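To make the two selection rules concrete, here is a minimal sketch in plain Python. It assumes that "activation similarity" means the cosine similarity between a layer's input and its output, which is one common choice but not necessarily the paper's exact metric. The function names and the toy hidden states are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def layers_by_similarity(hidden_states, n_remove):
    """Pick the layers whose output is most similar to their input.

    hidden_states[i] is the activation entering layer i, and
    hidden_states[i + 1] is the activation leaving it (assumed layout).
    """
    sims = [
        (cosine_similarity(hidden_states[i], hidden_states[i + 1]), i)
        for i in range(len(hidden_states) - 1)
    ]
    # Highest similarity first: these layers change the representation least.
    sims.sort(key=lambda pair: pair[0], reverse=True)
    return sorted(i for _, i in sims[:n_remove])

def last_layers(n_layers, n_remove):
    """Pick the final n_remove layers, with no measurement at all."""
    return list(range(n_layers - n_remove, n_layers))

# Toy 4-layer model with 2-dimensional hidden states (made-up numbers).
states = [[1.0, 0.0], [1.0, 1.0], [1.0, 1.1], [1.0, 1.2], [0.0, 1.0]]
print(layers_by_similarity(states, n_remove=2))  # [1, 2]
print(last_layers(n_layers=4, n_remove=2))       # [2, 3]
```

In this toy setup the similarity rule selects the middle layers, mirroring the tendency described above, while last-layer pruning needs no calibration pass at all.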

Width Pruning: Shrinking the Model's Brain

Width pruning reduces the "hidden dimension" — the internal representation size used throughout the model. If you think of each layer as a workspace, width pruning makes the workbench smaller.

Here's how the researchers decided which parts of the workbench to keep:

1. Run calibration data through the model
2. Measure the mean absolute activation for each hidden dimension (essentially, how "active" each internal channel is)
3. Keep the dimensions with the highest activity across all layers
4. Discard the quiet ones

It's like mapping which roads carry the most traffic in a city, then closing down the least-used streets. The data flows through the well-traveled routes anyway; the others barely see any use.
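The selection steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: the activations are pooled into one flat list of vectors, and the helper names and numbers are hypothetical.

```python
def mean_abs_activation(activations):
    """Average |activation| per hidden dimension.

    activations: list of vectors (one per token, pooled over all layers),
    each a list of hidden_dim floats collected from calibration data.
    """
    hidden_dim = len(activations[0])
    totals = [0.0] * hidden_dim
    for vec in activations:
        for d, value in enumerate(vec):
            totals[d] += abs(value)
    return [t / len(activations) for t in totals]

def select_dimensions(scores, keep):
    """Indices of the `keep` most active dimensions, in original order."""
    ranked = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
    return sorted(ranked[:keep])

def prune_vector(vec, kept):
    """Drop every entry whose dimension was not selected."""
    return [vec[d] for d in kept]

# Toy calibration activations: dimensions 0 and 2 are busy, 1 and 3 are quiet.
calibration = [
    [0.9, -0.01, -1.2, 0.02],
    [-1.1, 0.03, 0.8, -0.01],
    [1.0, -0.02, -1.0, 0.0],
]
scores = mean_abs_activation(calibration)
kept = select_dimensions(scores, keep=2)
print(kept)                                      # [0, 2]
print(prune_vector([5.0, 6.0, 7.0, 8.0], kept))  # [5.0, 7.0]
```

In a real model, the same index list would be used to slice every weight matrix that reads from or writes to the hidden dimension, so that all layers agree on which channels survive.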

Expert Pruning: The MoE Challenge

This is where MoE models diverge from traditional architectures. Experts need their own compression strategy because they represent specialized knowledge pockets.

The paper evaluated three ways to measure expert importance:

Frequency-based: How often does the router pick this expert? Experts that are chosen often are treated as important; experts that are rarely chosen become candidates for removal.

Soft-logits: Not just whether the expert is chosen, but how strongly it's chosen. An expert picked with strong confidence is probably doing something meaningful.

REAP (Router-weighted Expert Activation Pruning): A more sophisticated metric that combines the router's weight with the magnitude of the expert's actual output. It asks: "Not just how often, but how impactful when active?"
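The three metrics can disagree, and a toy example shows how. The sketch below assumes a routing log that records, for each token, the selected experts with their gate weight and the norm of their output. The aggregation choices (summing gate weights for soft-logits, averaging gate weight times output norm over the tokens routed to an expert for REAP) are assumptions made for illustration, and all names and numbers are hypothetical.

```python
def expert_scores(routing_log, n_experts):
    """Compute three importance scores per expert.

    routing_log: one entry per token; each entry is a list of
    (expert_id, gate_weight, output_norm) tuples for the selected experts.
    """
    counts = [0] * n_experts
    gate_sums = [0.0] * n_experts
    impact_sums = [0.0] * n_experts
    for token in routing_log:
        for expert, gate, out_norm in token:
            counts[expert] += 1
            gate_sums[expert] += gate
            impact_sums[expert] += gate * out_norm
    return {
        # How often the router picked each expert.
        "frequency": counts,
        # How strongly it was picked, summed over tokens.
        "soft_logits": gate_sums,
        # Average gate-weighted output magnitude when the expert is active.
        "reap": [
            impact_sums[e] / counts[e] if counts[e] else 0.0
            for e in range(n_experts)
        ],
    }

def least_important(scores, n_remove):
    """Indices of the n_remove lowest-scoring experts."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e])
    return sorted(ranked[:n_remove])

# Toy log: expert 2 is rarely chosen but produces a large output when it is.
log = [
    [(0, 0.7, 1.0), (1, 0.3, 1.0)],
    [(0, 0.6, 1.0), (1, 0.4, 1.0)],
    [(0, 0.8, 1.0), (2, 0.2, 9.0)],
    [(0, 0.5, 1.0), (1, 0.5, 1.0)],
]
scores = expert_scores(log, n_experts=3)
print(least_important(scores["frequency"], 1))    # [2]
print(least_important(scores["soft_logits"], 1))  # [2]
print(least_important(scores["reap"], 1))         # [1]
```

Here the frequency and soft-logits scores would both discard expert 2 because it is rarely selected, while the REAP-style score keeps it because its output is large whenever it fires.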

The Big Finding: Initialization Matters

One of the paper's most significant results answers a fundamental question:

Should you start with a pruned giant or a small model trained from scratch?

The answer, resoundingly: start with the pruned giant.

| Approach | Average Score | Key Insight |
|----------|--------------|-------------|
| Random init + knowledge distillation | 61.66 | Starting fresh, even with teacher guidance |
| Pruned + language modeling loss | 69.96 | Starting smart, self-study |
| Pruned + knowledge distillation | 73.45 | Starting smart + teacher guidance |

The pruned model recovered 86.5% of the teacher's performance while being compressed to a little over a quarter of its size (from 80 billion parameters to 23 billion, about 29%). Training also converged faster: the loss curves dropped more steeply and reached lower values.

Why? Because the pruned model already encodes useful knowledge. The remaining weights aren't random; they were shaped by an enormous training run. It's the difference between training someone from zero versus reassigning a proven employee to a leaner role.
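The table contrasts two training signals, and the difference is easy to see in code. The sketch below is a generic illustration for a single next-token prediction: the language modeling loss is cross-entropy against the one correct token, while the distillation loss here is a plain forward KL divergence from the teacher's distribution to the student's. The paper's exact loss (temperature, KL direction, any weighting) is not specified in this chapter, so treat those choices as assumptions.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lm_loss(student_logits, target_index):
    """Language modeling loss: cross-entropy against the true next token."""
    return -math.log(softmax(student_logits)[target_index])

def kd_loss(student_logits, teacher_logits):
    """Distillation loss: KL(teacher || student) over the whole vocabulary."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy 3-token vocabulary (made-up logits).
student = [2.0, 1.0, 0.0]
teacher = [1.5, 1.4, 0.0]
print(lm_loss(student, target_index=0))  # only the correct token matters
print(kd_loss(student, teacher))         # the teacher's full distribution matters
```

The language modeling loss only tells the student which token was correct. The distillation loss also tells it how plausible the teacher found every other token, which is the "teacher guidance" in the table.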

Applied Definitions

Calibration Dataset — A small set of representative data (1,024 samples in this paper) used to measure which parts of the model are doing the most work. Think of it as a diagnostic tool: run the model on some standard cases and observe which components are active.

Importance Score — A numerical value assigned to each component (dimension, layer, expert) that estimates its contribution to the model's output. Higher scores mean "keep this"; lower scores mean "consider removing."

One-Shot Pruning — Removing all target components at once in a single step. Fast, but can cause a sharp performance drop that needs significant recovery training.
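The last two definitions fit together naturally: one-shot pruning is simply a single pass over the importance scores. A minimal sketch follows, with hypothetical component names and scores.

```python
def one_shot_prune(importance, n_remove):
    """Remove the n_remove lowest-scoring components in a single step.

    importance: dict mapping a component name (dimension, layer, or expert)
    to its importance score. Returns (kept, removed) as lists of names.
    """
    ranked = sorted(importance, key=importance.get)  # lowest score first
    removed = set(ranked[:n_remove])
    kept = [name for name in importance if name not in removed]
    return kept, sorted(removed)

# Hypothetical importance scores for four experts.
importance = {
    "expert_0": 0.65,
    "expert_1": 0.40,
    "expert_2": 1.80,
    "expert_3": 0.05,
}
kept, removed = one_shot_prune(importance, n_remove=2)
print(kept)     # ['expert_0', 'expert_2']
print(removed)  # ['expert_1', 'expert_3']
```

Everything is decided from one set of scores, with no re-measurement between removals, which is why the step is fast and why the model may need recovery training afterward.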

What Comes Next

We've covered how to decide what to cut. But here's a twist: after enough training, different cutting strategies end up performing remarkably similarly. That raises an entirely different question: *does it actually matter which specific experts you remove?*

Chapter 4 picks up that thread and introduces the paper's most original contribution: a partial-preservation merging strategy that consistently improves performance.

Next: Chapter 4 — Merging and Preserving