Chapter 4: Merging and Preserving
The Partial-Preservation Strategy
Chapter Goals
By the end of this chapter, you will understand:
- Why the debate between expert pruning and expert merging is somewhat moot at scale
- The "partial-preservation" strategy proposed in the paper — and why it works
- The central tradeoff in expert compression: individual specialization vs. consolidation
- How merge decisions are actually made
The previous chapter covered how to measure which experts are important. But here's a counterintuitive finding from the SlimQwen paper:
After training with 400B tokens, it barely matters which experts you removed.
The Surprising Equalizer
The researchers tried multiple methods for deciding which experts to prune or merge:
- Frequency-based removal
- Soft-logits weighted selection
- REAP (router-weighted expert activation pruning)
- Merging with various similarity criteria
After a massive training run involving 400B tokens, no single approach dominated across all benchmarks. Some methods performed better on specific tasks — for instance, frequency-based methods scored slightly higher on BBH (a reasoning test) — but there was no clear winner overall.
Why? Because large-scale continual pretraining acts as a great equalizer. Given enough training, the model recovers and adapts regardless of which specific experts were removed initially. The training budget washes out the differences between individual one-shot compression choices.
This is both liberating and unsettling. Liberating because you don't need to obsess over the perfect pruning metric. Unsettling because it suggests that the initial compression choice matters less than what happens after.
Applied Definitions
Expert Merging: Instead of deleting an expert entirely, you combine its weights with another expert's weights. It's like asking two departing employees to hand off their knowledge to a single person rather than losing it.
Expert Pruning: Simply removing an expert from the model. Clean, but all that expert's knowledge is gone.
Cosine Similarity: A way of measuring how "alike" two experts are. Each expert is represented as a vector (for example, its outputs on a shared set of inputs), and the score is the cosine of the angle between the two vectors: close to 1 when they point in the same direction, close to 0 when they are unrelated. If two experts produce similar outputs for similar inputs, they're good candidates for merging. High cosine similarity means "these folks are doing similar work."
Partial-Preservation Strategy: The novel approach proposed in this paper, in which half of the target experts are kept intact and the other half are built by merging discarded experts into a second set of retained experts (the merge bases).
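To make the cosine-similarity definition concrete, here is a minimal sketch in plain Python. It assumes each expert can be summarized as a flat vector of numbers (for instance flattened weights or averaged outputs). Which representation the paper actually uses is not stated in this chapter, and the vectors below are toy values chosen for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


# Hypothetical expert vectors (toy values, not taken from the paper).
expert_a = [1.0, 2.0, 3.0]
expert_b = [2.0, 4.0, 6.0]   # same direction as expert_a, just scaled
expert_c = [3.0, 0.0, -1.0]  # orthogonal to expert_a (dot product is 0)

sim_ab = cosine_similarity(expert_a, expert_b)  # close to 1: good merge candidates
sim_ac = cosine_similarity(expert_a, expert_c)  # 0: unrelated directions
```

Note that scaling a vector does not change the score, so cosine similarity compares the direction of two experts rather than their magnitude.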
The Partial-Preservation Strategy
This is the paper's key innovation for expert compression. Here's how it works:
Step 1: Split the target in half
- The model needs to end up with, say, 256 experts (down from 512)
- Reserve half of those 256 slots (128 experts) for the "preserve" set
- Reserve the other half (128 experts) for the "merge base" set
Step 2: Keep the best ones intact
- Score all 512 experts by importance
- Keep the top 128 experts exactly as they are — no modification
- These are the model's star performers
Step 3: Merge the rest
- The remaining 384 experts (the middle tier and lower) need to be consolidated: 128 of them serve as merge bases, and the other 256 are discarded as standalone experts
- For each of the 128 merge bases, find the most similar discarded expert
- Combine them using a weighted average based on their importance scores
Step 4: Assemble the new set
- The final 256 experts = 128 preserved + 128 merged
- Each merged expert is a blend:
new = (weight_A / total) * expert_A + (weight_B / total) * expert_B
where total = weight_A + weight_B, so the two coefficients sum to 1.
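The four steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: experts are toy flat vectors, importance scores are positive numbers, the merge bases are taken to be the next-best experts after the preserved set, and a discarded expert may be picked by more than one base. The names `partial_preservation`, `blend`, and `cosine` are hypothetical helpers.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def blend(expert_a: Vector, weight_a: float, expert_b: Vector, weight_b: float) -> Vector:
    """Importance-weighted average of two experts (the blend formula above)."""
    total = weight_a + weight_b  # assumes positive scores, so total > 0
    return [(weight_a / total) * x + (weight_b / total) * y
            for x, y in zip(expert_a, expert_b)]


def partial_preservation(experts: List[Vector], scores: List[float], target: int) -> List[Vector]:
    if target % 2 != 0 or not 0 < target < len(experts):
        raise ValueError("target must be even and smaller than the expert count")
    half = target // 2
    # Step 2: rank experts by importance, highest first.
    ranked = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)
    preserved = ranked[:half]      # kept exactly as they are
    bases = ranked[half:target]    # ASSUMPTION: next-best experts act as merge bases
    discarded = ranked[target:]    # everything else
    # Step 3: each base absorbs its most similar discarded expert.
    merged = []
    for b in bases:
        d = max(discarded, key=lambda j: cosine(experts[b], experts[j]))
        merged.append(blend(experts[b], scores[b], experts[d], scores[d]))
    # Step 4: final set = preserved experts + merged experts.
    return [list(experts[i]) for i in preserved] + merged


# Toy example: 6 experts compressed to 4 (2 preserved + 2 merged).
experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0], [2.0, 2.0], [3.0, -3.0]]
scores = [8.0, 7.0, 6.0, 5.0, 2.0, 1.0]
compressed = partial_preservation(experts, scores, target=4)
```

In the toy run, the first two experts come through untouched, while expert `[1.0, 1.0]` (score 6) absorbs `[2.0, 2.0]` (score 2) and becomes `[1.25, 1.25]`, pulled only a quarter of the way toward the lower-scored expert.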
Why This Works: The Knowledge Tradeoff
The partial-preservation strategy addresses a real tension that all simpler approaches miss:
| Extreme Approach | Problem |
|---|---|
| Keep only the top experts | You preserve highly salient knowledge, but discard experts that might be complementary — specialists in niche but important areas |
| Merge everything | You homogenize expert specialization, blurring distinctions that took enormous training to develop |
The half-and-half approach offers a pragmatic middle ground. The top performers stay sharp. The rest get consolidated intelligently. In the paper's reported results, this improves scores on most major benchmarks compared to either pure pruning or pure merging.
A Concrete Analogy
Imagine a law firm with 512 lawyers shrinking down to 256:
- 128 star performers — keep them exactly as they are, untouched
- 128 merge anchors — keep them too, but assign them the job of absorbing departing colleagues' expertise
- 256 departing lawyers — their knowledge gets folded into the merge anchors they work most similarly to
The stars stay sharp. The anchors take on new specializations. Everyone else's expertise is absorbed rather than lost.
The Numbers
The paper's results showed partial preservation improving scores in most categories:
| Method | MMLU | GSM-8K | C-Eval |
|---|---|---|---|
| Merging without preservation | 69.05 | 71.08 | 71.08 |
| Merging with partial preservation | 69.28 | 74.18 | 71.15 |
The gains on MMLU (+0.23) and C-Eval (+0.07) are small, while GSM-8K improves by 3.10 points. All three benchmarks move in the same direction, and consistency across diverse benchmarks is the hallmark of a robust approach.
The Paper Chose "Half" for a Reason
The authors acknowledge that choosing exactly half is a "simple and symmetric design choice." Their intuition:
- Preserving too few experts (e.g., only 25%) → insufficient parameter inheritance, losing too much pretrained knowledge
- Preserving too many experts (e.g., 75%) → not enough room for consolidation, since too few merged experts remain to absorb the knowledge of the discarded ones
Roughly 50% hits a sweet spot that works across their evaluated settings. The paper notes this ratio might not be optimal for all configurations — it's a starting point, not a universal law.
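A quick calculation shows what the preserve ratio changes in the 512-to-256 example. The sketch below assumes the ratio is applied to the target expert count, as in Step 1. The function name `split_counts` and the "discarded per base" figure are illustrative, not quantities from the paper.

```python
from typing import Dict


def split_counts(total_experts: int, target_experts: int, preserve_ratio: float) -> Dict[str, float]:
    """Count preserved, merge-base, and discarded experts for a given ratio."""
    if not 0 < target_experts < total_experts:
        raise ValueError("target must be positive and smaller than the total")
    if not 0.0 <= preserve_ratio < 1.0:
        raise ValueError("preserve_ratio must be in [0, 1)")
    preserved = int(target_experts * preserve_ratio)
    merge_bases = target_experts - preserved
    discarded = total_experts - target_experts  # fixed by the compression target
    return {
        "preserved": preserved,
        "merge_bases": merge_bases,
        "discarded": discarded,
        # Average number of discarded experts available per merge base.
        "discarded_per_base": discarded / merge_bases,
    }


quarter = split_counts(512, 256, 0.25)
half = split_counts(512, 256, 0.50)
three_quarters = split_counts(512, 256, 0.75)
```

Under these assumptions the discarded count stays at 256 for every ratio. What changes is the balance: at 25% only 64 experts are inherited intact, and at 75% just 64 merge bases are left to represent all 256 discarded experts.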
What's Ahead
The model has survived compression and merged its experts. But can it actually perform? Chapter 5 covers the rehabilitation process: the training recipes, the distillation choices, and the surprising multi-token prediction technique that makes the compressed model not just smaller, but better.