
Chapter 4: Merging and Preserving

The Partial-Preservation Strategy


Chapter Goals

By the end of this chapter, you will understand:

  • Why the debate between expert pruning and expert merging is somewhat moot at scale
  • The "partial-preservation" strategy proposed in the paper — and why it works
  • The central tradeoff in expert compression: individual specialization vs. consolidation
  • How merge decisions are actually made

The previous chapter covered how to measure which experts are important. But here's a counterintuitive finding from the SlimQwen paper:

After training with 400B tokens, it barely matters which experts you removed.


The Surprising Equalizer

The researchers tried multiple methods for deciding which experts to prune or merge:

  • Frequency-based removal
  • Soft-logits weighted selection
  • REAP (router-weighted activation)
  • Merging with various similarity criteria
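As an illustration of the first item in the list, here is a minimal sketch of frequency-based scoring. It assumes a recorded routing trace (one list of selected expert ids per token) from some calibration data; the function names and the toy trace are hypothetical, not the paper's implementation.

```python
from collections import Counter

def frequency_scores(routing_decisions, num_experts):
    """Count how often each expert was selected across calibration tokens."""
    counts = Counter()
    for selected in routing_decisions:  # one list of expert ids per token
        counts.update(selected)
    return [counts.get(e, 0) for e in range(num_experts)]

def least_used(scores, num_to_remove):
    """Return the ids of the experts with the lowest selection counts."""
    # Ties are broken by expert id so the result is deterministic.
    order = sorted(range(len(scores)), key=lambda e: (scores[e], e))
    return sorted(order[:num_to_remove])

# Hypothetical routing trace: 4 tokens, top-2 routing over 4 experts.
routing = [[0, 1], [0, 2], [0, 1], [3, 0]]
scores = frequency_scores(routing, num_experts=4)
removed = least_used(scores, num_to_remove=2)
print(scores, removed)
```

The other criteria in the list swap in a different score per expert; the "rank, then remove or merge the lowest" step stays the same.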

After a massive training run involving 400B tokens, no single approach dominated across all benchmarks. Some methods performed better on specific tasks — for instance, frequency-based methods scored slightly higher on BBH (a reasoning test) — but there was no clear winner overall.

Why? Because large-scale continual pretraining acts as a great equalizer. Given enough training, the model recovers and adapts regardless of which specific experts were removed initially. The training budget washes out the differences between individual one-shot compression choices.

This is both liberating and unsettling. Liberating because you don't need to obsess over the perfect pruning metric. Unsettling because it suggests that the initial compression choice matters less than what happens after.


Applied Definitions

Expert Merging — Instead of deleting an expert entirely, you combine its weights with another expert's weights. It's like asking two departing employees to hand off their knowledge to a single person rather than losing it.

Expert Pruning — Simply removing an expert from the model. Clean, but all that expert's knowledge is gone.

Cosine Similarity — A way of measuring how "alike" two experts are. If two experts produce similar outputs for similar inputs, they're good candidates for merging. High cosine similarity means "these folks are doing similar work."

Partial-Preservation Strategy — The novel approach proposed in this paper where half of the target experts are kept intact, and the other half are constructed by merging discarded experts into them.
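To make the cosine-similarity definition concrete, here is a small sketch. It assumes each expert is summarized by a single vector (for example, flattened weights or averaged outputs on shared inputs); which representation the paper actually compares is not stated here, so treat the vectors as placeholders.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_u * norm_v)

# Hypothetical expert representations.
expert_a = [1.0, 2.0, 3.0]
expert_b = [2.0, 4.0, 6.0]   # same direction as expert_a, different scale
expert_c = [3.0, 0.0, -1.0]  # orthogonal to expert_a

sim_ab = cosine_similarity(expert_a, expert_b)
sim_ac = cosine_similarity(expert_a, expert_c)
print(sim_ab, sim_ac)
```

Because the measure ignores scale, expert_a and expert_b count as near-identical (similarity close to 1), while expert_a and expert_c share nothing (similarity close to 0).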


The Partial-Preservation Strategy

This is the paper's key innovation for expert compression. Here's how it works:

Step 1: Split the target in half

  • The model needs to end up with, say, 256 experts (down from 512)
  • Reserve the first half (128 experts) as the "preserve" set
  • Reserve the second half (128 experts) as the "merge base" set

Step 2: Keep the best ones intact

  • Score all 512 experts by importance
  • Keep the top 128 experts exactly as they are — no modification
  • These are the model's star performers

Step 3: Merge the rest

  • The remaining 384 experts (the middle tier and lower) need to be consolidated: 128 of them serve as merge bases, and the other 256 are discarded as standalone experts
  • For each of the 128 merge bases, find the most similar discarded expert
  • Combine them using a weighted average based on their importance scores

Step 4: Assemble the new set

  • The final 256 experts = 128 preserved + 128 merged
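  • Each merged expert is a blend: new = (weight_A / total) * expert_A + (weight_B / total) * expert_B, where weight_A and weight_B are the two experts' importance scores and total = weight_A + weight_B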
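The four steps can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's code: experts are plain lists of floats, importance scores are positive, the merge bases are taken to be the next-ranked experts after the preserved ones, and every discarded expert is folded into the base it is most similar to. The function name partial_preservation is hypothetical.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def partial_preservation(experts, scores, target):
    """Compress len(experts) experts down to `target` experts."""
    # Step 2: rank by importance (assumed positive) and keep the top half intact.
    ranked = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)
    half = target // 2
    preserved = ranked[:half]
    bases = ranked[half:target]   # assumption: next-ranked experts act as merge bases
    discarded = ranked[target:]

    # Step 3: assign each discarded expert to its most similar merge base.
    groups = {b: [b] for b in bases}
    for d in discarded:
        best = max(bases, key=lambda b: cosine_similarity(experts[d], experts[b]))
        groups[best].append(d)

    # Importance-weighted average of each base with the experts folded into it.
    merged = []
    for b in bases:
        members = groups[b]
        total = sum(scores[m] for m in members)
        merged.append([
            sum(scores[m] * experts[m][k] for m in members) / total
            for k in range(len(experts[b]))
        ])

    # Step 4: preserved experts first, merged experts after.
    return [experts[i] for i in preserved] + merged

# Toy example: 6 experts compressed to 4 (2 preserved + 2 merged).
experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0], [2.0, 2.0], [3.0, -3.0]]
scores = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
result = partial_preservation(experts, scores, target=4)
print(result)
```

In the toy run the two top-scored experts come back unchanged, while each merge base is pulled toward the discarded expert that points in the same direction. With exactly one discarded expert per base, the weighted average reduces to the two-expert blend described in Step 4.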

Why This Works: The Knowledge Tradeoff

The partial-preservation strategy addresses a real tension that all simpler approaches miss:

Extreme Approach | Problem
Keep only the top experts | You preserve highly salient knowledge, but discard experts that might be complementary: specialists in niche but important areas
Merge everything | You homogenize expert specialization, blurring distinctions that took enormous training to develop

The half-and-half approach offers a pragmatic middle ground. The top performers stay sharp. The rest get consolidated intelligently. The result consistently improves scores across major benchmarks compared to either pure pruning or pure merging.


A Concrete Analogy

Imagine a law firm with 512 lawyers shrinking down to 256:

  • 128 star performers — keep them exactly as they are, untouched
  • 128 merge anchors — keep them too, but assign them the job of absorbing departing colleagues' expertise
  • 256 departing lawyers — their knowledge gets folded into the merge anchors they work most similarly to

The stars stay sharp. The anchors take on new specializations. Everyone else's expertise is absorbed rather than lost.


The Numbers

The paper's results showed partial preservation improving scores in most categories:

Method                             | MMLU  | GSM-8K | C-Eval
Merging without preservation       | 69.05 | 71.08  | 71.08
Merging with partial preservation  | 69.28 | 74.18  | 71.15

These aren't massive jumps on most individual metrics (the largest is roughly 3 points, on GSM-8K), but they're consistent improvements, and consistency across diverse benchmarks is the hallmark of a robust approach.
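For a quick read of the size of each gain, the per-benchmark differences can be computed directly from the table. The snippet below only restates the numbers shown above; the variable names are illustrative.

```python
# Scores copied from the table above.
without_preservation = {"MMLU": 69.05, "GSM-8K": 71.08, "C-Eval": 71.08}
with_preservation = {"MMLU": 69.28, "GSM-8K": 74.18, "C-Eval": 71.15}

# Gain from partial preservation on each benchmark, in points.
deltas = {
    bench: round(with_preservation[bench] - without_preservation[bench], 2)
    for bench in without_preservation
}
all_improved = all(d > 0 for d in deltas.values())
print(deltas, all_improved)
```

Every difference is positive, with GSM-8K accounting for most of the gain and MMLU and C-Eval moving by less than a quarter of a point each.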


The Paper Chose "Half" for a Reason

The authors acknowledge that choosing exactly half is a "simple and symmetric design choice." Their intuition:

  • Preserving too few experts (e.g., only 25%) → insufficient parameter inheritance, losing too much pretrained knowledge
  • Preserving too many experts (e.g., 75%) → not enough room for consolidation, because too few merge bases remain to absorb the discarded experts

Roughly 50% hits a sweet spot that works across their evaluated settings. The paper notes this ratio might not be optimal for all configurations — it's a starting point, not a universal law.
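To see what the ratio changes in practice, the sketch below works out the split for the 512-to-256 example at three preserve ratios. The helper split_counts is hypothetical, and the "discarded per base" figure assumes the discarded experts are spread evenly across merge bases.

```python
def split_counts(total_experts, target_experts, preserve_ratio):
    """Sizes of each group for a given share of preserved experts."""
    preserved = round(target_experts * preserve_ratio)
    merge_bases = target_experts - preserved
    discarded = total_experts - target_experts
    return {
        "preserved": preserved,
        "merge_bases": merge_bases,
        "discarded": discarded,
        # Average number of discarded experts each base must absorb.
        "discarded_per_base": discarded / merge_bases,
    }

splits = {ratio: split_counts(512, 256, ratio) for ratio in (0.25, 0.5, 0.75)}
for ratio, counts in splits.items():
    print(ratio, counts)
```

The final expert count is 256 in every case. What moves is how many experts stay untouched and how many discarded experts each merge base has to absorb: about 1.3 at 25%, 2 at 50%, and 4 at 75%.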


What's Ahead

The model has survived compression and merged its experts. But can it actually perform? Chapter 5 covers the rehabilitation process: the training recipes, the distillation choices, and the surprising multi-token prediction technique that makes the compressed model not just smaller, but better.
