
Chapter 2: Mixture of Experts

How AI Learned to Specialize


Chapter Goals

By the end of this chapter, you will understand:

  • What "Mixture of Experts" (MoE) means and why it was invented
  • How experts, routers, and shared specialists work together inside a model
  • Why MoE models are especially tricky to compress
  • The three axes along which MoE models can be shrunk

To understand what SlimQwen does, you first need to understand the architecture it operates on: Mixture of Experts, or MoE for short.


What Is Mixture of Experts?

Imagine you walk into a hospital. You don't need every specialist — cardiologist, neurologist, orthopedic surgeon — examining you at once. Instead, a triage nurse assesses your symptoms and routes you to the right specialist. Sometimes you see one, sometimes two, but never the entire hospital staff.

That's Mixture of Experts.

In a traditional ("dense") language model, every single parameter is involved in processing every word (more precisely, every token, the sub-word unit a model actually reads). It's like having the entire hospital staff weigh in on every patient. MoE models change this: the model contains many "experts" (groups of parameters specialized for certain types of tasks), but for any given token, only a select few experts are activated.

The result? A model that can be enormously capable (because it has many experts on staff) while remaining efficient at runtime (because only a few experts work on each word).


Applied Definitions

Expert — A self-contained neural network module within the model that handles certain types of inputs. Think of it as a department in a company: one expert might specialize in mathematical reasoning, another in creative writing, another in code.

Router — The mechanism that decides which experts handle a given token. It's the triage nurse, the dispatcher, the manager who assigns work. The router looks at what the model is processing and says: "For this word, send it to Expert 3 and Expert 7."

Routed Experts — The specialists that only handle specific types of input chosen by the router. Not every token goes through every routed expert.

Shared Experts — A small set of experts that every token goes through, regardless of what the router decides. Think of them as the general practitioners every patient sees in addition to whichever specialists the triage nurse picks: their output is combined with the output of the routed experts.

Top-K Routing — For each token, the router picks exactly K experts out of the total pool. K is typically small: 1 or 2 in many models, and 10 in Qwen3-Next, where each token activates 10 routed experts (chosen from a pool of 512) plus 1 shared expert.
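The definitions above can be tied together in a minimal sketch. It is a toy illustration, not Qwen3-Next's actual code: the function names (`route_token`, `moe_layer`), the eight scaling "experts", and the hand-written router scores are all assumptions made for the example. In a real model the scores come from a small learned layer applied to the token, and renormalizing the weights over the chosen experts is one common option rather than a universal rule.

```python
import math

def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, router_scores, routed_experts, shared_experts, k):
    """Add the shared experts' output to the weighted top-k routed output."""
    out = sum(expert(x) for expert in shared_experts)  # always active
    for idx, weight in route_token(router_scores, k):
        out += weight * routed_experts[idx](x)  # only k routed experts run
    return out

# Toy setup (hypothetical): 8 routed experts and 1 shared expert,
# each reduced to a simple scaling function.
routed = [lambda x, s=s: s * x for s in range(1, 9)]
shared = [lambda x: 0.5 * x]
scores = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

chosen = route_token(scores, k=2)
print([i for i, _ in chosen])  # experts 1 and 4 have the highest scores
print(moe_layer(1.0, scores, routed, shared, k=2))
```

Note that the six unchosen experts are never called at all, which is where the runtime savings come from.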


The Numbers at Play

The model used in this paper, Qwen3-Next-80B-A3B, has:

  • 512 routed experts per layer, plus shared experts
  • Only 10 routed experts + 1 shared expert active for any given token
  • 48 transformer blocks (layers of processing)
  • A mix of full attention and linear attention mechanisms

That means the model has the capacity of an 80-billion-parameter system but, for each token, only activates roughly 3 billion parameters' worth of computation, which is what the "A3B" in its name denotes. It's an enormous bench where most of the players sit out for any given play.
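To make the "bench" concrete, here is a quick back-of-the-envelope calculation using only the expert counts listed above. The constant names are made up for the example. The share of parameters active per token is larger than the share of routed experts, because attention layers, embeddings, and the shared expert run for every token.

```python
ROUTED_EXPERTS = 512   # routed experts per layer (from the list above)
ACTIVE_ROUTED = 10     # routed experts the router selects per token
ACTIVE_SHARED = 1      # shared expert, always on

routed_fraction = ACTIVE_ROUTED / ROUTED_EXPERTS
idle_routed = ROUTED_EXPERTS - ACTIVE_ROUTED

print(f"Routed experts used per token: {routed_fraction:.2%}")  # 1.95%
print(f"Routed experts sitting out:    {idle_routed} of {ROUTED_EXPERTS}")
print(f"Experts doing work per layer:  {ACTIVE_ROUTED + ACTIVE_SHARED}")
```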


Why MoE Makes Compression Harder

In a traditional model, compression is somewhat straightforward: you're removing layers, shrinking dimensions, cutting heads. But in an MoE model, there's a whole new dimension: the experts themselves.

You can:

  1. Prune experts — remove some entirely (fire some departments)
  2. Merge experts — combine multiple experts into fewer ones (merge departments)
  3. Adjust the router — change how the model decides who gets the work

And here's the tricky part: when you fire or merge experts, you're not just losing capacity — you're potentially losing specialization. If Expert #412 was the only one that really understood quantum physics terminology, removing it means that knowledge doesn't simply transfer to others.
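The three operations listed above can be sketched on toy data. This is an illustration under stated assumptions, not SlimQwen's method: each expert is reduced to a short list of weights, "usage" is a hypothetical count of how often the router picked each expert, and the heuristics (keep the most-used experts, merge by usage-weighted averaging) are generic choices picked for simplicity.

```python
def prune_experts(experts, usage_counts, keep):
    """Keep the `keep` most-used experts; return them with their original indices."""
    ranked = sorted(range(len(experts)), key=lambda i: usage_counts[i], reverse=True)
    kept_ids = sorted(ranked[:keep])  # preserve the original ordering
    return [experts[i] for i in kept_ids], kept_ids

def merge_experts(group, usage_counts):
    """Merge several experts into one by usage-weighted averaging of their weights."""
    total = sum(usage_counts)
    size = len(group[0])
    return [sum(c * e[j] for e, c in zip(group, usage_counts)) / total
            for j in range(size)]

def prune_router(router_rows, kept_ids):
    """Adjust the router: drop the scoring rows of experts that no longer exist."""
    return [router_rows[i] for i in kept_ids]

# Four toy experts (hypothetical weights) and how often each was selected.
experts = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
usage = [50, 5, 30, 15]
router_rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

kept, kept_ids = prune_experts(experts, usage, keep=2)
print(kept_ids)                            # [0, 2]
print(prune_router(router_rows, kept_ids))

merged = merge_experts([experts[1], experts[3]], [usage[1], usage[3]])
print(merged)                              # [3.0, 0.25]
```

The merged expert lands much closer to the heavily used expert 3 than to the rarely used expert 1, which shows the specialization problem in miniature: whatever expert 1 uniquely contributed is mostly averaged away.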


The Three Axes of Compression

The paper explores MoE compression across three dimensions:

Dimension | What It Means                               | Analogy
Depth     | Removing entire layers (transformer blocks) | Reducing the number of management layers in a company
Width     | Shrinking the size of hidden dimensions     | Making each department smaller while keeping them all
Experts   | Removing or merging expert modules          | Consolidating specialized teams

Each axis affects the model differently, and the challenge — which SlimQwen tackles head-on — is figuring out the smartest way to cut across all three simultaneously.
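A rough way to see how the three axes interact is to count parameters before and after each cut. The sketch below is a simplification under stated assumptions: `MoEConfig` and `expert_params` are hypothetical names, only routed-expert weights are counted (attention, embeddings, and the router are ignored), each expert is assumed to have three projection matrices, and the `hidden` and `expert_hidden` sizes are illustrative values rather than figures taken from the paper.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class MoEConfig:
    layers: int         # depth: number of transformer blocks
    hidden: int         # width: size of the hidden dimension
    experts: int        # routed experts per layer
    expert_hidden: int  # inner size of each expert's feed-forward block

def expert_params(cfg):
    """Rough count of routed-expert parameters only."""
    per_expert = 3 * cfg.hidden * cfg.expert_hidden  # three projections assumed
    return cfg.layers * cfg.experts * per_expert

# 48 layers and 512 experts come from the chapter; the other sizes are assumed.
base = MoEConfig(layers=48, hidden=2048, experts=512, expert_hidden=512)

variants = {
    "depth":     replace(base, layers=base.layers // 2),
    "width":     replace(base, hidden=base.hidden // 2),
    "experts":   replace(base, experts=base.experts // 2),
    "all three": replace(base, layers=base.layers // 2,
                         hidden=base.hidden // 2,
                         experts=base.experts // 2),
}

for name, cfg in variants.items():
    ratio = expert_params(cfg) / expert_params(base)
    print(f"{name:10s} keeps {ratio:.3f} of the expert parameters")
```

In this simplified count, halving any single axis removes the same number of parameters, yet each cut removes something different: processing steps, per-token capacity, or specialists. Combining all three compounds quickly, leaving one eighth of the original expert parameters.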


Next: Chapter 3 — The Art of the Cut
