Chapter 5: The Hotel with Specialty Wings

Or: Why activating 8 out of 256 experts is cheaper than activating all of them

Most neural networks, by default, are generalists: every parameter activates for every input. When you ask a dense 70-billion-parameter model "what is the capital of Mongolia," all 70 billion of its parameters fire, including the ones that handle French grammar and the ones that encode knowledge of marine biology. Most of that computation is wasted.

This is the inefficiency that Mixture of Experts — MoE — was designed to solve. Instead of one giant generalist network, MoE creates many smaller specialist networks — "experts" — and a router that decides which experts to activate for each input.

The result is a model that has the capacity of a much larger network while only paying the cost of a much smaller one.

The Generalist Problem

A dense transformer, the standard architecture behind most large language models, applies essentially every parameter to every token. Whether you are generating the word "the" or "constitutional," every weight matrix in the model takes part in the computation. The model does not discriminate between simple and complex tokens. It does not skip the parts that are irrelevant to the current word.

This is like a hospital that staffs every department for every patient. A patient with a broken arm gets checked in by a cardiologist, a neurologist, a dermatologist, and an orthopedist — because the hospital employs all of them and they all have to work on every case. It is comprehensive. It is also wildly inefficient.

Most of the time, most of a model's parameters are computing things that do not contribute to the output. They are dead weight. The model is powerful because it has many parameters. But it is expensive because it uses all of them for everything.

The Specialist Approach

MoE solves this by partitioning much of the model's parameters into groups called experts. In a transformer, the experts typically replace the feed-forward block inside each layer, while the attention layers stay shared by every token. Each expert is a neural network in its own right, smaller than the full model and specialized through training to handle certain types of inputs. A small router network examines each incoming token and decides which experts are most relevant.

DeepSeek, the Chinese AI lab behind some of the most advanced MoE models, uses 256 routed experts in each MoE layer of its V3 architecture, plus one shared expert that every token passes through. For every token, the router activates 8 of the routed experts. Of the model's 671 billion total parameters, about 37 billion, or roughly 5.5 percent, are used for any single token. The remaining 94 percent or so stay in memory but do no work for that token.

The result is a model with the capacity of a 671-billion-parameter network whose compute cost per token is close to that of a 37-billion-parameter network. The capacity is there when needed (for complex reasoning, multi-step inference, or rare knowledge), but the cost is proportional to what is actually used.
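The arithmetic behind these ratios is simple enough to check by hand. The sketch below uses the publicly reported DeepSeek-V3 figures (671 billion total parameters, about 37 billion active per token, 8 of 256 routed experts chosen per token); the constant and function names are illustrative only.

```python
# Publicly reported DeepSeek-V3 figures, taken as assumptions for this sketch.
TOTAL_PARAMS = 671e9     # all parameters held in memory
ACTIVE_PARAMS = 37e9     # parameters that do work for one token
ROUTED_EXPERTS = 256     # routed experts in each MoE layer
ACTIVE_EXPERTS = 8       # routed experts the router picks per token


def active_fraction(active: float, total: float) -> float:
    """Share of the whole that does work for a single token."""
    return active / total


param_share = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
expert_share = active_fraction(ACTIVE_EXPERTS, ROUTED_EXPERTS)

print(f"parameters active per token:     {param_share:.1%}")
print(f"routed experts active per token: {expert_share:.1%}")
```

The parameter share comes out higher than the expert share because some parts of the model, such as attention and the shared expert, run for every token regardless of routing.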

This is the hotel analogy. A hotel with 256 specialty wings — one for business travelers, one for families, one for conference attendees, one for weddings — can handle any guest type. But on a given night, only the wings relevant to that night's guests are staffed. The hotel's total capacity is enormous, but its operating cost is proportional to utilization.

The Router

The router is the hidden intelligence of an MoE model. It is a small neural network, often just a single linear layer in each MoE layer and tiny next to the hundreds of billions of parameters in the experts, that learns to map tokens to experts.

The router does not know what the experts "know" in any human sense. It is trained end-to-end with the rest of the model. Over the course of training, each expert develops a specialization: some become good at syntax, others at factual recall, others at mathematical reasoning, others at code generation. The router learns which expert to call for which token type.

This is not manually programmed. It emerges from training. The router discovers that certain hidden patterns in the input — a particular grammatical structure, a domain-specific vocabulary, a certain style of reasoning — map to certain experts. It routes accordingly.
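To make the routing step concrete, here is a minimal sketch of one common textbook form: score every expert, keep the top k, and mix their outputs by renormalized weight. It is not any particular model's implementation. The names `route` and `moe_layer`, the toy dimensions, and the stand-in "experts" that merely scale their input are all assumptions made for illustration. Real routers differ in details such as the scoring function.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(token, router_weights, k):
    """Return [(expert_index, gate_weight), ...] for the k best-scoring experts."""
    # One score per expert: dot product of the token with that expert's router row.
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, router_weights, experts, k):
    """Weighted sum of the outputs of only the selected experts."""
    out = [0.0] * len(token)
    for i, gate in route(token, router_weights, k):
        y = experts[i](token)  # only the k chosen experts ever run
        out = [o + gate * v for o, v in zip(out, y)]
    return out


# Toy setup: 8 experts, 2 active per token, 4-dimensional tokens.
random.seed(0)
DIM, NUM_EXPERTS, TOP_K = 4, 8, 2
router_weights = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
# Hypothetical stand-in experts: each one simply scales its input.
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
chosen = route(token, router_weights, TOP_K)
output = moe_layer(token, router_weights, experts, TOP_K)
print(chosen)
print(output)
```

In a trained model the router weights are learned along with the experts, so the scores reflect patterns found during training rather than random numbers.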

The challenge is load balancing. If the router routes every token to the same three experts, 253 experts sit idle and the model's capacity is wasted. Training must actively counteract this, typically with an auxiliary loss that penalizes uneven routing (DeepSeek-V3 instead reports adjusting a per-expert bias on the routing scores), so that tokens are distributed across experts roughly evenly. This is one of the hardest engineering problems in MoE: keeping all experts utilized without forcing them to be generalists.
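One widely used way to express such a penalty, in the style of the Switch Transformer auxiliary loss, multiplies the fraction of tokens each expert receives by the average probability the router gives it, sums over experts, and scales by the number of experts. The sketch below assumes one expert per token for simplicity; the function name and the toy inputs are illustrative, and other balancing schemes exist.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Auxiliary balance penalty: num_experts * sum_i f_i * p_i.

    router_probs: one probability list per token (each sums to 1).
    assignments:  the expert index each token was actually sent to.
    """
    num_tokens = len(router_probs)
    # f[i]: fraction of tokens dispatched to expert i.
    f = [0.0] * num_experts
    for expert in assignments:
        f[expert] += 1.0 / num_tokens
    # p[i]: mean router probability assigned to expert i.
    p = [sum(row[i] for row in router_probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Four tokens spread evenly over four experts.
balanced = load_balance_loss(
    router_probs=[[0.25, 0.25, 0.25, 0.25]] * 4,
    assignments=[0, 1, 2, 3],
    num_experts=4,
)

# Four tokens all sent, with full confidence, to expert 0.
collapsed = load_balance_loss(
    router_probs=[[1.0, 0.0, 0.0, 0.0]] * 4,
    assignments=[0, 0, 0, 0],
    num_experts=4,
)

print(balanced, collapsed)
```

The penalty is lowest when routing is even and grows as tokens pile onto a few experts, which is what nudges the router back toward balance during training.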

The Communication Bottleneck

MoE introduces a problem that dense models do not have: communication between experts.

In a dense model, all parameters live on the same chip or group of chips. Computing the output is a matter of loading the parameters and running the math. In an MoE model, different experts may live on different chips. When the router activates experts that are spread across the hardware, the input must be sent to each expert's chip, and the results must be gathered back.

This all-to-all communication pattern is expensive. Moving data between chips is slower than reading from local memory. For very large MoE models deployed across hundreds of chips, the communication cost can dominate — to the point where the savings from sparse activation are eaten by the overhead of routing.
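A rough way to see where the traffic comes from is to count how many other chips a single token must visit. The sketch below assumes a deliberately naive placement in which experts are packed onto devices in index order; the function names, the device count, and the example expert choices are all hypothetical.

```python
def device_of(expert_id: int, experts_per_device: int) -> int:
    """Naive placement: experts are packed onto devices in index order."""
    return expert_id // experts_per_device


def remote_sends(chosen_experts, home_device: int, experts_per_device: int) -> int:
    """Number of distinct other devices this token's data must be sent to."""
    devices = {device_of(e, experts_per_device) for e in chosen_experts}
    devices.discard(home_device)  # experts on the token's own device are free
    return len(devices)


# Hypothetical layout: 256 experts over 32 devices, so 8 experts per device.
EXPERTS_PER_DEVICE = 256 // 32
chosen = [3, 5, 40, 41, 100, 130, 200, 255]  # 8 experts picked for one token

remote = remote_sends(chosen, home_device=0, experts_per_device=EXPERTS_PER_DEVICE)
print(remote)
```

Every remote send also needs a return trip for the result, and this happens for every token in every MoE layer, which is why placement and routing that keep a token's experts on few devices matter so much.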

DeepSeek invested heavily in minimizing this overhead through careful engineering: expert placement, communication scheduling, and hardware-aware routing. The fact that MoE works as well as it does in practice is a testament to the engineering, not the architecture alone.

The Memory Wall Connection

MoE does not escape the memory wall. It changes the shape of the problem.

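Because all 256 experts in each layer must be kept in memory (you never know which ones the router will call), the model's memory footprint is still proportional to its total parameter count. The B200 still has to load expert parameters from memory, and in batched serving, where different tokens are routed to different experts, most experts end up being read at every step. The main saving is therefore not in memory but in compute: only 8 routed experts do math per token, so with about 37 billion of 671 billion parameters active, the compute per token is roughly 94 percent lower than for a dense model of the same total size.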

This means MoE is most beneficial when the workload is compute-bound — when the bottleneck is the math, not the memory. For inference workloads, which are usually memory-bound, MoE helps less than you might expect. The memory bandwidth — the football-field pantry — is still the constraint.
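The split between what MoE saves and what it does not can be put in numbers. The sketch below rests on two stated assumptions: a common rule of thumb of about 2 floating-point operations per active parameter per token for a forward pass, and 1 byte per parameter (8-bit weights). The parameter counts are the publicly reported DeepSeek-V3 figures; the function names are illustrative.

```python
TOTAL_PARAMS = 671e9    # reported total parameters (assumption for this sketch)
ACTIVE_PARAMS = 37e9    # reported active parameters per token
BYTES_PER_PARAM = 1     # assumes 8-bit weights


def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in gigabytes."""
    return total_params * bytes_per_param / 1e9


def flops_per_token(active_params: float) -> float:
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per active parameter."""
    return 2 * active_params


memory_gb = weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM)
moe_flops = flops_per_token(ACTIVE_PARAMS)
dense_flops = flops_per_token(TOTAL_PARAMS)  # a dense model of the same size
compute_saving = 1 - moe_flops / dense_flops

print(f"weights in memory: {memory_gb:.0f} GB (same as a dense model of this size)")
print(f"compute per token: {moe_flops / 1e9:.0f} GFLOPs vs {dense_flops / 1e9:.0f} GFLOPs dense")
print(f"compute saving:    {compute_saving:.1%}")
```

The memory line does not move when experts are made sparser; only the compute line does.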

But MoE is enormously valuable for a different reason. It lets model builders scale up parameter counts without scaling up compute costs proportionally. A 671-billion-parameter MoE model such as DeepSeek-V3 needs roughly the per-token compute of a 37-billion-parameter dense model, while drawing on far more stored capacity for complex tasks. This is why MoE has become one of the most widely used architectures for frontier models: it is a very efficient way to spend the compute budget.

And it points toward a future where models get smarter without getting proportionally more expensive — which is precisely what the waiting game is all about.