Introduction: The Kitchen is Still Waiting

If you read The Waiting Game, you already know the problem. It's physics. No one is going to invent a faster speed of light. No one is going to make copper a better conductor. The memory wall is real, and it's not going anywhere.

A GPU is like a world-class chef working in a kitchen where the pantry is on the other side of a football field. The chef can chop, sauté, and plate faster than anyone alive — but spends most of their time walking back and forth to get ingredients. The AI doesn't wait for the math to finish; the AI waits for the memory to arrive.

Every time a language model generates a single word (a "token"), it needs to load its entire brain, billions of numbers representing learned patterns, from memory into the compute units. That's gigabytes of data moved for a tiny calculation per number. The compute units sit idle, waiting. When a GPU generates tokens for one request at a time, its arithmetic utilization is often below 10%. More than ninety percent of the chip's math capacity goes unused while it waits for data. This is the memory wall, and it's the single most important constraint in AI today.
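To put rough numbers on that claim, here is a back-of-envelope sketch. Every figure in it is an illustrative assumption rather than a measurement of any real chip: a hypothetical 7-billion-parameter model stored at 2 bytes per number, memory that delivers 1 terabyte per second, and compute units that peak at 100 trillion operations per second. It also assumes about two operations (one multiply, one add) per number per token, and it ignores everything except reading the weights.

```python
def decode_step_estimate(params, bytes_per_param, mem_bandwidth, peak_flops):
    """Rough per-token cost for single-request generation.

    All inputs are assumptions supplied by the caller; this ignores the
    KV cache, activations, and any overlap of compute with memory traffic.
    """
    weight_bytes = params * bytes_per_param
    t_memory = weight_bytes / mem_bandwidth        # seconds spent fetching weights
    t_compute = (2 * params) / peak_flops          # seconds spent on arithmetic
    step_time = max(t_memory, t_compute)           # the slower side sets the pace
    utilization = t_compute / step_time            # fraction of time the math units are busy
    return t_memory, t_compute, utilization


# Hypothetical numbers, chosen only to be round.
t_memory, t_compute, utilization = decode_step_estimate(
    params=7e9,            # 7 billion numbers in the model
    bytes_per_param=2,     # 16-bit storage
    mem_bandwidth=1e12,    # 1 TB/s from memory
    peak_flops=1e14,       # 100 trillion operations per second
)

print(f"fetching weights: {t_memory * 1e3:.2f} ms per token")
print(f"doing the math:   {t_compute * 1e3:.2f} ms per token")
print(f"utilization:      {utilization:.1%}")
```

Under these assumptions the chef spends about 14 milliseconds walking to the pantry for every 0.14 milliseconds of cooking, which works out to roughly 1% utilization. Real systems land on different numbers, but the lopsidedness is the point.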

The Waiting Game established this physics. It showed why faster chips don't solve the problem, why the KV cache grows without bound, why batch sizes have a U-shaped cost curve. It painted the problem in clear, unforgiving terms.

This book is about the solutions.

Because here's the thing about walls: people build doors through them. They build ladders over them. They build tunnels under them. And sometimes — when the wall is truly impassable — they build winches and pulleys and elaborate contraptions to get their supplies across without making the trip themselves.

This book is a catalog of those contraptions. Each chapter covers one family of techniques that researchers and engineers have invented to get more tokens per second out of the same hardware. Some are algorithmic tricks. Some are architectural reinventions. Some are so counterintuitive they sound like cheating — and in a sense, they are.

None of them violate the laws of physics. The memory is still slow. The compute units still wait. But by being smarter about when, how much, and what we move across that football field, we can make the waiting time effectively vanish.

We'll start with the simplest idea: what if instead of making the trip once per token, we hired an assistant to guess ahead — and then verified all the guesses at once?

Chapter 1: The Drafting Table