Introduction: The $30,000 Paperweight

A single NVIDIA B200 GPU costs $30,000. It can draw up to 1,000 watts of electricity, about what a window air conditioner uses running at full blast. It contains 208 billion transistors, roughly 25 for every human on Earth.

And it spends 98 percent of its time doing nothing.

Not nothing in the sense of having no work to do. Nothing in the sense of waiting. The GPU sits there, fully powered, consuming electricity, generating heat, and waiting for data to arrive from its own memory. The compute units are starved. The math has to wait for the ingredients.

This is not a bug. It is not a design flaw. It is physics, and it is the single most important fact about AI today.

Every decision in the AI industry — which chips get built, how models are designed, what APIs cost, why your chatbot has a context limit, why agentic AI is expensive in ways you don't expect — traces back to this one physical constraint. Memory cannot keep up with compute. The gap between how fast a chip can compute and how fast it can read from memory is widening every year. And the implications touch everything.

This book explains why that gap exists, how it shows up in practice, and what people are doing about it. It does not contain a single equation. It does not assume you know what a GPU is, what a transformer is, or what "attention" means in the AI sense. It assumes only that you have noticed AI is becoming important and expensive, and you'd like to understand why.

The road ahead:

The first chapter describes the physical reality — the memory wall itself, and why faster chips don't solve it. The second chapter explains the hidden cost of memory inside every AI model, and why longer conversations get expensive. The third chapter shows how the economics of serving AI requests map to this physical constraint. Chapters four through six cover the algorithmic workarounds — the tricks engineers use to squeeze more work out of the same memory. Chapter seven ties everything together and explains why this moment matters: the waiting game, why everything slows down before it gets faster, and what the memory wall means for the future of AI.

One note before we begin. The ideas in this book draw heavily on a conversation between Dwarkesh Patel and Reiner Pope. Pope was a TPU architect at Google and is now CEO of MatX, a chip startup focused on inference. His framework for thinking about inference economics is the backbone of what follows. Where specific insights come from him, the text will say so. But the goal is not an annotated transcript. It is a self-contained explanation, written for someone who wants to understand AI through the lens of physics and economics, not through the lens of a specific interview.

The first thing to understand is why a $30,000 chip spends most of its time waiting.