Sources
The framework and through-line of this book draw primarily from the Dwarkesh Patel podcast interview with Reiner Pope, former TPU architect at Google and now CEO of MatX. Pope's blackboard walkthrough of inference economics, covering the memory wall, KV caches, batch sizing, and MoE architecture, shapes the core argument of every chapter.
Reiner Pope / Dwarkesh Patel Interview
Dwarkesh Patel, "Reiner Pope on the Economics of AI Inference," Dwarkesh Podcast. Available at dwarkesh.com.
The Inference Shift
Ben Thompson, "Agents Over Bubbles," Stratechery, March 2026. Available at stratechery.com.
Hardware
NVIDIA, NVIDIA H100 Tensor Core GPU Architecture white paper, 2022.
NVIDIA, NVIDIA Blackwell Architecture Technical Overview, 2024.
Jouppi et al., "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings," ISCA 2023. arxiv.org/abs/2304.01433.
KV Cache and Attention
Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with I/O-Awareness," NeurIPS 2022. arxiv.org/abs/2205.14135.
Speculative Decoding
Leviathan et al., "Fast Inference from Transformers via Speculative Decoding," ICML 2023. arxiv.org/abs/2211.17192.
Cai et al., "MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads," ICML 2024. arxiv.org/abs/2401.10774.
Li et al., "EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test," arXiv 2025. arxiv.org/abs/2503.01840.
Mixture of Experts (MoE)
DeepSeek-AI, "DeepSeek-V3 Technical Report," arXiv 2024. arxiv.org/abs/2412.19437.
Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer," ICLR 2017. arxiv.org/abs/1701.06538.
Training and Scaling
Hoffmann et al., "Training Compute-Optimal Large Language Models" (Chinchilla), NeurIPS 2022. arxiv.org/abs/2203.15556.