Speculative Decoding and the Art of the Good Guess

Or: Why executives who sign off on a stack of memos get more done than those who read them one at a time

Here's the absurdity at the heart of AI inference. To generate a single token, say the word "water," a language model must stream billions of parameters from memory into its compute units. It does all that work, generates "water," and then makes the entire trip through memory again for the next token. At small batch sizes the arithmetic is cheap next to the data movement, so the hardware spends much of its time waiting. It's like a chef walking the length of a football field to grab a single basil leaf, walking back, then immediately walking the same distance for a single pinch of salt, then back again for a single clove of garlic.

The obvious question: why not grab multiple ingredients in one trip?

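That's the idea behind speculative decoding, and it's where the attack on the memory wall begins. Guessing ahead is not new in computing: branch prediction, caching, and prefetching have been standard in CPU design for decades. Applying it to neural network inference required one specific insight: a cheap drafter can guess several tokens ahead, and the full model can check all of those guesses in a single pass. Because the verifier checks every guess, a wrong guess costs some wasted work but never changes the output.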
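To make the draft-then-verify loop concrete, here is a minimal sketch of the greedy variant only. Everything in it is an assumption for illustration: the names (speculative_decode, greedy_decode, target_model, draft_model) are hypothetical, the "models" are toy arithmetic functions rather than neural networks, and verification is simulated with a Python loop where a real system would score all drafted positions in one batched forward pass. Sampling-based variants use a probabilistic accept/reject rule that is not shown here.

```python
from typing import Callable, List, Tuple

# A toy "model": maps the tokens so far to the next token (greedy choice).
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], num_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    num_new: int,
    k: int = 4,
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, verification passes)."""
    tokens = list(prompt)
    goal = len(prompt) + num_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The cheap drafter guesses k tokens ahead.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The target checks every guess. Counted as ONE pass because a
        #    real implementation would score all positions in one batch.
        target_passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Every guess was right; the same pass yields one extra token.
            accepted.append(target(tokens + proposal))
        tokens.extend(accepted)
    return tokens[:goal], target_passes


# Hypothetical toy models: the drafter agrees with the target except
# immediately after a 7, where it guesses wrong.
def target_model(seq: List[int]) -> int:
    return (seq[-1] * 2 + 1) % 10


def draft_model(seq: List[int]) -> int:
    return 0 if seq[-1] == 7 else (seq[-1] * 2 + 1) % 10


if __name__ == "__main__":
    baseline = greedy_decode(target_model, [1], 12)
    tokens, passes = speculative_decode(target_model, draft_model, [1], 12, k=4)
    print("same output as baseline:", tokens == baseline)
    print("verification passes:", passes, "vs baseline passes:", 12)
```

Two things are worth noticing in this sketch. The output is identical to plain greedy decoding with the target alone, because every accepted token is one the target would have chosen anyway. And each verification pass yields at least one token, so even a drafter that is always wrong cannot make the target run more passes than the baseline, although the drafter's own work is then wasted.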