The Draft Model: The Junior Chef
The core idea is simple, and it has nothing to do with AI. It's about delegation.
Imagine you're a busy executive who needs to approve a hundred routine memos. Each memo takes a minute to read and thirty seconds to sign. At that rate, the pile takes 150 minutes. But there's a bottleneck: you're the only one who can sign, and you keep getting interrupted.
So you hire a junior assistant who's pretty good at guessing which memos you'd approve. The assistant reads all a hundred, signs the obvious ones, and only brings the uncertain ones to you. For the memos the assistant got right, you never even see them — they're already done. You read the filtered stack and sign the few that need your judgment.
In the best case, the assistant gets ninety right and brings you ten. Your 150-minute pile just became a 15-minute task. And crucially: if the assistant makes a mistake and signs something you wouldn't have, you catch it. The assistant's work is provisional. You're the final word.
This is speculative decoding. A small, fast "draft model" generates several candidate tokens in a single pass. Then the big, expensive "target model" runs once to verify all of them at once. Any token the target model agrees with is accepted. Any token it disagrees with is rejected — and everything guessed after that point goes with it.
The key insight: that single verification pass costs about the same as generating one token normally. You load the entire model from memory exactly once. But you get 2, 3, or even 6 tokens of output from that single load. You've amortized the cost of the memory trip across multiple tokens.
The draft model doesn't need to be perfect. It just needs to be right often enough. Even a 50% acceptance rate on a 4-token draft doubles the throughput. And here's the critical guarantee: the verification step is mathematically lossless. The output distribution is identical to what the big model would have produced on its own, token by token. You lose nothing. You gain speed.
This lossless property is what makes speculative decoding different from approximation. It's not "produce the same answer but faster." It's "produce the exact same distribution of answers but faster." The assistant's guesses save trips to memory, but the chef never serves a dish they wouldn't have made themselves.