The Limits of Drafting
Speculative decoding, in all its forms, has a fundamental limitation. It optimizes only the decode phase: generating tokens after the prompt has been processed. The prompt itself still has to be processed in full before the first output token can appear, a step usually called prefill. As prompts grow to hundreds of thousands of tokens, that initial processing becomes a bottleneck of its own.
It also doesn't change the fact that the big model's weights still have to be read from memory for each verification pass. It just makes each load worth more tokens. The memory wall is still there; we've just found a way to carry more groceries per trip.
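To put rough numbers on the groceries-per-trip picture, here is a back-of-envelope sketch. It assumes decoding is purely memory-bandwidth-bound, so one verification pass costs about as long as streaming the weights once. The 70-billion-parameter model, 2-byte weights, and 3.35 TB/s of memory bandwidth are illustrative assumptions rather than figures from this text, and the cost of running the draft model is ignored.

```python
def pass_time_s(params: float, bytes_per_param: float,
                bandwidth_bytes_per_s: float) -> float:
    """Seconds to stream every weight once, assuming a purely
    bandwidth-bound pass (compute and cache traffic are ignored)."""
    return params * bytes_per_param / bandwidth_bytes_per_s


def tokens_per_second(tokens_per_pass: float, pass_seconds: float) -> float:
    """Throughput when each weight load yields `tokens_per_pass` tokens."""
    return tokens_per_pass / pass_seconds


# Illustrative assumptions, not measurements:
# 70e9 parameters, 2 bytes each, 3.35e12 bytes/s of memory bandwidth.
t = pass_time_s(70e9, 2, 3.35e12)

for k in (1.0, 2.0, 3.0, 4.0):
    # k = 1 is plain decoding; k > 1 is speculative decoding that
    # accepts k tokens per verification pass on average.
    print(f"{k:.0f} tokens/pass -> {tokens_per_second(k, t):.1f} tok/s")
```

Under these assumptions the time per trip never changes. Only the number of tokens carried per trip does, which is why the wall stays exactly where it was.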
And there's a subtlety: speculative decoding works best when the draft model is just good enough. If the draft model is too weak, acceptance rates collapse and you get no speedup. If it is too large, the time spent drafting eats into the gains, however accurate its guesses. Finding the sweet spot, the optimal draft model size and training regime, is as much art as science.
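The trade-off can be made concrete with a simplified model that is common in the speculative decoding literature. It assumes each drafted token is accepted independently with probability alpha, that gamma tokens are drafted per verification pass, and that one draft step costs a fraction c of one big-model pass. The three scenarios below use made-up numbers purely for illustration, and the function names are hypothetical.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens produced per verification pass, assuming each of
    the gamma drafted tokens is accepted independently with rate alpha."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Speedup over plain decoding. One cycle costs gamma draft steps
    (c each, relative to a big-model pass) plus one verification pass."""
    return expected_tokens_per_pass(alpha, gamma) / (gamma * c + 1.0)


def best_gamma(alpha: float, c: float, max_gamma: int = 16) -> int:
    """Draft length in 1..max_gamma that maximizes the modeled speedup."""
    return max(range(1, max_gamma + 1),
               key=lambda g: expected_speedup(alpha, g, c))


# Hypothetical (alpha, c) pairs, chosen only to illustrate the trade-off.
scenarios = {
    "too weak": (0.10, 0.10),
    "just good enough": (0.80, 0.05),
    "too large": (0.95, 0.50),
}

for name, (alpha, c) in scenarios.items():
    g = best_gamma(alpha, c)
    print(f"{name}: best gamma={g}, "
          f"speedup={expected_speedup(alpha, g, c):.2f}x")
```

In this model the weak draft barely breaks even, and the large draft loses most of its high acceptance rate to its own cost. Real acceptance rates vary token by token, so treat this as a way to reason about the sweet spot rather than a predictor of measured speedups.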
Nevertheless, speculative decoding is among the most widely deployed memory-wall workarounds in production today, and major inference stacks ship some variant of it. It works, it's lossless when verification is exact, and it requires no hardware changes.
But to attack the wall from the other side — the prompt processing side — we need a completely different set of ideas. Which brings us to the most radical approach of all.