The Common Thread
Every variant of speculative decoding — classic, MEDUSA, EAGLE-1/2/3, MTP — exploits the same insight: the bottleneck is memory bandwidth, not compute.
Each verification pass loads the model from memory once and checks multiple tokens. You pay for one memory trip and get multiple tokens validated. The numbers vary, but the economics are universal:
- Vanilla decoding: 1 token per memory load
- Classic speculative decoding (50% acceptance): ~2 tokens per load
- MTP (85% acceptance): ~2.8 tokens per load
- EAGLE-3: 6--7 tokens per load
Each improvement doubles or triples the effective utilization of the GPU's memory subsystem. And because the verifier guarantees correctness, every token is exactly what the original model would have produced.
It's free performance. Or, more accurately, it's paid for in algorithmic cleverness rather than silicon.