MEDUSA: Train Extra Heads Instead of a Second Model
One obvious problem with speculative decoding: you need a draft model. That's a whole second neural network to train, deploy, and maintain. Where does it come from? How do you keep its weights in sync with the target model's?
Researchers at the University of Texas proposed an elegant solution in early 2024: MEDUSA. Instead of a separate draft model, add multiple small "draft heads" directly onto the target model. A head is just a tiny neural network — a few million parameters compared to the billions in the main model. You train the heads to predict the next few tokens, using the internal representations the main model already computes for free.
The analogy: a chef who, while chopping an onion, also mutters "and after this I'll need salt, then pepper, then olive oil" — and has trained themselves to be right about that forecast 80% of the time. The chef doesn't need a separate assistant. The forecasting is built in. It's a small, learned habit layered on top of existing skills.
MEDUSA achieved 1.8--2.3x speedups on standard models with minimal training overhead. More importantly, it established the pattern that would characterize all subsequent work: use the model's own brain to guess ahead, rather than outsourcing to a separate, unfamiliar one.