EAGLE-3: Six Tokens at Once, Losslessly

EAGLE-3, published in March 2025, broke through that ceiling with two key changes.

First: direct token prediction with multi-layer fusion. EAGLE-3 returned to predicting tokens directly (like MEDUSA), but with a crucial upgrade: it grabbed information from multiple layers of the target model — low-level patterns, mid-level concepts, high-level reasoning — and fused them together. The draft model wasn't just reading the target's final output; it was reading the target's mind at every level of abstraction.
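To make the fusion idea concrete, here is a minimal sketch. It assumes the simplest plausible scheme: concatenate one hidden-state vector from a low, a middle, and a high layer, then project the result back down to a single vector with a linear map. The text above only says the layers are "fused", so the function name `fuse_layers`, the toy hidden size, and the random weights are all illustrative assumptions, not EAGLE-3's actual code.

```python
import random

def fuse_layers(low, mid, high, weight):
    """Concatenate three hidden-state vectors and project them to one vector.

    `weight` has one row per output dimension; each row has
    len(low) + len(mid) + len(high) entries.
    """
    concat = low + mid + high  # list concatenation, length 3 * d
    return [sum(w * x for w, x in zip(row, concat)) for row in weight]

d = 4  # toy hidden size (hypothetical)
rng = random.Random(0)

# Stand-ins for hidden states taken from three depths of the target model.
low = [rng.gauss(0, 1) for _ in range(d)]
mid = [rng.gauss(0, 1) for _ in range(d)]
high = [rng.gauss(0, 1) for _ in range(d)]

# A d x 3d projection; in a real system this would be learned, not random.
weight = [[rng.gauss(0, 0.1) for _ in range(3 * d)] for _ in range(d)]

fused = fuse_layers(low, mid, high, weight)
print(len(fused))  # the fused vector has the size of a single layer's state
```

The point of the projection is that the draft model still receives one vector per position, but that vector now carries information from several depths of the target rather than only its final layer.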

Second: training-time test. In earlier versions, the draft model was trained on ground-truth data: at every step it was shown the correct preceding context and asked to predict what came next. Clean, simple, familiar. But at inference time, the draft model's own predictions were fed back into itself, and errors accumulated. A small mistake at position 2 would compound by position 4. The draft model was being trained on a pristine highway and deployed on a muddy dirt road.

EAGLE-3 changed this. During training, it fed the draft model's own predictions back into itself as input for subsequent steps, with an adjusted attention mask that simulated the real inference-time behavior. The draft model learned to recover from its own mistakes. After this change, accuracy stopped plateauing and started scaling — more training data directly translated to higher acceptance rates.
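The difference between the two training regimes can be sketched with a toy example. Everything here is hypothetical: `toy_draft` is a deliberately flawed stand-in for a draft model, and `build_training_steps` only shows what context the draft sees at each step. It leaves out the adjusted attention mask and the loss computation entirely.

```python
def toy_draft(prefix):
    """Hypothetical draft model: predicts last + 1, except that it
    wrongly repeats the last token whenever that token is a multiple of 3."""
    last = prefix[-1]
    return last if last % 3 == 0 else last + 1

def build_training_steps(ground_truth, prefix_len, num_steps, self_feed):
    """Return (context, prediction, target) for each draft step.

    self_feed=False: the context is extended with the correct token
                     (training on ground truth only).
    self_feed=True:  the context is extended with the draft's own
                     prediction, as happens at inference time.
    """
    prefix = list(ground_truth[:prefix_len])
    steps = []
    for k in range(num_steps):
        target = ground_truth[prefix_len + k]
        pred = toy_draft(prefix)
        steps.append((list(prefix), pred, target))
        prefix.append(pred if self_feed else target)
    return steps

truth = [1, 2, 3, 4, 5, 6]
clean = build_training_steps(truth, prefix_len=2, num_steps=3, self_feed=False)
noisy = build_training_steps(truth, prefix_len=2, num_steps=3, self_feed=True)

for (c_ctx, _, _), (n_ctx, _, target) in zip(clean, noisy):
    print("ground-truth context:", c_ctx, "| self-fed context:", n_ctx, "| target:", target)
```

In the self-fed run, the draft's mistake at the second step stays in the context for the third step, while the target is still the correct token. That mismatch is exactly what the draft model has to learn to cope with; the ground-truth run never shows it.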

The result: up to 6.5x speedup over vanilla decoding, with an average acceptance length of 6 to 7 tokens per verification round, up from EAGLE-2's 4. On LLaMA 3.1 8B, EAGLE-3 achieved 4.4x, roughly a 40% improvement over EAGLE-2's 3.16x.
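As a quick check on the percentage, the relative improvement follows directly from the two speedup figures quoted above. The helper name `relative_improvement` is just for illustration.

```python
def relative_improvement(new, old):
    """Fractional gain of `new` over `old`."""
    return (new - old) / old

# Speedups over vanilla decoding quoted in the text for LLaMA 3.1 8B.
gain = relative_improvement(4.4, 3.16)
print(f"{gain:.1%}")  # 39.2%
```

So "40%" is the rounded form of a gain of about 39%.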

The ceiling in earlier versions wasn't fundamental. It was artificial. The draft model could guess further ahead; it just needed to be trained in the conditions where it would actually operate.
