Speculative Prefill: Guessing the Input

Here's a technique that reverses the logic of everything we've discussed so far.

Speculative decoding (Chapter 1) guesses output tokens to accelerate generation. Symmetrically, speculative prefill guesses input tokens to accelerate prompt processing.

The mechanism: a small, fast model processes the prompt and produces a "draft" KV cache. The large target model then runs a verification pass, comparing the draft cache against what it would have computed, and recomputes only the parts that were wrong.
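To make the accept-or-recompute step concrete, here is a toy sketch in NumPy. It is not a real implementation: the function names, the relative-error test, and the `tol` threshold are all assumptions made for illustration, and the "target" values are simply given as a reference array. How a real system would verify a draft cache without paying for the full target computation is exactly the part this toy does not model.

```python
import numpy as np

def verify_draft_cache(draft_kv, reference_kv, tol=0.05):
    """Return a boolean mask marking positions whose draft entry is rejected.

    A position is rejected when the relative L2 error between the draft and
    reference vectors exceeds `tol` (a hypothetical threshold).
    """
    err = np.linalg.norm(draft_kv - reference_kv, axis=-1)
    scale = np.linalg.norm(reference_kv, axis=-1) + 1e-12
    return (err / scale) > tol

def repair_cache(draft_kv, reference_kv, rejected):
    """Keep accepted draft entries; overwrite rejected ones with target values."""
    repaired = draft_kv.copy()
    repaired[rejected] = reference_kv[rejected]
    return repaired

# Toy data: one vector per prompt position (shapes are illustrative only).
rng = np.random.default_rng(0)
seq_len, dim = 16, 8
reference_kv = rng.normal(size=(seq_len, dim))

# The draft is close almost everywhere, but badly wrong at two positions.
draft_kv = reference_kv + 0.001 * rng.normal(size=(seq_len, dim))
draft_kv[[3, 10]] += 1.0

rejected = verify_draft_cache(draft_kv, reference_kv)
repaired = repair_cache(draft_kv, reference_kv, rejected)

print("rejected positions:", rejected.nonzero()[0])
print("fraction recomputed:", rejected.mean())
```

In this toy run only the two deliberately corrupted positions are rejected, so the repair step touches 2 of 16 entries and leaves the rest of the draft untouched.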

The key insight: adjacent tokens influence each other's representations in predictable ways. A small model's KV cache won't be identical to the large model's, but it will be close — close enough that verifying is much cheaper than recomputing from scratch.
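The "cheaper than recomputing" claim can be turned into back-of-the-envelope arithmetic. The sketch below is a deliberately simple cost model, not a measurement: the function name `prefill_speedup` is hypothetical, all costs are expressed as fractions of one full target-model prefill, and it assumes repair cost grows linearly with the fraction of the cache that gets rejected. The numbers passed in are made up for illustration.

```python
def prefill_speedup(draft_cost, verify_cost, reject_rate, full_cost=1.0):
    """Estimated speedup of draft + verify + repair over a full target prefill.

    Costs are in units of one full target-model prefill. `reject_rate` is the
    fraction of the cache that must be recomputed; repair cost is assumed to
    scale linearly with it (a simplification).
    """
    if not 0.0 <= reject_rate <= 1.0:
        raise ValueError("reject_rate must be in [0, 1]")
    speculative_cost = draft_cost + verify_cost + reject_rate * full_cost
    return full_cost / speculative_cost

# Illustrative inputs only: a draft pass at 10% and a verify pass at 30%
# of full prefill cost, swept over a few rejection rates.
for reject_rate in (0.05, 0.2, 0.5):
    speedup = prefill_speedup(0.1, 0.3, reject_rate)
    print(f"reject_rate={reject_rate:.2f} -> speedup={speedup:.2f}x")
```

The useful takeaway is the break-even condition: under this model, speculation only pays off while the draft, verify, and repair costs together stay below one full prefill, so a poor draft model (high rejection rate) can erase the gain entirely.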

Think of it like a sous-chef who pre-peels all the potatoes before the head chef arrives. Most of them will be perfectly peeled; the head chef just needs to trim a few missed spots. The sous-chef does the volume work; the head chef does the quality control.

Early results suggest 1.5x to 2x prefill speedups on long prompts, with minimal quality impact. The technique is still emerging (the first dedicated papers appeared in late 2025), but it follows the same economic logic as speculative decoding: let a cheap model do the bulk of the work, and spend the expensive model's time on verification rather than on computing everything from scratch.