The Prefill Future

Prefill optimization today is roughly where speculative decoding was in 2023: early, exciting, and with clear headroom. The techniques in this chapter are the first generation. The second and third generations will likely:

Combine prompt compression with speculative prefill (compress first, then draft-verify)
Use learned routing to skip prefill altogether for known patterns
Pre-compute KV caches for common system prompts at model deployment time

And the ultimate optimization: eliminate prefill entirely for recurring prompts. If the same token prefix has already been processed by the same model, why process it again? Just load the stored KV cache and start generating from there. This is already common practice for shared system prompts; extending it to user-specific prefixes is the natural next step.
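The reuse idea can be sketched in a few lines. What follows is a minimal illustration, not any particular serving engine's implementation: `PrefixKVStore`, `prefill`, `fake_compute`, and the block size of 4 are all hypothetical names and values chosen for the example, and the "KV cache" is a plain list of strings standing in for per-layer tensors. The sketch assumes reuse is only valid for a token-for-token identical prefix, and it always leaves at least the final token to be computed, since the model still needs that token's output to start generating.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

# Stand-in for a real KV cache: one entry per token position.
# (Assumption for illustration; real caches hold per-layer key/value tensors.)
KVCache = List[str]


def prefix_key(tokens: List[int]) -> str:
    """Hash a token prefix so it can be used as a lookup key."""
    return hashlib.sha256(",".join(map(str, tokens)).encode()).hexdigest()


class PrefixKVStore:
    """Hypothetical store that keeps KV entries at block-aligned prefix lengths."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        self._store: Dict[str, KVCache] = {}

    def put(self, tokens: List[int], kv: KVCache) -> None:
        # Save the cache at every block boundary so later prompts can
        # match on any block-aligned prefix, not only the full prompt.
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            self._store[prefix_key(tokens[:end])] = kv[:end]

    def longest_prefix(self, tokens: List[int]) -> Tuple[int, KVCache]:
        # Never reuse the whole prompt: keep at least one token to compute.
        last = ((len(tokens) - 1) // self.block_size) * self.block_size
        for end in range(last, 0, -self.block_size):
            kv = self._store.get(prefix_key(tokens[:end]))
            if kv is not None:
                return end, kv
        return 0, []


def prefill(
    tokens: List[int],
    store: PrefixKVStore,
    compute_kv: Callable[[List[int], int], KVCache],
) -> Tuple[KVCache, int]:
    """Reuse the longest cached prefix, then compute only the remaining tokens."""
    hit_len, cached = store.longest_prefix(tokens)
    kv = list(cached) + compute_kv(tokens[hit_len:], hit_len)
    store.put(tokens, kv)
    return kv, hit_len


# Demo with a fake model that records how many tokens it had to process.
computed: List[int] = []


def fake_compute(tokens: List[int], start: int) -> KVCache:
    computed.append(len(tokens))
    return [f"kv[{start + i}]={t}" for i, t in enumerate(tokens)]


store = PrefixKVStore(block_size=4)
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]

kv1, hit1 = prefill(system_prompt + [9, 10], store, fake_compute)
kv2, hit2 = prefill(system_prompt + [11, 12, 13], store, fake_compute)
```

In this demo the second request shares the eight-token system prompt with the first, so only its three new tokens go through the fake model. Matching on block boundaries rather than whole prompts is a design choice made for the sketch: it lets two prompts that diverge after a shared prefix still share work.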

Chapter 5: The Composite Kitchen