The Prefill Future
Prefill optimization is where speculative decoding was in 2023: early, exciting, with clear headroom. The techniques in this chapter are the first generation. The second and third generations will likely:
- Combine prompt compression with speculative prefill (compress first, then draft-verify)
- Use learned routing to skip prefill altogether for known patterns
- Pre-compute KV caches for common system prompts at model deployment time
And the ultimate optimization: eliminate the prefill entirely for recurring patterns. If the same prompt has been processed before, why process it again? Just load the stored KV cache and start generating. This is already common practice for shared system prompts; extending it to user-specific prefixes is the natural next step.
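To make the reuse idea concrete, here is a minimal sketch of a prefix-keyed KV cache. Everything in it is an illustrative assumption rather than any serving engine's API: `PrefixKVCache`, `prefill_with_cache`, and `fake_prefill` are hypothetical names, the "KV cache" is a toy list with one entry per token instead of per-layer tensors, and prefixes are stored at fixed block boundaries (`block_size`) so that a new prompt can reuse the longest block-aligned prefix seen before.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

# Toy stand-in for real KV tensors: one (position, token) entry per token.
KV = List[Tuple[int, int]]


def fake_prefill(tokens: List[int], past: Optional[KV] = None) -> KV:
    """Hypothetical prefill: extends `past` with one entry per new token."""
    kv = list(past) if past else []
    start = len(kv)
    for offset, token in enumerate(tokens):
        kv.append((start + offset, token))
    return kv


def prefix_key(tokens: List[int]) -> str:
    """Hash a token prefix so it can be used as a dictionary key."""
    return hashlib.sha256(",".join(map(str, tokens)).encode()).hexdigest()


class PrefixKVCache:
    """Stores KV entries for block-aligned prompt prefixes (assumed design)."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        self._store: Dict[str, KV] = {}

    def insert(self, tokens: List[int], kv: KV) -> None:
        # Save the KV slice at every block boundary of the prompt.
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            self._store[prefix_key(tokens[:end])] = kv[:end]

    def longest_prefix(self, tokens: List[int]) -> Tuple[int, Optional[KV]]:
        # Try the longest block-aligned prefix first, then shorter ones.
        usable = (len(tokens) // self.block_size) * self.block_size
        for end in range(usable, 0, -self.block_size):
            kv = self._store.get(prefix_key(tokens[:end]))
            if kv is not None:
                return end, kv
        return 0, None


def prefill_with_cache(cache: PrefixKVCache, tokens: List[int]) -> Tuple[KV, int]:
    """Returns the full KV and the number of tokens actually prefilled."""
    hit_len, past = cache.longest_prefix(tokens)
    kv = fake_prefill(tokens[hit_len:], past)
    cache.insert(tokens, kv)
    return kv, len(tokens) - hit_len


cache = PrefixKVCache(block_size=4)
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]

_, cold = prefill_with_cache(cache, system_prompt + [9, 10])      # nothing cached yet
_, warm = prefill_with_cache(cache, system_prompt + [11, 12, 13])  # shared prefix reused
_, repeat = prefill_with_cache(cache, system_prompt)               # whole prompt cached
print(cold, warm, repeat)
```

In this toy, the first request prefills all 10 tokens, the second prefills only the 3 tokens after the shared 8-token prefix, and the exact repeat prefills none. A real engine would typically still need logits for the final prompt token before it can sample, and would also have to handle eviction, memory limits, and isolation between users, none of which this sketch attempts.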