Optimizing the Inbound Trip
Or: Why the chef who preps ingredients before service starts is always faster
There's a scene in every restaurant documentary: the chef arrives at 6 AM, hours before the first customer. They unpack deliveries, wash vegetables, portion proteins, prepare sauces. By the time the first order arrives, everything is ready. All that's left is the final assembly.
This is prefill.
Inference has two phases: prefill (processing the input prompt) and decode (generating the output). Everyone obsesses over decode — it's the visible part, where tokens stream out one at a time. But prefill is becoming the hidden bottleneck.
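To make the two phases concrete, here is a toy sketch in plain Python. It is not how any real inference engine is written: the single attention head, the identity "projections", and the function names `prefill` and `decode_step` are all assumptions made for illustration. It shows only the shape of the work: prefill builds cached keys and values for the whole prompt, and each decode step appends one entry and reuses the rest.

```python
import math

def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    width = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(width)]

def prefill(prompt_vectors):
    """Process the whole prompt and build the key/value cache.

    Toy assumption: keys and values are just copies of the token vectors.
    A real model computes them with learned projections, and handles all
    positions in parallel as large matrix multiplies rather than a loop.
    """
    cache_k = [list(x) for x in prompt_vectors]
    cache_v = [list(x) for x in prompt_vectors]
    # Causal attention: position i only sees positions 0..i.
    outputs = [attend(x, cache_k[:i + 1], cache_v[:i + 1])
               for i, x in enumerate(prompt_vectors)]
    return cache_k, cache_v, outputs[-1]

def decode_step(x, cache_k, cache_v):
    """Process one new token, reusing everything already in the cache."""
    cache_k.append(list(x))
    cache_v.append(list(x))
    return attend(x, cache_k, cache_v)

prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache_k, cache_v, last = prefill(prompt)            # one pass over the prompt
step = decode_step([0.5, -0.5], cache_k, cache_v)   # then one token at a time
```

The point of the cache is that a decode step gives the same answer as reprocessing the whole sequence from scratch, at a fraction of the work. Prefill is the price paid up front for that shortcut.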
During prefill, the model processes the entire input in a single pass, computing the key and value (KV) cache entries for every prompt token in parallel. This phase is compute-bound: all of the GPU's cores are busy at once, and for short prompts it is nearly instant. But for long prompts (think 100K tokens, a whole document, a full codebase), prefill can take 10 to 30 seconds, depending on the model and hardware, before a single output token appears.
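A back-of-envelope estimate shows where a tens-of-seconds figure can come from. The sketch below uses the common approximation of about 2 FLOPs per parameter per token for the dense weights, plus an attention term that grows with the square of the prompt length. Every number in it (a 70-billion-parameter model with 80 layers and a hidden size of 8192, four accelerators at 1e15 peak FLOP/s each, 40% utilization) is a hypothetical chosen for illustration, not a measurement of any particular system.

```python
def prefill_flops(n_tokens, n_params, n_layers, d_model):
    """Rough FLOP count for one prefill pass (an approximation, not exact)."""
    # Dense weights: about 2 FLOPs (multiply + add) per parameter per token.
    weight_flops = 2 * n_params * n_tokens
    # Attention: the score matmul and the weighted sum each cost about
    # 2 * n^2 * d FLOPs per layer; the causal mask roughly halves the total,
    # leaving about 2 * n^2 * d per layer.
    attn_flops = 2 * n_layers * d_model * n_tokens ** 2
    return weight_flops + attn_flops

def prefill_seconds(n_tokens, n_params, n_layers, d_model,
                    peak_flops_per_sec, utilization):
    """Estimated prefill time, assuming the phase is purely compute-bound."""
    effective = peak_flops_per_sec * utilization
    return prefill_flops(n_tokens, n_params, n_layers, d_model) / effective

# Hypothetical model and hardware; adjust to match your own setup.
MODEL = dict(n_params=70_000_000_000, n_layers=80, d_model=8192)
HARDWARE = dict(peak_flops_per_sec=4 * 1e15, utilization=0.4)

for n in (1_000, 10_000, 100_000):
    t = prefill_seconds(n, **MODEL, **HARDWARE)
    print(f"{n:>7} prompt tokens: ~{t:.2f} s before the first output token")
```

Under these assumptions a 1,000-token prompt costs under a tenth of a second, while a 100,000-token prompt costs about 17 seconds. Note that the attention term is negligible for short prompts but nearly matches the weight term at 100K tokens, so the cost grows faster than linearly as prompts get longer.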
That's an eternity for interactive use. And the trend is toward longer prompts. System prompts are pages long. Chat histories accumulate. Documents get attached. The prefill phase, once an afterthought, is becoming the dominant latency component.