
Prefix Caching and Shared Memories

There's one more trick, and it's almost embarrassingly simple.

When many users interact with the same AI system, their prompts often start identically. The system prompt ("You are a helpful assistant who..."), the user's identity preamble, and the conversation opener are repeated verbatim. Some of these are shared across every session with a given user, and some, like the system prompt, across all users of the system.

So why recompute the KV cache for the shared prefix every time?

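Prefix caching stores the KV cache for common prefixes once and reuses it across requests. This works because, with causal attention, the keys and values for a token depend only on the tokens before it, so an identical prefix always produces an identical KV cache. If 1,000 user sessions all start with the same 4,000-token system prompt, you compute the KV cache for that prefix once and skip the other 999 recomputations. Each skipped prefill removes latency from a user interaction, potentially seconds for a long prompt on a large model.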
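To make the bookkeeping concrete, here is a minimal sketch in Python. It is a toy, not how any particular serving framework implements the feature: the class `ToyPrefixCache`, its `prefill` method, and the `shared_len` parameter are all hypothetical names, and the "KV cache" is just a copy of the token list standing in for real key and value tensors. The only thing it demonstrates is the counting argument from the 1,000-session example.

```python
import hashlib
from typing import Dict, List


class ToyPrefixCache:
    """Hypothetical toy: maps a hash of a token prefix to a stand-in KV cache."""

    def __init__(self) -> None:
        self._store: Dict[str, List[int]] = {}
        self.prefill_tokens_computed = 0  # how many tokens we actually "ran"

    @staticmethod
    def _key(tokens: List[int]) -> str:
        # Identical token sequences produce identical keys.
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def _compute_kv(self, tokens: List[int]) -> List[int]:
        # Placeholder for the real prefill pass; we only count the work.
        self.prefill_tokens_computed += len(tokens)
        return list(tokens)

    def prefill(self, prompt: List[int], shared_len: int) -> List[int]:
        prefix, suffix = prompt[:shared_len], prompt[shared_len:]
        key = self._key(prefix)
        if key not in self._store:
            # First request with this prefix pays the full cost.
            self._store[key] = self._compute_kv(prefix)
        # Every request still pays for its own unique suffix.
        return self._store[key] + self._compute_kv(suffix)


# 1,000 sessions sharing a 4,000-token system prompt, 20 unique tokens each.
system_prompt = list(range(4000))
cache = ToyPrefixCache()
for user in range(1000):
    prompt = system_prompt + [10_000 + user] * 20
    kv = cache.prefill(prompt, shared_len=len(system_prompt))

print(cache.prefill_tokens_computed)  # 24000, versus 4020000 with no cache
```

In this sketch the caller says where the shared prefix ends. Production systems typically find the match automatically, for example by hashing fixed-size blocks of tokens or walking a prefix tree, so that partial overlaps can be reused too.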

The savings grow with the length of the shared prefix and the number of requests that share it. For a popular AI assistant handling millions of conversations, prefix caching can reduce total compute by roughly 30-50%, depending on how much of each prompt is shared. It's not glamorous, but it's one of the highest-impact optimizations in production serving today.
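A rough way to see how the savings scale is to count prefill tokens with and without the cache. The function below is a back-of-the-envelope sketch under stated assumptions: `prefill_savings` is a hypothetical helper, it treats cost as proportional to token count, and it ignores decoding, cache misses, and eviction, so real-world figures will differ.

```python
def prefill_savings(shared_tokens: int, unique_tokens: int, requests: int) -> float:
    """Fraction of prefill tokens avoided when the shared prefix is computed once."""
    without_cache = requests * (shared_tokens + unique_tokens)
    with_cache = shared_tokens + requests * unique_tokens
    return 1 - with_cache / without_cache


# Half of each prompt is shared: savings approach 50% as requests grow.
half_shared = prefill_savings(shared_tokens=4000, unique_tokens=4000, requests=1000)

# Almost all of each prompt is shared: savings approach 100%.
mostly_shared = prefill_savings(shared_tokens=4000, unique_tokens=20, requests=1000)

print(f"{half_shared:.4f}")
print(f"{mostly_shared:.4f}")
```

Under this model the savings depend on the ratio of shared to unique tokens, and the one-time cost of computing the prefix becomes negligible as the number of requests grows.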

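vLLM, TensorRT-LLM, and SGLang, three of the major inference serving frameworks, all support prefix caching as a standard feature. vLLM calls it automatic prefix caching, TensorRT-LLM calls it KV cache reuse, and SGLang implements it through RadixAttention. It's the silent workhorse of production AI.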
