Compressing the KV Cache
Or: How to remember an entire conversation without memorizing every word
Picture a librarian who serves ten thousand patrons. Each patron borrows books and returns them. The librarian needs to remember every book every patron has ever checked out — not because they might need to mention it again, but because every future interaction builds on past ones.
This is the KV cache. It's the model's memory of everything that has been said in the current conversation. Every previous token, whether it came from the prompt or from the model's own output so far, has a key vector and a value vector stored in this cache for every attention layer. When generating the next token, the model compares a query vector for the new token against the cached keys to decide which previous tokens are relevant to what it's about to say, then blends the matching values accordingly.
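To make the lookup concrete, here is a minimal sketch of one decoding step against a cache. It assumes a single attention head, a single layer, and tiny hand-picked two-dimensional vectors; the names `KVCache` and `attend` are hypothetical and exist only for illustration. Real models repeat this per layer and per head, using learned projections.

```python
import math


class KVCache:
    """Toy single-head cache: one key and one value vector per past token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        # Called once per token; nothing is ever recomputed or removed.
        self.keys.append(key)
        self.values.append(value)


def attend(query, cache):
    """Scaled dot-product attention of one query over every cached token."""
    dim = len(query)
    # Relevance score of each cached token: query . key / sqrt(dim)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in cache.keys
    ]
    # Numerically stable softmax turns scores into weights that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output is the weighted blend of the cached value vectors.
    value_dim = len(cache.values[0])
    output = [
        sum(w * v[i] for w, v in zip(weights, cache.values))
        for i in range(value_dim)
    ]
    return weights, output


# Illustrative values only: three past tokens with made-up keys and values.
cache = KVCache()
cache.append([1.0, 0.0], [10.0, 0.0])
cache.append([0.0, 1.0], [0.0, 10.0])
cache.append([1.0, 1.0], [5.0, 5.0])

# The new token's query points along the first axis, so the second cached
# token (whose key is orthogonal to the query) gets the smallest weight.
weights, output = attend([4.0, 0.0], cache)
print(weights, output)
```

Note that the loop inside `attend` touches every cached key and value on every step. That read traffic, not the arithmetic, is what the rest of this section is about.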
The KV cache is enormous. An 8-billion-parameter model holding a 128K-token context might need 40 to 80 gigabytes just for this cache, depending on its attention layout and numeric precision. That is more than most consumer GPUs have. And it grows linearly with sequence length. Double the context, double the cache. Triple it, triple the cache.
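The arithmetic behind a figure like this is short enough to check by hand. The sketch below assumes a hypothetical 8B-class shape (32 layers, 128-dimensional heads, 16-bit values, batch size 1); these numbers are illustrative assumptions, not taken from any specific model card. The helper name `kv_cache_bytes` is likewise made up for this example.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache keys and values for seq_len tokens."""
    # The leading 2 counts one key vector plus one value vector
    # per token, per layer, per KV head.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


GIB = 1024 ** 3
SEQ_LEN = 128 * 1024  # a 128K-token context

# Assumed shape: 32 layers, head_dim 128, 16-bit (2-byte) values.
full_heads = kv_cache_bytes(32, 32, 128, SEQ_LEN)  # 32 KV heads
fewer_heads = kv_cache_bytes(32, 8, 128, SEQ_LEN)  # 8 shared KV heads

print(full_heads / GIB)   # 64.0
print(fewer_heads / GIB)  # 16.0

# Linear growth: doubling the context doubles the cache.
doubled = kv_cache_bytes(32, 32, 128, 2 * SEQ_LEN)
print(doubled / full_heads)  # 2.0
```

Under these assumptions the cache with 32 KV heads lands at 64 GiB, inside the range quoted above. Every factor in the product is a lever: fewer KV heads, fewer bytes per value, or fewer cached tokens each shrink the total proportionally.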
This is the memory wall in its purest form. The model's compute isn't the bottleneck; storing and retrieving the conversation history is. The chef isn't waiting for the knife; the chef is waiting on trips to the walk-in freezer.
Researchers have attacked this problem from multiple angles. The goal is always the same: store less, retrieve smarter, keep the quality.