GQA and MLA: Fewer Storage Shelves, Same Information
The simplest insight: not every attention head needs its own complete copy of every previous token.
In standard multi-head attention, every head (typically 32 or more) has its own independent key and value projections. The KV cache therefore holds 32 separate sets of keys and values, each accounting for about 1/32 of the total. Grouped Query Attention (GQA) shares keys and values across groups of query heads. Instead of 32 KV heads, you might have 8, with 4 query heads sharing each one. The KV cache shrinks by 75%.
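The 75% figure can be checked with a few lines of arithmetic. This is a minimal sketch: the function name kv_cache_bytes and the model shape (32 layers, head dimension 128, 8,192 tokens, 2 bytes per value) are assumptions chosen for illustration and do not describe any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one key vector and one value vector per KV head, per token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model shape, chosen only for illustration.
LAYERS, HEAD_DIM, SEQ_LEN = 32, 128, 8192

# Multi-head attention: one KV head per query head (32 of them).
mha = kv_cache_bytes(LAYERS, num_kv_heads=32, head_dim=HEAD_DIM, seq_len=SEQ_LEN)
# GQA: 8 KV heads, each shared by 4 query heads.
gqa = kv_cache_bytes(LAYERS, num_kv_heads=8, head_dim=HEAD_DIM, seq_len=SEQ_LEN)

print(f"MHA: {mha / 2**30:.2f} GiB")       # MHA: 4.00 GiB
print(f"GQA: {gqa / 2**30:.2f} GiB")       # GQA: 1.00 GiB
print(f"Reduction: {1 - gqa / mha:.0%}")   # Reduction: 75%
```

The number of query heads never appears in the formula. Only the number of KV heads determines cache size, which is why sharing them is enough to shrink it.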
The analogy: instead of 32 librarians each keeping a full list of every patron's borrowing history, you have 8 librarians, each covering a department. The science librarian handles all science questions; the fiction librarian handles all fiction questions. Each librarian has their own records, but within a department, the answers are shared.
Multi-head Latent Attention (MLA), introduced by DeepSeek in 2024, takes this further. Instead of storing full key and value vectors (each being thousands of numbers), it compresses them into a smaller "latent" space: a compact representation that captures the essential information in fewer numbers. It is like writing a one-paragraph summary of a book instead of transcribing it cover to cover. The summary takes less shelf space and can be expanded into full detail when needed.
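The compress-then-expand idea can be sketched as a pair of projections: a down-projection whose output is the only thing cached, and up-projections that rebuild keys and values on demand. This is a toy illustration under stated assumptions, not DeepSeek's implementation. The names (W_down, W_up_k, W_up_v), the tiny dimensions, and the random weights standing in for learned ones are all hypothetical, and the sketch leaves out details of the published design such as its separate handling of positional encoding.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    """Random stand-in for a learned projection matrix."""
    return [[rng.gauss(0.0, cols ** -0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)

# Toy sizes, illustrative only: hidden size, latent size, total key/value width.
D_MODEL, D_LATENT, D_KV = 64, 8, 64

W_down = random_matrix(D_LATENT, D_MODEL, rng)  # hidden state -> latent
W_up_k = random_matrix(D_KV, D_LATENT, rng)     # latent -> keys
W_up_v = random_matrix(D_KV, D_LATENT, rng)     # latent -> values

latent_cache = []  # the only per-token state that is kept

def append_token(hidden):
    """Compress a token's hidden state and cache only the latent vector."""
    latent_cache.append(matvec(W_down, hidden))

def expand_cache():
    """Rebuild full keys and values from the cached latents when needed."""
    keys = [matvec(W_up_k, c) for c in latent_cache]
    values = [matvec(W_up_v, c) for c in latent_cache]
    return keys, values

for _ in range(5):
    append_token([rng.gauss(0.0, 1.0) for _ in range(D_MODEL)])

keys, values = expand_cache()
stored = len(latent_cache) * D_LATENT   # numbers actually cached
full = len(latent_cache) * 2 * D_KV     # numbers a full KV cache would hold
print(stored, full)                     # 40 640
```

With these toy sizes the cache holds 40 numbers instead of 640. The trade-off is extra computation at read time, since keys and values have to be reconstructed from the latents.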
Today, most production language models with published architectures use GQA or MLA. They're standard practice, not exotic innovations. But they only reduce the cache by a constant factor, typically 4x to 8x, and the cache still grows linearly with sequence length. That's not enough when sequences are growing to millions of tokens.