The Combined Effect

Quantization (4x), GQA (4x), and prefix caching (potentially unlimited for shared prefixes) can reduce the effective KV cache per unique token from gigabytes to megabytes. This isn't theory; this is what production systems do today.

Consider the trajectory:

2023: A 200K context on a single GPU was borderline impossible
2024: GQA + INT8 quantization made it achievable with minor quality loss
2025: CCGQA + INT4 quantization made 1M contexts practical on consumer hardware
2026: The combination of all techniques makes multi-million-token contexts visible on the horizon

Each technique independently seems small — a 2x here, a 4x there. But multiplied together, they represent two orders of magnitude of progress from pure engineering. No new physics, no new silicon — just smarter ways to store and retrieve information.

Chapter 4: The Express Lane