← Back to Red Deer Investments  ·  AI Library Home

The Combined Effect

Quantization (4x), GQA (4x), and prefix caching (potentially unlimited for shared prefixes) can reduce the effective KV cache per unique token from gigabytes to megabytes. This isn't theory; this is what production systems do today.

Consider the trajectory:

  • 2023: A 200K context on a single GPU was borderline impossible
  • 2024: GQA + INT8 quantization made it achievable with minor quality loss
  • 2025: CCGQA + INT4 quantization made 1M contexts practical on consumer hardware
  • 2026: The combination of all techniques makes multi-million-token contexts visible on the horizon

Each technique independently seems small — a 2x here, a 4x there. But multiplied together, they represent two orders of magnitude of progress from pure engineering. No new physics, no new silicon — just smarter ways to store and retrieve information.


Chapter 4: The Express Lane

← Previous Next →