The Stacking Principle
Each technique addresses a different bottleneck, so to a first approximation their speedups multiply rather than add.
Consider a hypothetical inference stack:
| Layer | Technique | Speedup | What it solves |
|---|---|---|---|
| 1 | Prefix caching | 1.5--2x | Shared prompt recomputation |
| 2 | Prompt compression | 2--5x | Long input latency |
| 3 | KV cache quantization (INT4) | 2--4x | Memory footprint per token |
| 4 | CCA compressed attention | 2x | Attention compute overhead |
| 5 | EAGLE-3 speculative decoding | 4--6x | Memory round trips in the decode phase |
| 6 | Dynamic batching | 2x | GPU utilization across requests |
The math, taking one representative value from each range: 2 × 3 × 3 × 2 × 5 × 2 = 360x. A model that generated 10 tokens per second in 2023 could, with all techniques applied and every gain compounding in full, deliver roughly 3,600 tokens per second on the same hardware.
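The compounding arithmetic can be checked with a short script. This is a minimal sketch, not a benchmark: it assumes, as the text does, that the per-technique speedups are fully independent and multiply. The names `STACK` and `compound` are hypothetical, and the representative column holds the values used in the worked product.

```python
from math import prod

# (technique, low, high, representative) copied from the table.
# "representative" is the single value picked from each range for the
# worked product; it is an illustrative choice, not a measurement.
STACK = [
    ("Prefix caching", 1.5, 2.0, 2.0),
    ("Prompt compression", 2.0, 5.0, 3.0),
    ("KV cache quantization (INT4)", 2.0, 4.0, 3.0),
    ("CCA compressed attention", 2.0, 2.0, 2.0),
    ("EAGLE-3 speculative decoding", 4.0, 6.0, 5.0),
    ("Dynamic batching", 2.0, 2.0, 2.0),
]

LOW, HIGH, REPRESENTATIVE = 1, 2, 3  # column indices into each row


def compound(column: int) -> float:
    """Multiply one column of speedups across every layer of the stack."""
    return prod(row[column] for row in STACK)


if __name__ == "__main__":
    baseline_tokens_per_second = 10  # the 2023 baseline from the text
    for label, column in (("low", LOW), ("representative", REPRESENTATIVE), ("high", HIGH)):
        speedup = compound(column)
        print(f"{label}: {speedup:.0f}x -> "
              f"{baseline_tokens_per_second * speedup:.0f} tokens/s")
```

Under the same independence assumption, the low end of every range compounds to 96x and the high end to 960x, so the 360x figure sits in the middle of the band the table implies.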
That's not marketing. These techniques all exist today, some as research prototypes and some as mature systems running in production. The numbers are real.