Every Trick, All at Once
Or: What happens when you unleash algorithmic parallelism on a single chip
We've now seen four families of techniques, each attacking a different part of the memory wall:
- Speculative decoding amortizes memory trips across multiple tokens (Chapter 1)
- Discrete diffusion converts the problem from memory-bound to compute-bound by generating blocks in parallel (Chapter 2)
- Attention compression shrinks what needs to be stored and moved (Chapter 3)
- Prefill optimization speeds up the input processing phase (Chapter 4)
The natural question: do these techniques conflict? Can you deploy speculative decoding AND compressed attention AND prompt compression all at once?
The answer, surprisingly, is yes, and they compound.