Every Trick, All at Once

Or: What happens when you unleash algorithmic parallelism on a single chip

We've now seen four families of techniques, each attacking a different part of the memory wall:

  • Speculative decoding amortizes memory trips across multiple tokens (Chapter 1)
  • Discrete diffusion converts the problem from memory-bound to compute-bound by generating blocks in parallel (Chapter 2)
  • Attention compression shrinks what needs to be stored and moved (Chapter 3)
  • Prefill optimization speeds up the input processing phase (Chapter 4)

The natural question: do these techniques conflict? Can you deploy speculative decoding AND compressed attention AND prompt compression all at once?

The answer, surprisingly, is yes, and they compound.
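To make "compound" concrete, here is a toy latency model. It is a sketch under stated assumptions, not a measurement: it assumes each technique divides the time of the phase it targets, and that speedups aimed at the same phase multiply independently, which is an idealization. The function name `combined_latency` and all the numbers are hypothetical, chosen only for illustration.

```python
def combined_latency(prefill_s, decode_s, prefill_speedup=1.0, decode_speedups=()):
    """Toy model: total latency = prefill time + decode time.

    Assumption (idealized): speedups that target the same phase
    multiply, and a speedup only helps the phase it targets.
    """
    decode_factor = 1.0
    for s in decode_speedups:
        decode_factor *= s
    return prefill_s / prefill_speedup + decode_s / decode_factor


# Hypothetical request: 2 s of prefill, 8 s of decode.
baseline = combined_latency(2.0, 8.0)

# Hypothetical stack: a 2x prefill optimization, plus two decode-side
# techniques (say, speculative decoding and attention compression)
# each assumed to give 2x on decode.
stacked = combined_latency(2.0, 8.0, prefill_speedup=2.0, decode_speedups=(2.0, 2.0))

print(f"baseline: {baseline:.1f} s, stacked: {stacked:.1f} s, "
      f"overall speedup: {baseline / stacked:.2f}x")
```

In this made-up case no single technique exceeds 2x on its own phase, yet the end-to-end latency drops from 10 s to 3 s. Real deployments will deviate from the independence assumption, so treat the model as intuition rather than prediction.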
