The Meaning of Diffusion

Diffusion changes the fundamental economics of inference. With autoregressive generation, every token costs the same: one full load of the model's weights from memory. With diffusion, the cost is front-loaded: you pay a fixed price to generate the block, and marginal tokens beyond that are nearly free.

This has profound implications for how we think about speed. In the autoregressive world, generating 100 tokens costs 100x the single-token cost. In the diffusion world, generating 100 tokens might cost 6x the cost of a single pass over the block (because you need multiple refinement steps) plus a small overhead per token. The ratio gets more favorable the longer the output.
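The arithmetic above can be made concrete with a toy cost model. This is only a sketch under stated assumptions: the function names are hypothetical, one pass over the block is assumed to cost the same as one autoregressive token step, the whole output is assumed to fit in a single block, and the per-token overhead of 0.01 is an illustrative number, not a measurement.

```python
def autoregressive_cost(num_tokens: int, token_cost: float = 1.0) -> float:
    """One full memory load per generated token."""
    return num_tokens * token_cost


def diffusion_cost(num_tokens: int,
                   block_cost: float = 1.0,          # assumed equal to one token step
                   refinement_steps: int = 6,        # the "6x" from the text
                   per_token_overhead: float = 0.01  # illustrative, not measured
                   ) -> float:
    """Fixed cost for the refinement passes plus a small per-token overhead."""
    return refinement_steps * block_cost + per_token_overhead * num_tokens


def speedup(num_tokens: int) -> float:
    """How many times cheaper diffusion is than autoregressive generation."""
    return autoregressive_cost(num_tokens) / diffusion_cost(num_tokens)


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} tokens: "
              f"autoregressive={autoregressive_cost(n):7.1f}  "
              f"diffusion={diffusion_cost(n):6.2f}  "
              f"speedup={speedup(n):5.1f}x")
```

Under these assumptions a single token is cheaper to produce autoregressively, the break-even point falls at roughly six tokens, and at 100 tokens diffusion comes out about 14 times cheaper. The speedup cannot grow forever in this model: it levels off near the ratio of the token cost to the per-token overhead.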

For short outputs (a sentence, a single answer), autoregressive generation is fine. For long outputs (a story, a generated program, a multi-page report), diffusion's savings grow roughly linearly with length.

Diffusion isn't a complete replacement for autoregressive generation. Pure diffusion models still lag slightly on tasks requiring careful long-range planning, such as writing a novel chapter where the ending must connect coherently to the beginning. The "generate everything at once" approach means the model must settle the global structure of the whole block within a few refinement steps, whereas autoregressive models can course-correct as they go. The hybrid approach (TiDAR) addresses this by keeping autoregressive verification, but it adds system complexity.

But as a direction — moving from memory-bound sequential generation to compute-bound parallel generation — diffusion represents the most fundamental rethink of how language models produce text since the invention of the transformer.

Chapter 3: The Librarian's Shortcut