Discrete Diffusion and Parallel Generation

Or: Why sculptors finish faster than potters

Imagine you're writing a letter using a typewriter that can only type one character at a time, and each keystroke requires a full second of mechanical work. At roughly six characters per word, spaces included, a 500-word letter is about 3,000 keystrokes, or about fifty minutes. That's autoregressive generation: one token at a time, each requiring a full trip to memory.

Now imagine a different kind of typewriter. This one starts by covering the page with a dark smudge — every character position filled with a random scribble. Then, on each pass, it looks at the entire page simultaneously and brightens some characters, darkens others, clarifies edges, fills in shapes. After a few passes, the smudge resolves into clear, readable text. Every character on the page was generated in parallel, not one at a time.

This is discrete diffusion, and it is arguably one of the most radical departures from the "one token at a time" paradigm since the transformer was invented.
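To make the second typewriter concrete, here is a minimal toy sketch of the refinement loop, not a real model. Several things in it are assumptions for illustration only: the names `toy_denoiser` and `parallel_decode` are hypothetical, the "denoiser" cheats by peeking at the target text and assigning random confidences, and unresolved positions start out blank (a masking variant) rather than as random scribbles. What it does show is the shape of the loop: every pass looks at the whole page, and the number of passes is fixed in advance rather than growing with the length of the text.

```python
import random

def toy_denoiser(page, target, rng):
    """Hypothetical stand-in for a trained model.

    For every unresolved position it returns (guess, confidence).
    It cheats by reading `target`, which a real model cannot do.
    """
    return {i: (target[i], rng.random())
            for i, ch in enumerate(page) if ch is None}

def parallel_decode(target, num_passes=4, seed=0):
    """Resolve a fully blank page in at most `num_passes` whole-page passes."""
    if num_passes < 1:
        raise ValueError("num_passes must be at least 1")
    rng = random.Random(seed)
    page = [None] * len(target)          # None marks an unresolved position
    passes = 0
    for step in range(num_passes):
        proposals = toy_denoiser(page, target, rng)  # one look at the whole page
        if not proposals:
            break
        passes += 1
        # Commit an even share of what is left, so the final pass finishes the page.
        remaining_passes = num_passes - step
        k = -(-len(proposals) // remaining_passes)   # ceiling division
        most_confident = sorted(proposals,
                                key=lambda i: proposals[i][1],
                                reverse=True)[:k]
        for i in most_confident:
            page[i] = proposals[i][0]
    return "".join(page), passes

text, passes = parallel_decode("hello world", num_passes=4)
print(text, "in", passes, "passes instead of", len("hello world"), "keystrokes")
```

With these settings the eleven-character page resolves in four passes, where the one-character-at-a-time typewriter would need eleven keystrokes. In a real system each pass is a full model evaluation, so the saving depends on how few passes the model can get away with.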