Zyphra ZAYA1: The First MoE Diffusion Model

Zyphra, a small research lab with about thirty employees, took TiDAR's recipe and pushed it further. Their ZAYA1-8B model was already a Mixture of Experts (MoE) architecture: a model that activates only a subset of its parameters for each token rather than all of them. This makes it more efficient by default, since each token costs less compute and reads fewer weights from memory, even though the full parameter set must still be stored. But could it also be converted to diffusion?
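
To make "only a subset of its parameters" concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not ZAYA1's actual router, whose details are not given here; the expert count, the value of k, and the toy experts are all assumptions chosen for illustration.

```python
import math
import random


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the weights sum to 1.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, experts, router_logits, k=2):
    """Mix the outputs of the k selected experts; the others are never run."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Hypothetical toy experts: each one scales its input by a different factor.
NUM_EXPERTS = 8
experts = [lambda x, s=s: [s * v for v in x] for s in range(1, NUM_EXPERTS + 1)]

# Stand-in router scores; a real router computes these from the token itself.
rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

routes = top_k_route(router_logits, k=2)
active_fraction = len(routes) / NUM_EXPERTS  # 2 of 8 experts run for this token
output = moe_layer([1.0, 2.0], experts, router_logits, k=2)
```

In this toy setup only a quarter of the expert weights are touched for the token, which is where the per-token savings come from. All eight experts still have to be kept in memory, because a different token may be routed to any of them.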

They applied the TiDAR conversion process to their existing ZAYA1-8B checkpoint, then continued training: 600 billion tokens at 32K context, 500 billion tokens at 128K context, and a diffusion-specific fine-tuning phase. The result is ZAYA1-8B-Diffusion-Preview, the first MoE diffusion model converted from an autoregressive LLM.

The speedup: 4.6x with a lossless sampler, and up to 7.7x with their new "mixed-logit" sampler — which trades a small amount of output quality for dramatically more speed, tunable at runtime like a slider between "fast" and "accurate."
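
The text does not describe how either sampler works internally. As a rough illustration of what "lossless" can mean for a parallel drafter, the sketch below shows greedy prefix acceptance as used in speculative decoding: drafted tokens are kept only while they match what the autoregressive model would have chosen, so the final output is identical to ordinary decoding. The function name and token values are hypothetical, and this should not be read as Zyphra's implementation.

```python
def accept_prefix(draft_tokens, verifier_tokens):
    """Keep drafted tokens while they agree with the verifier.

    verifier_tokens[i] is assumed to be the token the autoregressive model
    would pick at position i, given the prompt plus draft_tokens[:i].
    At the first disagreement the verifier's token is emitted instead and
    the rest of the draft is discarded.
    """
    accepted = []
    for drafted, verified in zip(draft_tokens, verifier_tokens):
        if drafted == verified:
            accepted.append(drafted)
        else:
            accepted.append(verified)
            break
    return accepted


# Hypothetical token ids: the draft diverges at the third position.
partial = accept_prefix([5, 7, 9, 2], [5, 7, 1, 4])  # [5, 7, 1]
full = accept_prefix([1, 2, 3], [1, 2, 3])           # [1, 2, 3]
```

Under a scheme like this, every pass yields at least one correct token, and the speedup depends on how long the matching prefix tends to be. A sampler that relaxes the exact-match rule can keep more tokens per pass at some cost in fidelity, which is the general kind of trade-off the "mixed-logit" sampler is described as making.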

They achieve this by diffusing a block of 16 tokens at once, in a single forward pass. Where a standard autoregressive model makes 16 separate trips to memory to generate 16 tokens, ZAYA1-Diffusion makes one trip and gets up to 16 tokens back.
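
A back-of-the-envelope calculation helps connect the block size to the reported speedups. The sketch below assumes that decoding is memory-bound, so one block pass costs about the same as one autoregressive pass (a cost ratio of 1.0). That ratio is an assumption, not a figure from Zyphra, and the function names are illustrative.

```python
BLOCK_SIZE = 16  # tokens diffused per forward pass, from the text


def speedup(tokens_kept_per_pass, pass_cost_ratio=1.0):
    """Speedup over one-token-per-pass decoding.

    pass_cost_ratio is the cost of one block pass relative to one
    autoregressive pass; 1.0 is the memory-bound assumption.
    """
    return tokens_kept_per_pass / pass_cost_ratio


def implied_tokens_per_pass(reported_speedup, pass_cost_ratio=1.0):
    """Invert the formula: average tokens kept per pass for a given speedup."""
    return reported_speedup * pass_cost_ratio


ideal = speedup(BLOCK_SIZE)                    # every drafted token kept
lossless_tokens = implied_tokens_per_pass(4.6)
mixed_tokens = implied_tokens_per_pass(7.7)

# Share of each 16-token block that survives, under the same assumption.
lossless_fraction = lossless_tokens / BLOCK_SIZE
mixed_fraction = mixed_tokens / BLOCK_SIZE
```

Under this assumption the ceiling is 16x, and the reported 4.6x and 7.7x correspond to keeping roughly 29% and 48% of each block on average. If a block pass actually costs more than a single-token pass, the implied number of kept tokens is proportionally higher.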

Zyphra's CCA attention architecture (the subject of the next chapter) plays a key role here. CCA compresses the attention operation into a smaller latent space, reducing the computation required. This means the model can diffuse more tokens in parallel before hitting a compute bottleneck, so the diffusion approach and the compressed attention architecture reinforce each other.