CCA: Compressed Convolutional Attention

Zyphra's CCA (Compressed Convolutional Attention), published in October 2025, takes the compression idea to its logical extreme. Where MLA compresses only keys and values, CCA compresses queries, keys, and values into a small shared latent space. The entire attention operation (the comparison, the weighting, the aggregation) happens inside this compressed space, and the model expands back to full size only for the final output.
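To make the idea concrete, here is a minimal NumPy sketch of attention computed entirely in a compressed latent space. It is an illustration under stated assumptions, not Zyphra's implementation: it leaves out the convolutional mixing that gives CCA its name, along with any other details of the published method, and the names latent_attention, w_q, w_k, w_v, and w_up, as well as all the sizes, are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def latent_attention(x, w_q, w_k, w_v, w_up, n_heads):
    """Causal attention carried out in a compressed latent space (simplified).

    x:             (seq_len, d_model) input activations
    w_q, w_k, w_v: (d_model, d_latent) down-projections, d_latent < d_model
    w_up:          (d_latent, d_model) projection back to full width
    """
    seq_len = x.shape[0]
    d_latent = w_q.shape[1]
    d_head = d_latent // n_heads

    def split_heads(t):
        # (seq_len, d_latent) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    # Queries, keys, and values are all projected down to the latent width.
    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Comparison, weighting, and aggregation all use the small tensors.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # block attention to later tokens
    weights = softmax(scores)
    ctx = weights @ v                           # (n_heads, seq_len, d_head)

    # Expand back to full width only for the final output.
    ctx = ctx.transpose(1, 0, 2).reshape(seq_len, d_latent)
    return ctx @ w_up, (k, v)

# Hypothetical sizes: the latent space is 8x narrower than the model width.
rng = np.random.default_rng(0)
d_model, d_latent, n_heads, seq_len = 64, 8, 2, 5
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_latent)) * 0.1 for _ in range(3))
w_up = rng.standard_normal((d_latent, d_model)) * 0.1

out, (k_cache, v_cache) = latent_attention(x, w_q, w_k, w_v, w_up, n_heads)
print(out.shape, k_cache.shape)
```

The returned k and v are what a KV cache would store, and they have the latent width rather than the full model width. That is where the memory saving comes from in this simplified picture.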

The effect is dramatic: Zyphra reports 8x KV-cache compression with no quality loss compared to full multi-head attention. On certain MoE architectures, CCA is even reported to outperform full attention while using an 8x smaller cache. One plausible reading is that the compression acts as a regularizer, forcing the model to focus on the most important patterns and ignore noise.
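For a sense of scale, the following back-of-the-envelope calculation shows what an 8x smaller cache means in bytes. The model dimensions (32 layers, width 4096, fp16 values) are hypothetical and chosen only to give round numbers. They do not describe any model from the CCA work.

```python
def kv_cache_bytes(n_layers, seq_len, kv_width, bytes_per_value=2):
    # Factor of 2: each layer caches one tensor for keys and one for values.
    return 2 * n_layers * seq_len * kv_width * bytes_per_value

# Hypothetical model: 32 layers, d_model = 4096, fp16 cache, 16K-token context.
N_LAYERS, SEQ_LEN, D_MODEL = 32, 16_384, 4096

full = kv_cache_bytes(N_LAYERS, SEQ_LEN, D_MODEL)             # full-width K and V
compressed = kv_cache_bytes(N_LAYERS, SEQ_LEN, D_MODEL // 8)  # 8x narrower K and V

print(f"full attention cache: {full / 2**30:.1f} GiB")        # 8.0 GiB
print(f"8x compressed cache:  {compressed / 2**30:.1f} GiB")  # 1.0 GiB
```

The figures are per sequence, so the gap grows linearly with batch size when serving many requests at once.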

Because the compressed space is smaller, the computational cost drops too. Zyphra reports roughly 1.7x faster prefill on H100 GPUs at 16K sequence length and a 1.3x faster backward pass. It is one of those rare optimizations that improves both speed and memory at once, where normally you trade one for the other.
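A rough operation count shows why a narrower attention space is cheaper. The sketch below counts only the two large matrix multiplications inside attention (scores and aggregation) for hypothetical widths. It ignores projections, softmax, causal masking, and memory traffic, so its ratio is an upper bound on what compression can save in those two steps, not a prediction of the measured end-to-end speedups.

```python
def attention_core_flops(seq_len, width):
    # Q @ K^T and weights @ V each cost about 2 * seq_len^2 * width FLOPs
    # (one multiply and one add per term), summed over all heads.
    return 4 * seq_len**2 * width

SEQ_LEN = 16_384        # matches the 16K prefill setting quoted above
FULL_WIDTH = 4096       # hypothetical full attention width
LATENT_WIDTH = 4096 // 8  # hypothetical 8x compressed width

full_flops = attention_core_flops(SEQ_LEN, FULL_WIDTH)
latent_flops = attention_core_flops(SEQ_LEN, LATENT_WIDTH)
print(f"core attention FLOPs ratio: {full_flops / latent_flops:.0f}x")  # 8x
```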

CCGQA (Compressed Convolutional Grouped Query Attention) combines CCA with GQA's head-sharing. The result is a flexible architecture where you can tune the compression ratio to your exact workload: more compression for memory-constrained serving, less compression for quality-critical applications.
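One way to picture this tuning knob is as a small configuration object. The sketch below is a hypothetical parameterization, not the one used by Zyphra: it assumes the latent width is the model width divided by a query compression factor, and that GQA-style sharing then shrinks the cached key/value width by the ratio of key/value heads to query heads. The class name and all field names are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LatentAttentionConfig:
    # All fields are hypothetical names for this illustration.
    d_model: int            # full model width
    query_compression: int  # latent width = d_model / query_compression
    n_heads: int            # query heads in the latent space
    n_kv_heads: int         # key/value heads shared across query groups

    def __post_init__(self):
        if self.d_model % self.query_compression:
            raise ValueError("d_model must be divisible by query_compression")
        if self.query_width % self.n_heads:
            raise ValueError("latent width must be divisible by n_heads")
        if self.n_heads % self.n_kv_heads:
            raise ValueError("n_heads must be divisible by n_kv_heads")

    @property
    def query_width(self):
        return self.d_model // self.query_compression

    @property
    def kv_width(self):
        # Head sharing: fewer key/value heads means a narrower cached tensor.
        return self.query_width * self.n_kv_heads // self.n_heads

    @property
    def cache_compression(self):
        # Relative to caching full-width keys and values.
        return self.d_model / self.kv_width

# More compression for memory-constrained serving...
memory_constrained = LatentAttentionConfig(4096, 4, n_heads=16, n_kv_heads=8)
# ...less compression for quality-critical applications.
quality_critical = LatentAttentionConfig(4096, 2, n_heads=16, n_kv_heads=16)

print(memory_constrained.cache_compression)  # 8.0
print(quality_critical.cache_compression)    # 2.0
```

In this parameterization the latent compression and the head sharing multiply, so the same 8x cache reduction could be reached with different mixes of the two.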

The librarian analogy now extends further. Standard attention is like our librarian memorizing every book verbatim. GQA stores chapter summaries. MLA stores paragraph summaries. CCA stores a single sentence that captures the essence and reconstructs details on demand — and does it so efficiently that you'd never guess the original information was compressed.
