
But Don't They Interfere?

A reasonable concern: speculative decoding and diffusion-based generation both try to speed up the same decoding loop. Can you apply both at once?

The answer is yes, but carefully. TiDAR's hybrid approach (diffusion drafting + autoregressive verification) is itself a form of speculation: it just uses the same model's diffusion mode as the draft mechanism instead of a separate draft model. The two approaches share the same fundamental structure: draft in parallel, verify in sequence.
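The draft-then-verify structure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not TiDAR's implementation: `toy_target` and `toy_drafter` are hypothetical stand-ins for the autoregressive and diffusion modes, verification is greedy (a draft token is accepted only if it matches the verifier's own choice), and the per-token loop stands in for what a real system would typically compute in one batched forward pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]           # verifier: greedy next token
Drafter = Callable[[List[Token], int], List[Token]]  # drafter: block of k proposals


def verify_block(prefix: List[Token], draft: List[Token],
                 target_next: NextToken) -> List[Token]:
    """Accept draft tokens until the first mismatch with the verifier.

    The verifier's own token is always appended, either as the correction
    at the mismatch or as a bonus token after a fully accepted block, so
    every call makes progress by at least one token.
    """
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the rejected token, drop the rest
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prompt: List[Token], drafter: Drafter, target_next: NextToken,
             max_new_tokens: int, block_size: int = 16) -> List[Token]:
    """Repeat draft-then-verify until enough new tokens exist."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        draft = drafter(out, block_size)
        out.extend(verify_block(out, draft, target_next))
    return out[:len(prompt) + max_new_tokens]


def greedy(prompt: List[Token], target_next: NextToken, n: int) -> List[Token]:
    """Plain one-token-at-a-time decoding, used as the reference."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out


# Hypothetical toy models: the verifier counts upward modulo 10; the drafter
# does the same but deliberately proposes 0 whenever the right answer is 7.
def toy_target(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


def toy_drafter(seq: List[Token], k: int) -> List[Token]:
    cur, out = list(seq), []
    for _ in range(k):
        nxt = (cur[-1] + 1) % 10
        if nxt == 7:
            nxt = 0  # deliberate drafting error
        out.append(nxt)
        cur.append(nxt)
    return out
```

Because every accepted token equals the verifier's own greedy choice, `generate([0], toy_drafter, toy_target, 12)` returns the same sequence as `greedy([0], toy_target, 12)`. The drafter's mistakes cost only speed, not correctness.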

In a truly stacked system, you might use:

  1. CCGQA for attention compression (reducing memory per KV entry)
  2. INT4 quantization for the compressed cache (further reducing memory)
  3. TiDAR diffusion for draft generation (producing blocks of 16 draft tokens)
  4. Autoregressive verification for acceptance (guaranteeing correctness)
  5. Prefix caching for shared prompt reuse (eliminating redundant pre-compute)
  6. Prompt compression for long inputs (shortening the prefill phase before any of the above starts)

Each technique operates at a different layer of the stack. CCGQA changes the attention mechanism itself. Quantization changes the storage format. TiDAR changes the generation algorithm, with autoregressive verification as its acceptance step. Prefix caching changes the caching strategy, and prompt compression shortens the input before any of the others run. They don't touch each other's code paths; they address different constraints.
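To see why the two memory techniques compound rather than collide, a back-of-the-envelope calculation helps. The sketch below uses the standard KV-cache size formula. Every number in it is hypothetical: the model shape is made up, the 4x `compression` factor is an illustrative placeholder rather than a measured CCGQA figure, and INT4 is counted as exactly 4 bits per value, ignoring the scale and zero-point overhead that real quantization schemes carry.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                   bits_per_value: int, compression: float = 1.0) -> float:
    """Approximate KV-cache size in bytes for a single sequence.

    `compression` is a hypothetical factor standing in for attention-level
    compression (fewer stored values per token); `bits_per_value` models
    the storage format (16 for FP16, 4 for INT4).
    """
    values = 2 * layers * kv_heads * head_dim * seq_len  # keys + values
    return values / compression * bits_per_value / 8


# Hypothetical model shape, chosen only to give round numbers.
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=8192)

baseline = kv_cache_bytes(**shape, bits_per_value=16)
stacked = kv_cache_bytes(**shape, bits_per_value=4, compression=4.0)

MIB = 2 ** 20
print(f"baseline: {baseline / MIB:.0f} MiB")
print(f"stacked:  {stacked / MIB:.0f} MiB")
print(f"reduction: {baseline / stacked:.0f}x")
```

With these made-up numbers the cache shrinks from 1 GiB to 64 MiB. The two factors multiply (4x from fewer stored values, 4x from narrower storage) because one changes how many values are kept and the other changes how each value is stored.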
