Fast-DLLM: Training-Free Acceleration of Diffusion LLM (arxiv.org)
63 points by nathan-barry 22 hours ago | 4 comments




Wait, from everything I’ve read about Diffusion Language Models, and the demos I’ve seen and tried, inference is faster than with traditional architectures. They state the opposite. What gives?

That’s because those demos probably use parallel decoding. In principle, dLLM inference is slower, since you have to run bidirectional generation over the whole generation window for each diffusion step. Example: if you unmask one token per step in a 128-token window, you need 128 diffusion steps to generate the full window.
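
For intuition, here is a minimal sketch of that worst case, assuming a toy bidirectional denoiser that returns per-position logits; the function names, model interface, and mask_id are placeholders for illustration, not any real library's API:

    import torch

    def naive_dllm_decode(model, prompt_ids, window=128, mask_id=0):
        # Prompt followed by a fully masked generation window.
        x = torch.cat([prompt_ids, torch.full((window,), mask_id)])
        for _ in range(window):              # one diffusion step per generated token
            logits = model(x)                # bidirectional pass over the whole sequence
            conf, pred = logits.softmax(-1).max(-1)
            conf = conf.masked_fill(x != mask_id, -1.0)  # only still-masked positions compete
            i = conf.argmax()                # unmask the single most confident token
            x[i] = pred[i]
        return x[len(prompt_ids):]

Each of those 128 steps is a full forward pass over the whole window, which is where the slowdown comes from.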

In particular, part of the paper is about dynamically adjusting the number of tokens generated in parallel while maintaining roughly the same output quality as one-token-at-a-time decoding. The other part is about the KV caching strategy they use to speed up parallel decoding further.
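
The "dynamically adjusting" part is, roughly, a confidence-based rule: unmask every masked position whose prediction is confident enough, instead of exactly one per step. Here is an illustrative sketch of that general idea only, not the paper's exact algorithm; the 0.9 threshold, names, and model interface are assumptions for the example:

    import torch

    def parallel_decode(model, prompt_ids, window=128, mask_id=0, threshold=0.9):
        x = torch.cat([prompt_ids, torch.full((window,), mask_id)])
        while (x == mask_id).any():
            logits = model(x)                      # still one bidirectional pass per step
            conf, pred = logits.softmax(-1).max(-1)
            masked = (x == mask_id)
            accept = masked & (conf >= threshold)  # commit all confident masked positions at once
            if not accept.any():
                # Fall back to the single most confident masked token so the loop terminates.
                accept = torch.zeros_like(masked)
                accept[conf.masked_fill(~masked, -1.0).argmax()] = True
            x[accept] = pred[accept]
        return x[len(prompt_ids):]

When many positions clear the threshold, the window finishes in far fewer than 128 steps, which is what recovers (and beats) the speed you saw in the demos.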

What is parallel decoding?


