Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding
Abstract
Masked Diffusion Language Models generate sequences via iterative sampling that progressively unmasks tokens. However, they still recompute the attention and feed-forward blocks for every token position at every step, even when many unmasked tokens are essentially fixed, resulting in substantial wasted compute. We propose SureLock: when the posterior at an unmasked position has stabilized across steps (our "sure" condition), we lock that position, thereafter skipping its query projection and feed-forward sublayers, while caching its attention keys and values so that other positions can continue to attend to it. This reduces the dominant per-iteration computational cost from $O(N^2 d)$ to $O(MNd)$, where N is the sequence length, M is the number of unlocked token positions, and d is the model dimension. In practice, M decreases as sampling progresses, yielding substantial savings. On LLaDA-8B, SureLock reduces algorithmic FLOPs by 30-50% relative to the same sampler without locking, while maintaining comparable generation quality. We also provide a theoretical analysis justifying SureLock's design: monitoring only the local KL divergence at the lock step suffices to bound the deviation in the final token probabilities.
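To make the complexity claim concrete, here is a rough per-layer, per-step cost accounting under standard transformer assumptions (our own sketch, not a derivation from the paper):

\[
\underbrace{O(N^{2} d)}_{\text{attention: } N \text{ queries} \times N \text{ keys}}
+ \underbrace{O(N d^{2})}_{\text{projections + FFN}}
\;\;\longrightarrow\;\;
\underbrace{O(M N d)}_{M \text{ queries} \times N \text{ cached K/V}}
+ \underbrace{O(M d^{2})}_{\text{projections + FFN for } M \text{ positions}},
\]

since locked positions skip their query projection and FFN but keep cached keys/values that the remaining $M$ active queries can still attend to.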
Methodology
Conceptual illustration of iterative sampling. (a) The standard sampler (baseline) recomputes attention scores and FFN sublayers for every token position at every step, even after many tokens have been unmasked. (b) SureLock permanently stops recomputation for locked positions once they stabilize; their cached K/V still allow other tokens to attend to them.
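The per-position skipping in panel (b) can be summarized with a small sketch. The snippet below is an illustrative PyTorch-style rendering (not the authors' implementation) of one attention + FFN sublayer pass: only unlocked positions compute queries, refresh their cached keys/values, and run the FFN, while the cached K/V of locked positions remain visible to all queries. All names (layer_step, w_q, w_k, w_v, w_o, ffn, k_cache, v_cache, unlocked) are ours.

```python
import torch

def layer_step(x, unlocked, w_q, w_k, w_v, w_o, ffn, k_cache, v_cache):
    """x: (N, d) hidden states; unlocked: (N,) bool mask of active positions.
    Layer norms are omitted for brevity."""
    xu = x[unlocked]                              # (M, d) active hidden states
    # Refresh K/V only for unlocked positions; locked rows keep cached values.
    k_cache[unlocked] = xu @ w_k
    v_cache[unlocked] = xu @ w_v
    q = xu @ w_q                                  # queries for M positions only
    # M queries attend over all N cached keys/values: cost O(M * N * d).
    scores = q @ k_cache.T / k_cache.shape[-1] ** 0.5
    attn = torch.softmax(scores, dim=-1)
    xu = xu + (attn @ v_cache) @ w_o              # residual + output projection
    xu = xu + ffn(xu)                             # FFN only for M positions
    out = x.clone()
    out[unlocked] = xu                            # locked states left untouched
    return out, k_cache, v_cache
```

With M unlocked positions, the score matrix is M×N rather than N×N, which is where the $O(MNd)$ term in the abstract comes from.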
Algorithm of our method, SureLock.
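For reference, the following is a hedged Python sketch of the outer sampling loop as we read the description above: after each step, an unmasked but still-unlocked position is locked once the KL divergence between its previous and current posteriors falls below the threshold ε. The model call signature, the unmask_step argument, and all variable names are hypothetical placeholders, not the released code.

```python
import torch
import torch.nn.functional as F

def surelock_sample(model, x, mask_id, steps, eps, unmask_step):
    """x: (N,) token ids, masked positions hold mask_id; eps: locking threshold.
    unmask_step: sampler-specific rule that commits some masked positions."""
    N = x.shape[0]
    locked = torch.zeros(N, dtype=torch.bool)
    prev_logp = [None] * N                        # last posterior seen per position
    for t in range(steps):
        unlocked = ~locked
        # Forward pass recomputes only unlocked positions (hypothetical signature);
        # locked positions are served from cached K/V inside the model.
        logp = F.log_softmax(model(x, unlocked=unlocked), dim=-1)   # (N, vocab)
        x = unmask_step(x, logp, mask_id, t, steps)
        for i in torch.nonzero(unlocked).flatten().tolist():
            if x[i] == mask_id:                   # still masked: cannot be "sure" yet
                continue
            if prev_logp[i] is not None:
                # Local KL between the previous and current posteriors at position i.
                kl = F.kl_div(logp[i], prev_logp[i], log_target=True, reduction="sum")
                if kl < eps:                      # "sure" condition met: lock position i
                    locked[i] = True
            prev_logp[i] = logp[i]
    return x
```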
Experiments
Step-wise FLOPs ratio. The ratio of step-wise algorithmic FLOPs consistently decreases as steps proceed, which explains the computational savings at later steps. ε denotes the locking threshold.
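One way to read this trend (a back-of-the-envelope relation, not a claim from the paper): if $M_t$ positions remain unlocked at step $t$, the per-layer cost model above gives

\[
\frac{\text{FLOPs}_{\text{SureLock}}(t)}{\text{FLOPs}_{\text{baseline}}(t)}
\approx \frac{M_t N d + M_t d^{2}}{N^{2} d + N d^{2}} = \frac{M_t}{N},
\]

so the ratio falls as more positions satisfy the sure condition and get locked.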
Throughput behavior with SureLock. The per-step end-to-end tokens-per-second (TPS) ratio increases as sampling progresses.
Comparison of responses from the baseline and SureLock on LLaDA-8B-Instruct. The question is sampled from MT-bench (question id=119).