Discrete Diffusion Models (DDMs) like LLaDA generate text by iteratively refining a sequence. Two opposing failure modes plague standard sampling strategies:
- Temporal Oscillation (Flickering): The model wavers between valid candidates (e.g., "happy" vs. "glad") across steps. Standard Low-Confidence Remasking (LCR) fails to dampen this because it is memoryless.
- Stubbornness (Hallucination Lock-in): Running Confidence Remasking (RCR) solves flickering by taking the maximum historical confidence, $S_t = \max(C_t, S_{t-1})$. However, this creates "Stubbornness": if the model hallucinates with high confidence early on, RCR ignores subsequent drops in confidence, locking in the error.
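The difference between the two strategies is easy to state in code. A minimal sketch (the function names are ours, not from any released implementation):

```python
def lcr_score(conf_t):
    """Low-Confidence Remasking: memoryless -- the score is just C_t."""
    return conf_t

def rcr_score(conf_t, s_prev):
    """Running Confidence Remasking: max-pool over history, S_t = max(C_t, S_{t-1})."""
    return max(conf_t, s_prev)

# A token that hallucinated confidently, then dropped: RCR never forgets.
s = 0.0
for c in [0.9, 0.3, 0.2]:
    s = rcr_score(c, s)
# s stays at the early 0.9, so the error is locked in, while
# lcr_score would report 0.2 and allow the token to be re-masked.
```

LCR's memorylessness causes flickering; RCR's infinite memory causes stubbornness. ASMS sits between the two.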
ASMS treats the sampling process as a Kinetic Control Problem. We apply momentum to the confidence trajectory, modulated by semantic stability and entropy, to achieve "Elastic Stability"—resisting noise while yielding to strong negative evidence.
Standard momentum in continuous space (e.g., SGD with momentum, $v_t = \beta v_{t-1} + g_t$) smooths a trajectory by accumulating past updates. The Momentum Update Rule becomes:

$$d_t = \alpha \, \Delta C_t + \kappa \, \beta \, \mathcal{S}_t \, d_{t-1}$$

where $\Delta C_t = C_t - C_{t-1}$ is the step-to-step change in confidence, $\alpha$ scales the instantaneous change, $\beta$ is the momentum decay, $\mathcal{S}_t$ is the cosine similarity between the current and previous token embeddings (semantic stability), and $\kappa$ is the entropy-based gate.
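The update rule can be sketched in a few lines of Python (the coefficient values below are illustrative assumptions, not tuned settings):

```python
def momentum_update(d_prev, delta_c, sim, alpha=0.9, beta=0.5, kappa=1.0):
    """One step of d_t = alpha * dC_t + kappa * beta * S_t * d_{t-1}.

    sim (cosine similarity of consecutive token embeddings) gates the
    history: if the token's meaning changed, accumulated momentum decays.
    """
    return alpha * delta_c + kappa * beta * sim * d_prev

# A semantically stable token with steadily rising confidence
# accumulates positive momentum...
d_stable = 0.0
for delta in (0.1, 0.1, 0.1):
    d_stable = momentum_update(d_stable, delta, sim=1.0)

# ...while a flickering token (meaning flips, sim ~ 0) cannot
# carry momentum across steps.
d_flicker = momentum_update(0.5, 0.1, sim=0.0)
```

The semantic gate is what dampens flickering: oscillating tokens keep resetting their own momentum.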
If embedding computation is too costly, or if the embedding space is anisotropic, we can disable semantic gating (fix $\mathcal{S}_t = 1$), reducing the update to plain momentum on the confidence trajectory.
Oscillation typically occurs at low normalized entropy (binary conflicts, where the probability mass is split between two candidates out of a large vocabulary). Using normalized entropy to set the gate $\kappa$, momentum is applied most strongly in exactly these low-entropy conflicts and relaxed when the distribution is genuinely uncertain.
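A sketch of the entropy side. The exact gate mapping below ($\kappa = 1 - H_{norm}$) is our assumption; the text only says entropy modulates the momentum:

```python
import math

def normalized_entropy(probs):
    """Shannon entropy divided by log|V|, so the result lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

# A binary conflict: all probability mass split over 2 of 1000 entries.
probs = [0.5, 0.5] + [0.0] * 998
h = normalized_entropy(probs)   # low normalized entropy despite total ambiguity
kappa = 1.0 - h                 # one plausible gate: strong momentum here
```

Note how a maximally ambiguous two-way conflict still has low *normalized* entropy over a large vocabulary, which is what makes it a usable trigger for the gate.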
Standard momentum applies equal inertia whether confidence is rising or falling. RCR applies infinite inertia (max-pooling) only when rising. Elastic Mode bridges these by actively punishing drops in confidence.
Instead of just decaying the history, we asymmetrically scale the current change ($\Delta C_t$) and the history buffer ($d_{t-1}$). The coefficients depend on the direction of change:
| Condition | Rising ($\Delta C_t \geq 0$) | Falling ($\Delta C_t < 0$) |
|---|---|---|
| Input Scale (on $\Delta C_t$) | $\lambda_{up}$ | $\lambda_{down}$ |
| Buffer Scale (on $d_{t-1}$) | $\mu_{up}$ | $\mu_{down}$ |
By setting $\lambda_{down} > 1$, even a small drop in confidence is amplified into a large negative contribution to the momentum buffer.
- Result: The token's score tanks, pushing it to the bottom of the sorting queue.
- Benefit: This solves RCR's stubbornness. If the model doubts a token even slightly, Elastic Mode flushes it out immediately, preventing lock-in.
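The asymmetric update can be sketched as follows (the $\lambda_{down}$ value is illustrative; semantic and entropy gates are omitted for brevity):

```python
def elastic_update(d_prev, delta_c, alpha=0.9, beta=0.5, lam_down=5.0):
    """Elastic Mode: drops in confidence are amplified by lam_down before
    entering the momentum buffer -- 'active punishment' of doubt."""
    if delta_c < 0:
        delta_c *= lam_down
    return alpha * delta_c + beta * d_prev

# Two small rises, then a single dip of the same magnitude.
d = 0.0
for delta in (0.1, 0.1, -0.1):
    d = elastic_update(d, delta)
# The one amplified dip flips the momentum negative, tanking the
# score S_t = C_t + lam * d_t and pushing the token down the queue.
```

Unlike RCR's max-pooling, one step of doubt is enough to outweigh several steps of accumulated confidence.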
While standard diffusion sampling (MaskGIT) relies on "Iterative Correction" (unmasking and re-masking), ASMS achieves superior results (78% vs. 60%) using Precision Monotonicity.
- Adaptive Sorting: Instead of correcting mistakes, ASMS focuses on ordering commitments correctly.
- The Mechanism:
  - The "Penalty Box" (Elastic Mode) ensures unstable tokens (hallucinations) have very low scores.
  - These tokens are forced to the back of the unmasking queue.
  - They remain masked until the very end, when maximum context is available to resolve the ambiguity.
- Conclusion: In reasoning tasks (GSM8K), preventing early errors via strict sorting is superior to trying to "erase" errors later.
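The sorting mechanism itself is just a ranking over scores; a minimal sketch:

```python
def unmask_order(scores, k):
    """Monotonic unmasking: commit only the k highest-scoring positions.
    Penalized (unstable) tokens sink to the back of the queue and stay
    masked until late steps, when more context is available."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Position 1 was punished by Elastic Mode (negative momentum).
scores = [0.95, -0.38, 0.72, 0.10]
committed = unmask_order(scores, k=2)   # positions 0 and 2; 1 stays masked
```

Ordering commitments correctly, rather than correcting them later, is the whole bet of Precision Monotonicity.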
- Compute Raw Confidence: $C_t = P(x_t \mid x_t^{masked})$.
- Compute Similarity: $\mathcal{S}_t = \mathrm{CosSim}(x_t, x_{t-1})$.
- Calculate Delta: $\Delta C_t = C_t - C_{t-1}$.
- Apply Elastic Scales: amplify $\Delta C_t$ by $\lambda_{down}$ if negative.
- Update Momentum: $d_t = \alpha \Delta C_t + \kappa \beta \mathcal{S}_t d_{t-1}$.
- Score & Sort: $S_t = C_t + \lambda d_t$.
- Unmask: unmask the top-$k$ highest-scoring tokens (Monotonic) OR re-mask the bottom-$k$ (Iterative).
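Putting the steps together for a single position, as a sketch (hyperparameter values and the simplified gating are our assumptions):

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def asms_score(conf, conf_prev, d_prev, emb, emb_prev,
               alpha=0.9, beta=0.5, lam=1.0, lam_down=5.0, kappa=1.0):
    """One ASMS scoring step for a single token position."""
    sim = cos_sim(emb, emb_prev)                      # 2. semantic similarity
    delta = conf - conf_prev                          # 3. confidence delta
    if delta < 0:                                     # 4. elastic amplification
        delta *= lam_down
    d = alpha * delta + kappa * beta * sim * d_prev   # 5. momentum update
    return conf + lam * d, d                          # 6. final sorting score

score, d = asms_score(0.8, 0.7, 0.0, emb=[1.0, 0.0], emb_prev=[1.0, 0.0])
```

Step 7 then sorts all positions by `score` and unmasks the top-$k$ (Monotonic) or re-masks the bottom-$k$ (Iterative).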