Click the links below to view our paper:
If you find this work useful, please cite our paper and give this repo a star 🌟
```bibtex
@misc{dai2026revealingattentionfloatingmechanism,
      title={Revealing the Attention Floating Mechanism in Masked Diffusion Models},
      author={Xin Dai and Pengcheng Huang and Zhenghao Liu and Shuo Wang and Yukun Yan and Chaojun Xiao and Yu Gu and Ge Yu and Maosong Sun},
      year={2026},
      eprint={2601.07894},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.07894},
}
```
Attention floating is a mechanistic perspective on how masked diffusion models (MDMs) allocate attention under iterative denoising with bidirectional visibility. Unlike autoregressive models, where attention often collapses into a rigid early-token sink that can bias information flow and exacerbate lost-in-the-middle behavior, MDMs exhibit distributed attention anchors that drift across layers and denoising steps. We further show a Shallow Structure-Aware, Deep Content-Focused pattern: shallow layers rely on structurally salient floating tokens to scaffold global organization, while deeper layers increasingly shift capacity toward semantically informative content, yielding stronger context tracking and large gains on knowledge-intensive tasks.
```shell
conda create --name attention_floating python==3.13
conda activate attention_floating
git clone https://github.com/NEUIR/Attention_Floating.git
cd Attention_Floating
pip install -r requirement.txt
```
We first conduct a comprehensive study of MDMs, including LLaDA and Dream. We provide (i) analysis/visualization scripts for attention absorption, temporal drift, QK decomposition, retrieval-head analysis, and region-level attention flow, and (ii) evaluation scripts for knowledge-intensive QA with/without RAG.
Create temporal heatmaps over denoising steps (MDMs) from step_attention_data:
```shell
python visualization/create_temporal.py \
    --npz /path/to/sample_attentions.npz \
    --output /path/to/out \
    --layer 0 \
    --model /path/to/model
```
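As a rough sketch of what such a temporal heatmap involves, the snippet below stacks per-step attention from one layer into a `[steps, seq_len]` matrix. The `.npz` key names (`step_0`, `step_1`, ...) and the `[layers, heads, seq, seq]` array layout are assumptions for illustration; adapt them to what `step_attention_data` actually saves.

```python
import numpy as np

def temporal_heatmap(npz_path, layer, head=None):
    """Stack per-step attention from one layer into a [steps, seq_len] matrix.

    Assumes the .npz stores one array per denoising step under keys like
    "step_0", "step_1", ... with shape [layers, heads, seq, seq]
    (hypothetical layout).
    """
    data = np.load(npz_path)
    step_keys = sorted((k for k in data.files if k.startswith("step_")),
                       key=lambda k: int(k.split("_")[1]))
    rows = []
    for key in step_keys:
        att = data[key][layer]                                     # [heads, seq, seq]
        att = att[head] if head is not None else att.mean(axis=0)  # [seq, seq]
        rows.append(att.mean(axis=0))                              # mean over queries -> [seq]
    return np.stack(rows)                                          # [steps, seq]
```
Plotting the returned matrix (e.g. with `matplotlib.pyplot.imshow`) gives one row per denoising step, so drifting anchors appear as bright columns that move.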
*(Figures: temporal attention heatmaps for (a) Layer 0 and (b) Layer 31.)*
We observe that:
- Step-dependent anchors: attention floating exists and gradually shifts right over denoising steps at each layer.
- Layer-dependent anchors: shallow layers show more spread-out attention with multiple active anchors; deep layers become much sparser and concentrate on fewer, sharper anchor positions.
- Task-dependent anchors: the floating positions differ across tasks.
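The step-dependent drift in the first observation can be made concrete by tracking, at each denoising step, which key position receives the most total attention mass. The list-of-arrays input layout below is a hypothetical simplification:

```python
import numpy as np

def anchor_positions(step_attentions):
    """Track where attention mass concentrates at each denoising step.

    step_attentions: list of [heads, seq, seq] post-softmax attention arrays,
    one per step (hypothetical layout). Returns, per step, the key position
    receiving the most total attention; a rightward drift of the floating
    anchor shows up as increasing indices over steps.
    """
    anchors = []
    for att in step_attentions:
        col_mass = att.mean(axis=0).sum(axis=0)  # total mass received per key position
        anchors.append(int(col_mass.argmax()))
    return anchors
```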
We quantify how much attention mass is absorbed by sink/floating positions as the average attention weight that all queries assign to a candidate position:

$$
\mathrm{Absorb}(p) = \frac{1}{H\,T}\sum_{h=1}^{H}\sum_{i=1}^{T} A^{(h)}_{i,p},
$$

where $A^{(h)}_{i,p}$ is the post-softmax attention weight from query $i$ to position $p$ in head $h$, $H$ is the number of heads, and $T$ is the sequence length.
```shell
python visualization/absorption_comparison.py
```
ARMs induce a rigid concentration of attention around the sink token `<BOS>`, whereas MDMs display a weaker and more distributed absorption pattern.
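A minimal sketch of such an absorption measurement, assuming a single layer's post-softmax attention tensor of shape `[heads, seq, seq]` (an illustrative layout, not the script's actual interface):

```python
import numpy as np

def absorption(attn, positions):
    """Mean attention mass that queries place on candidate sink/floating positions.

    attn: [heads, seq, seq] post-softmax attention (each query row sums to 1).
    positions: list of candidate anchor indices.
    Averages over all heads and all query positions.
    """
    return float(attn[:, :, positions].sum(axis=-1).mean())
```
Comparing this quantity at `<BOS>` for an ARM against the top floating positions of an MDM exposes the rigid-vs-distributed contrast described above.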
To systematically understand Attention Floating in MDMs, we start from the pre-softmax attention scoring mechanism (QK). Prior work on attention sinks in ARMs shows that the salience of sink positions mainly comes from a systematic advantage in the directional term (i.e., the cosine of the angle between query and key vectors).
Motivated by this, we decompose the QK dot product as:

$$
QK = \|Q\|\,\|K\|\cos\theta,
$$

where $\|Q\|\,\|K\|$ is the norm (magnitude) term, $\cos\theta$ is the directional term, and $\theta$ is the angle between the query and key vectors.
```shell
python visualization/QK_decomposition.py
```
The QK advantage of floating positions evolves from a combined effect of angular alignment and norm amplification in shallow layers to a primarily angle-driven advantage in deeper layers.
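The decomposition above can be sketched for a single query against a set of keys; toy vector shapes are assumed here for illustration:

```python
import numpy as np

def qk_decompose(q, K):
    """Split pre-softmax scores q.k into a norm term and a directional term.

    q: [d] query vector; K: [seq, d] key matrix (toy shapes).
    Returns (norm_term, cos_term) such that scores = norm_term * cos_term,
    separating norm amplification from angular alignment.
    """
    scores = K @ q                                          # raw dot products, [seq]
    norm_term = np.linalg.norm(q) * np.linalg.norm(K, axis=1)
    cos_term = scores / np.maximum(norm_term, 1e-12)        # guard zero norms
    return norm_term, cos_term
```
Inspecting which factor dominates at the floating positions, layer by layer, is exactly the kind of comparison the decomposition enables.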
To verify the Shallow Structure-Aware, Deep Content-Focused hypothesis, we further analyze retrieval-specialized attention heads following Retrieval Head Mechanistically Explains Long-Context Factuality (Wu et al., 2024). Concretely, we assign each attention head a retrieval score based on final answer generation. A higher score means the head more consistently focuses on the key evidence while producing the answer, i.e., it behaves more like a retrieval/evidence-tracking head. We visualize retrieval scores as heatmaps over layers and heads (two figures below):
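A simplified version of such a retrieval score can be sketched as follows: the fraction of answer-generation steps at which a head's top-1 attended context position falls inside the gold evidence span. The input shapes are hypothetical, and this is a coarser criterion than the token-copying test of Wu et al. (2024):

```python
import numpy as np

def retrieval_score(head_attn, evidence_span):
    """Simplified retrieval score for one attention head.

    head_attn: [num_answer_tokens, context_len] attention rows collected
    while the answer tokens are produced (hypothetical shapes).
    evidence_span: (start, end) token indices of the gold evidence.
    Returns the fraction of answer tokens whose top-1 attended context
    position lies inside [start, end).
    """
    start, end = evidence_span
    top1 = head_attn.argmax(axis=-1)
    return float(((top1 >= start) & (top1 < end)).mean())
```
Heads scoring high under this criterion behave like the evidence-tracking heads visualized in the heatmaps.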
*(Figures: retrieval-score heatmaps over layers and heads for (a) LLaDA and (b) Dream.)*
We evaluate ARMs (LLaMA, Qwen) and MDMs (Dream, LLaDA) on NQ / TQA / MarcoQA / HotpotQA / T-REx with a unified prompt template.
```shell
python evaluate/evaluate_*.py
```
Note: the evaluation scripts contain placeholder paths (e.g., `dataset_dir`, `model_path`, `output_dir`). Please edit them before running.
We observe that MDMs w/ RAG achieve an average improvement of over 19.5% compared to their corresponding baseline models, more than twice the gain observed for ARMs when augmented with retrieval (ARMs w/ RAG obtain an 8.5% improvement).
We pool token-level attention into coarse regions (BOS / Query / Docs / Answer) and apply rollout:
```shell
python visualization/attention_rollout.py
```
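A minimal sketch of region-pooled attention rollout, assuming head-averaged `[seq, seq]` attention matrices per layer and hypothetical region spans (the actual script's interface may differ). Rollout multiplies residual-corrected attention across layers; pooling then averages source rows and sums target columns within each region:

```python
import numpy as np

def rollout_region_flow(layer_attns, regions):
    """Attention rollout pooled into coarse regions (e.g., BOS/Query/Docs/Answer).

    layer_attns: list of [seq, seq] head-averaged attention matrices,
                 ordered shallow to deep.
    regions: dict name -> (start, end) token spans, e.g. {"BOS": (0, 1), ...}
             (hypothetical spans).
    """
    seq = layer_attns[0].shape[0]
    roll = np.eye(seq)
    for A in layer_attns:
        A = A + np.eye(seq)                      # account for residual connection
        A = A / A.sum(axis=-1, keepdims=True)    # re-normalize rows
        roll = A @ roll                          # compose flow across layers
    names = list(regions)
    pooled = np.zeros((len(names), len(names)))
    for i, src in enumerate(names):
        s0, s1 = regions[src]
        for j, tgt in enumerate(names):
            t0, t1 = regions[tgt]
            # sum mass flowing into the target region, averaged over source rows
            pooled[i, j] = roll[s0:s1, t0:t1].sum(axis=-1).mean()
    return names, pooled
```
Each row of the pooled matrix sums to 1, so entries read directly as "what fraction of a region's attention flow ends up in each other region".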
*(Figures: region-level attention flow for (a) LLaDA, gold doc at position 1; (b) LLaDA, gold doc at position 5; (c) LLaDA, gold doc at position 10; (d) LLaMA, gold doc at positions 1/5/10.)*
If you have questions, suggestions, or bug reports, please email:
daix1@mails.neu.edu.cn