Click the links below to view our paper:
If you find this work useful, please cite our paper and give this repo a star 🌟
```bibtex
@misc{dai2026revealingattentionfloatingmechanism,
      title={Revealing the Attention Floating Mechanism in Masked Diffusion Models},
      author={Xin Dai and Pengcheng Huang and Zhenghao Liu and Shuo Wang and Yukun Yan and Chaojun Xiao and Yu Gu and Ge Yu and Maosong Sun},
      year={2026},
      eprint={2601.07894},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.07894},
}
```
Attention floating is a mechanistic perspective on how masked diffusion models (MDMs) allocate attention under iterative denoising with bidirectional visibility. Unlike autoregressive models, where attention often collapses into a rigid early-token sink that can bias information flow and exacerbate lost-in-the-middle behavior, MDMs exhibit distributed attention anchors that drift across layers and denoising steps. We further show a Shallow Structure-Aware, Deep Content-Focused pattern: shallow layers rely on structurally salient floating tokens to scaffold global organization, while deeper layers increasingly shift capacity toward semantically informative content, yielding stronger context tracking and large gains on knowledge-intensive tasks.
```shell
conda create --name attention_floating python==3.13
conda activate attention_floating
git clone https://github.com/NEUIR/Attention_Floating.git
cd Attention_Floating
pip install -r requirement.txt
```
We first conduct a comprehensive study of MDMs, including LLaDA and Dream. We provide (i) analysis/visualization scripts for attention absorption, temporal drift, QK decomposition, retrieval-head analysis, and region-level attention flow, and (ii) evaluation scripts for knowledge-intensive QA with/without RAG.
Create temporal heatmaps over denoising steps (MDMs) from step_attention_data:
```shell
python visualization/create_temporal.py \
    --npz /path/to/sample_attentions.npz \
    --output /path/to/out \
    --layer 0 \
    --model /path/to/model
```
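As a rough sketch of what such a temporal heatmap involves, the snippet below stacks per-step attention from one layer into a `[steps, seq_len]` matrix. The `.npz` key names (`step_0`, `step_1`, ...) and the `[layers, heads, seq, seq]` array layout are assumptions for illustration; adapt them to what `step_attention_data` actually saves.

```python
import numpy as np

def temporal_heatmap(npz_path, layer, head=None):
    """Stack per-step attention from one layer into a [steps, seq_len] matrix.

    Assumes the .npz stores one array per denoising step under keys like
    "step_0", "step_1", ... with shape [layers, heads, seq, seq]
    (hypothetical layout).
    """
    data = np.load(npz_path)
    step_keys = sorted((k for k in data.files if k.startswith("step_")),
                       key=lambda k: int(k.split("_")[1]))
    rows = []
    for key in step_keys:
        att = data[key][layer]                                     # [heads, seq, seq]
        att = att[head] if head is not None else att.mean(axis=0)  # [seq, seq]
        rows.append(att.mean(axis=0))                              # mean over queries -> [seq]
    return np.stack(rows)                                          # [steps, seq]
```
Plotting the returned matrix (e.g. with `matplotlib.pyplot.imshow`) gives one row per denoising step, so drifting anchors appear as bright columns that move.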
*(Figures: temporal attention heatmaps for (a) Layer 0 and (b) Layer 31.)*
We observe that:
- Step-dependent anchors: attention floating exists and gradually shifts right over denoising steps at each layer.
- Layer-dependent anchors: shallow layers show more spread-out attention with multiple active anchors; deep layers become much sparser and concentrate on fewer, sharper anchor positions.
- Task-dependent anchors: the floating positions differ across tasks.
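The step-dependent drift in the first observation can be made concrete by tracking, at each denoising step, which key position receives the most total attention mass. The list-of-arrays input layout below is a hypothetical simplification:

```python
import numpy as np

def anchor_positions(step_attentions):
    """Track where attention mass concentrates at each denoising step.

    step_attentions: list of [heads, seq, seq] post-softmax attention arrays,
    one per step (hypothetical layout). Returns, per step, the key position
    receiving the most total attention; a rightward drift of the floating
    anchor shows up as increasing indices over steps.
    """
    anchors = []
    for att in step_attentions:
        col_mass = att.mean(axis=0).sum(axis=0)  # total mass received per key position
        anchors.append(int(col_mass.argmax()))
    return anchors
```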
We quantify how much attention mass is absorbed by sink/floating positions as the average attention weight that all queries assign to a candidate position:

$$
\mathrm{Absorb}(p) = \frac{1}{H\,T}\sum_{h=1}^{H}\sum_{i=1}^{T} A^{(h)}_{i,p},
$$

where $A^{(h)}_{i,p}$ is the post-softmax attention weight from query $i$ to position $p$ in head $h$, $H$ is the number of heads, and $T$ is the sequence length.
```shell
python visualization/absorption_comparison.py
```
ARMs induce a rigid concentration of attention around the sink token `<BOS>`, whereas MDMs display a weaker and more distributed absorption pattern.
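A minimal sketch of such an absorption measurement, assuming a single layer's post-softmax attention tensor of shape `[heads, seq, seq]` (an illustrative layout, not the script's actual interface):

```python
import numpy as np

def absorption(attn, positions):
    """Mean attention mass that queries place on candidate sink/floating positions.

    attn: [heads, seq, seq] post-softmax attention (each query row sums to 1).
    positions: list of candidate anchor indices.
    Averages over all heads and all query positions.
    """
    return float(attn[:, :, positions].sum(axis=-1).mean())
```
Comparing this quantity at `<BOS>` for an ARM against the top floating positions of an MDM exposes the rigid-vs-distributed contrast described above.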
To systematically understand Attention Floating in MDMs, we start from the pre-softmax attention scoring mechanism (QK). Prior work on attention sinks in ARMs shows that the salience of sink positions mainly comes from a systematic advantage in the directional term (i.e., the cosine of the angle between query and key vectors).
Motivated by this, we decompose the QK dot product as:

$$
QK = \|Q\|\,\|K\|\cos\theta,
$$

where $\|Q\|\,\|K\|$ is the norm (magnitude) term, $\cos\theta$ is the directional term, and $\theta$ is the angle between the query and key vectors.
```shell
python visualization/QK_decomposition.py
```
The QK advantage of floating positions evolves from a combined effect of angular alignment and norm amplification in shallow layers to a primarily angle-driven advantage in deeper layers.
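The decomposition above can be sketched for a single query against a set of keys; toy vector shapes are assumed here for illustration:

```python
import numpy as np

def qk_decompose(q, K):
    """Split pre-softmax scores q.k into a norm term and a directional term.

    q: [d] query vector; K: [seq, d] key matrix (toy shapes).
    Returns (norm_term, cos_term) such that scores = norm_term * cos_term,
    separating norm amplification from angular alignment.
    """
    scores = K @ q                                          # raw dot products, [seq]
    norm_term = np.linalg.norm(q) * np.linalg.norm(K, axis=1)
    cos_term = scores / np.maximum(norm_term, 1e-12)        # guard zero norms
    return norm_term, cos_term
```
Inspecting which factor dominates at the floating positions, layer by layer, is exactly the kind of comparison the decomposition enables.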
To verify the Shallow Structure-Aware, Deep Content-Focused hypothesis, we further analyze retrieval-specialized attention heads following Retrieval Head Mechanistically Explains Long-Context Factuality (Wu et al., 2024). Concretely, we assign each attention head a retrieval score based on final answer generation. A higher score means the head more consistently focuses on the key evidence while producing the answer, i.e., it behaves more like a retrieval/evidence-tracking head. We visualize retrieval scores as heatmaps over layers and heads (two figures below):
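A simplified version of such a retrieval score can be sketched as follows: the fraction of answer-generation steps at which a head's top-1 attended context position falls inside the gold evidence span. The input shapes are hypothetical, and this is a coarser criterion than the token-copying test of Wu et al. (2024):

```python
import numpy as np

def retrieval_score(head_attn, evidence_span):
    """Simplified retrieval score for one attention head.

    head_attn: [num_answer_tokens, context_len] attention rows collected
    while the answer tokens are produced (hypothetical shapes).
    evidence_span: (start, end) token indices of the gold evidence.
    Returns the fraction of answer tokens whose top-1 attended context
    position lies inside [start, end).
    """
    start, end = evidence_span
    top1 = head_attn.argmax(axis=-1)
    return float(((top1 >= start) & (top1 < end)).mean())
```
Heads scoring high under this criterion behave like the evidence-tracking heads visualized in the heatmaps.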
*(Figures: retrieval-score heatmaps over layers and heads for (a) LLaDA and (b) Dream.)*
We evaluate ARMs (LLaMA, Qwen) and MDMs (Dream, LLaDA) on NQ / TQA / MarcoQA / HotpotQA / T-REx with a unified prompt template.
```shell
python evaluate/evaluate_*.py
```
Note: the evaluation scripts contain placeholder paths (e.g., `dataset_dir`, `model_path`, `output_dir`). Please edit them before running.
We observe that MDMs w/ RAG achieve an average improvement of over 19.5% compared to their corresponding baseline models, more than twice the gain observed for ARMs when augmented with retrieval (ARMs w/ RAG obtain an 8.5% improvement).
We pool token-level attention into coarse regions (BOS / Query / Docs / Answer) and apply rollout:
```shell
python visualization/attention_rollout.py
```
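A minimal sketch of region-pooled attention rollout, assuming head-averaged `[seq, seq]` attention matrices per layer and hypothetical region spans (the actual script's interface may differ). Rollout multiplies residual-corrected attention across layers; pooling then averages source rows and sums target columns within each region:

```python
import numpy as np

def rollout_region_flow(layer_attns, regions):
    """Attention rollout pooled into coarse regions (e.g., BOS/Query/Docs/Answer).

    layer_attns: list of [seq, seq] head-averaged attention matrices,
                 ordered shallow to deep.
    regions: dict name -> (start, end) token spans, e.g. {"BOS": (0, 1), ...}
             (hypothetical spans).
    """
    seq = layer_attns[0].shape[0]
    roll = np.eye(seq)
    for A in layer_attns:
        A = A + np.eye(seq)                      # account for residual connection
        A = A / A.sum(axis=-1, keepdims=True)    # re-normalize rows
        roll = A @ roll                          # compose flow across layers
    names = list(regions)
    pooled = np.zeros((len(names), len(names)))
    for i, src in enumerate(names):
        s0, s1 = regions[src]
        for j, tgt in enumerate(names):
            t0, t1 = regions[tgt]
            # sum mass flowing into the target region, averaged over source rows
            pooled[i, j] = roll[s0:s1, t0:t1].sum(axis=-1).mean()
    return names, pooled
```
Each row of the pooled matrix sums to 1, so entries read directly as "what fraction of a region's attention flow ends up in each other region".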
*(Figures: region-level attention flow for (a) LLaDA, gold doc at position 1; (b) LLaDA, gold doc at position 5; (c) LLaDA, gold doc at position 10; (d) LLaMA, gold doc at positions 1/5/10.)*
If you have questions, suggestions, or bug reports, please email:
daix1@mails.neu.edu.cn