When Numbers Speak: Aligning Textual Numerals and
Visual Instances in Text-to-Video Diffusion Models
Zhengyang Sun1*, Yu Chen1*, Xin Zhou1,3, Xiaofan Li2, Xiwu Chen3†, Dingkang Liang1† and Xiang Bai1✉
1 Huazhong University of Science and Technology, 2 Zhejiang University, 3 Afari Intelligent Drive
(*) equal contribution, (†) project lead, (✉) corresponding author.
NUMINA is a training-free framework that tackles numerical misalignment in text-to-video diffusion models: the persistent failure of T2V models to generate the number of objects a prompt specifies (e.g., producing 2 or 4 cats when "three cats" is requested). Unlike seed search or prompt enhancement, which treat the generation pipeline as a black box and rely on brute-force resampling or LLM-based prompt rewriting, NUMINA identifies where and why counting errors occur inside the model by analyzing cross-attention and self-attention maps at selected DiT layers. It constructs a countable spatial layout via a two-stage clustering pipeline, then performs layout-guided attention modulation during regeneration to enforce the correct object count, all without retraining or fine-tuning. This attention-level intervention provides principled, interpretable control over numerical semantics that seed search and prompt enhancement fundamentally cannot achieve, improving counting accuracy on our introduced CountBench by up to 7.4% with Wan2.1-1.3B. Furthermore, because NUMINA operates largely orthogonally to inference acceleration techniques, it is compatible with training-free caching methods such as EasyCache, which accelerates diffusion inference via runtime-adaptive transformer output reuse.
| Wan2.1-1.3B | Ours | Wan2.1-1.3B | Ours |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
| Prompt 1: Two kittens playing with two yarn balls. | | Prompt 2: Five explorers travelling through a dense jungle. | |
| Ours | Sora2 | Veo3.1 | Grok Imagine |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
| Prompt 3: Three cyclists ride through a trail with three mountain goats. | | | |
NUMINA is implemented as a lightweight add-on to Wan2.1. You can set up the environment and integrate the modules by running the following commands:
```bash
# Clone the Wan2.1 repo
git clone https://github.com/Wan-Video/Wan2.1.git
cd Wan2.1

# From within the Wan2.1 root directory, clone NUMINA
git clone https://github.com/H-EmbodVis/NUMINA.git numina_repo

# Copy NUMINA modules
cp -r numina_repo/numina ./numina

# Apply modifications to Wan2.1 files
cp numina_repo/wan/modules/attention.py ./wan/modules/attention.py
cp numina_repo/wan/modules/model.py ./wan/modules/model.py
cp numina_repo/wan/text2video.py ./wan/text2video.py
cp numina_repo/generate.py ./generate.py

# Install dependencies
pip install -r numina_repo/requirements.txt
```

Please follow the Wan2.1 README for model checkpoint downloads and any platform-specific setup (e.g., FlashAttention).
```
Wan2.1/
├── ...
├── numina/                    # NUMINA modules (new)
│   ├── __init__.py
│   ├── config.py              # All hyperparameters
│   ├── token_mapper.py        # Nouns → T5 token index mapping
│   ├── head_selection.py      # Attention head scoring
│   ├── layout.py              # MeanShift + DBSCAN layout pipeline
│   └── modulation.py          # Cross-attention bias for SDPA
├── wan/
│   ├── ...
│   ├── modules/
│   │   ├── ...
│   │   ├── attention.py       # Modified: extraction + modulation paths
│   │   └── model.py           # Modified: NUMINA state propagation
│   └── text2video.py          # Modified: two-phase pipeline
└── generate.py                # Modified: --numina CLI arguments
```
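As a rough illustration of what the `layout.py` pipeline produces, here is a self-contained NumPy sketch. It is a simplification under our own assumptions, not the repo's actual MeanShift + DBSCAN implementation: it thresholds a 2D cross-attention map (a crude stand-in for mode seeking), then density-groups the surviving pixels with a DBSCAN-style union-find, returning one centroid per detected object instance. The function name and thresholds are hypothetical.

```python
import numpy as np

def countable_layout(attn_map, thresh=0.5, eps=1.5, min_pts=3):
    """Toy two-stage grouping over a 2D cross-attention map.

    Stage 1 (mode-seeking stand-in): keep pixels whose attention
    exceeds `thresh` of the map maximum.
    Stage 2 (DBSCAN-style density grouping): union-find over the
    surviving pixels, merging any pair within `eps` distance;
    clusters smaller than `min_pts` are discarded as noise.
    Returns one (row, col) centroid per detected instance.
    """
    ys, xs = np.nonzero(attn_map >= thresh * attn_map.max())
    pts = np.stack([ys, xs], axis=1).astype(float)
    parent = list(range(len(pts)))

    def find(i):
        # Path-halving union-find root lookup.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            if np.linalg.norm(pts[i] - pts[j]) <= eps:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(pts)):
        clusters.setdefault(find(i), []).append(pts[i])
    return [np.mean(c, axis=0) for c in clusters.values() if len(c) >= min_pts]
```

The number of returned centroids is the "count" the layout exposes; comparing it to the prompt's target count is what makes the layout countable.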
Modified Wan2.1 files (4 files):

- `wan/modules/attention.py`: added manual attention for extraction + SDPA for modulation
- `wan/modules/model.py`: added NUMINA state management and routing
- `wan/text2video.py`: added `generate_numina()` two-phase pipeline with EasyCache integrated
- `generate.py`: added NUMINA CLI arguments
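For intuition on the modulation path, the following PyTorch sketch shows one way an additive bias on the SDPA logits can steer image-token queries toward (or away from) a noun's text token. The function, shapes, and `strength` parameter are illustrative assumptions, not the repo's `modulation.py` API; only `F.scaled_dot_product_attention` and its `attn_mask` argument are real PyTorch.

```python
import torch
import torch.nn.functional as F

def modulated_attention(q, k, v, region_mask, token_idx, strength=2.0):
    """Hypothetical layout-guided cross-attention modulation.

    q: (L_img, d) image-token queries; k, v: (L_txt, d) text tokens.
    region_mask: (L_img,) bool, True for latents inside the target region.
    An additive bias boosts attention to `token_idx` inside the region
    and suppresses it outside, so the object materializes where the
    layout says it should.
    """
    bias = torch.zeros(q.shape[0], k.shape[0])
    bias[region_mask, token_idx] = strength    # boost inside the region
    bias[~region_mask, token_idx] = -strength  # suppress outside it
    out = F.scaled_dot_product_attention(
        q.unsqueeze(0), k.unsqueeze(0), v.unsqueeze(0),
        attn_mask=bias.unsqueeze(0))
    return out[0]
```

Because the bias is passed as `attn_mask`, the modulation stays inside the fused SDPA kernel rather than requiring materialized attention maps.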
```bash
python generate.py \
  --task t2v-1.3B \
  --ckpt_dir /path/to/Wan2.1-T2V-1.3B \
  --prompt "Three men are walking in the park." \
  --numina \
  --numina_noun_counts '{"men": 3}' \
  --size 832*480
```

| Argument | Default | Description |
|---|---|---|
| `--numina` | `False` | Enable NUMINA numerical alignment |
| `--numina_noun_counts` | (required) | JSON dict, e.g. `'{"cats": 3, "dogs": 2}'` |
| `--numina_reference_step` | `20` | Denoising steps for pre-generation |
| `--numina_reference_layer` | `15` | DiT layer for attention extraction |
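`--numina_noun_counts` takes a JSON object mapping each countable noun to its target count. When driving `generate.py` from a script, building that argument with `json.dumps` avoids shell-quoting mistakes; the checkpoint path below is a placeholder, and the invocation mirrors the CLI example rather than any additional repo API.

```python
import json
import shlex

# Target counts for each countable noun in the prompt.
noun_counts = {"cyclists": 3, "goats": 3}

cmd = [
    "python", "generate.py",
    "--task", "t2v-1.3B",
    "--ckpt_dir", "/path/to/Wan2.1-T2V-1.3B",  # placeholder path
    "--prompt", "Three cyclists ride through a trail with three mountain goats.",
    "--numina",
    "--numina_noun_counts", json.dumps(noun_counts),
    "--size", "832*480",
]
# shlex.join produces a copy-paste-safe shell command.
print(shlex.join(cmd))
```

Passing the list form of `cmd` directly to `subprocess.run` also sidesteps shell quoting entirely.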
> For all other arguments (`--offload_model`, `--t5_cpu`, `--sample_guide_scale`, `--base_seed`, etc.), please refer to the Wan2.1 documentation.
This project is built on top of Wan2.1 by the Alibaba Wan Team.
Phase 1 pre-generation acceleration uses the EasyCache runtime-adaptive caching strategy.
If you find this repository useful in your research, please consider giving us a star ⭐ and a citation.
```bibtex
@inproceedings{sun2026numina,
  title={When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models},
  author={Sun, Zhengyang and Chen, Yu and Zhou, Xin and Li, Xiaofan and Chen, Xiwu and Liang, Dingkang and Bai, Xiang},
  booktitle={Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition},
  year={2026}
}
```