
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models (CVPR 2026)

Zhengyang Sun1*, Yu Chen1*, Xin Zhou1,3, Xiaofan Li2, Xiwu Chen3†, Dingkang Liang1† and Xiang Bai1✉

1 Huazhong University of Science and Technology, 2 Zhejiang University, 3 Afari Intelligent Drive

(*) equal contribution, (†) project lead, (✉) corresponding author.


TL;DR

NUMINA is a training-free framework that tackles numerical misalignment in text-to-video diffusion models: the persistent failure of T2V models to generate the number of objects a prompt specifies (e.g., producing 2 or 4 cats when "three cats" is requested). Unlike seed-search or prompt-enhancement approaches, which treat the generation pipeline as a black box and rely on brute-force resampling or LLM-based prompt rewriting, NUMINA identifies where and why counting errors occur inside the model by analyzing cross-attention and self-attention maps at selected DiT layers. It constructs a countable spatial layout via a two-stage clustering pipeline, then performs layout-guided attention modulation during regeneration to enforce the correct object count, all without retraining or fine-tuning. This attention-level intervention offers principled, interpretable control over numerical semantics that seed search and prompt enhancement fundamentally cannot achieve, and improves counting accuracy by up to 7.4% on Wan2.1-1.3B on our introduced CountBench. Furthermore, because NUMINA operates largely orthogonally to inference acceleration, it is compatible with training-free caching methods such as EasyCache, which accelerates diffusion inference via runtime-adaptive transformer output reuse.
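The "countable spatial layout" idea can be sketched in a few lines. The following is an illustrative reconstruction, not the repository's implementation: we assume cross-attention scores have already been thresholded into 2D peak coordinates, and use scikit-learn's MeanShift for coarse mode finding followed by DBSCAN for density-based refinement; the bandwidth and eps values are hypothetical.

```python
import numpy as np
from sklearn.cluster import MeanShift, DBSCAN

def countable_layout(points, bandwidth=1.5, eps=1.0, min_samples=5):
    """Two-stage clustering sketch: MeanShift finds coarse modes over
    attention-peak coordinates, DBSCAN refines them into instance clusters."""
    # Stage 1: coarse mode finding.
    ms = MeanShift(bandwidth=bandwidth).fit(points)
    coarse_centers = ms.cluster_centers_

    # Stage 2: density-based refinement; noise points get label -1.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(points).labels_
    n_instances = len(set(labels) - {-1})

    # Per-instance centroids define the spatial layout.
    centroids = np.array([points[labels == k].mean(axis=0)
                          for k in range(n_instances)])
    return n_instances, centroids, coarse_centers

# Toy example: synthetic "attention peaks" around three object locations.
rng = np.random.default_rng(0)
centers = np.array([[2.0, 2.0], [8.0, 2.0], [5.0, 8.0]])
pts = np.vstack([c + 0.3 * rng.standard_normal((40, 2)) for c in centers])
n, centroids, _ = countable_layout(pts)   # n counts the recovered instances
```

The recovered count `n` is what gets compared against the textual numeral; its centroids give the layout used to guide regeneration.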


Overview



Demo

Demo videos comparing NUMINA against baselines (side-by-side in the original page):

  • Prompt 1: "Two kittens playing with two yarn balls." (Wan2.1-1.3B baseline vs. ours)
  • Prompt 2: "Five explorers travelling through a dense jungle." (Wan2.1-1.3B baseline vs. ours)
  • Prompt 3: "Three cyclists ride through a trail with three mountain goats." (ours vs. Sora2, Veo3.1, and Grok Imagine)

Installation

NUMINA is implemented as a lightweight add-on to Wan2.1. You can set up the environment and integrate the modules by running the following commands:

# Clone Wan2.1 repo
git clone https://github.com/Wan-Video/Wan2.1.git
cd Wan2.1

# From within the Wan2.1 root directory
git clone https://github.com/H-EmbodVis/NUMINA.git numina_repo
# Copy NUMINA modules
cp -r numina_repo/numina  ./numina
# Apply modifications to Wan2.1 files
cp numina_repo/wan/modules/attention.py  ./wan/modules/attention.py
cp numina_repo/wan/modules/model.py      ./wan/modules/model.py
cp numina_repo/wan/text2video.py         ./wan/text2video.py
cp numina_repo/generate.py               ./generate.py

# Install dependencies
pip install -r numina_repo/requirements.txt

Please follow the Wan2.1 README for model checkpoint downloads and any platform-specific setup (e.g., FlashAttention).


Project Structure

Wan2.1/
├── ...
├── numina/                          # NUMINA modules (new)
│   ├── __init__.py
│   ├── config.py                    # All hyperparameters
│   ├── token_mapper.py              # Nouns → T5 token index mapping
│   ├── head_selection.py            # attention head scoring
│   ├── layout.py                    # MeanShift + DBSCAN layout pipeline
│   └── modulation.py                # Cross-attention bias for SDPA
├── wan/
│   ├── ...
│   ├── modules/
│   │   ├── ...
│   │   ├── attention.py             # Modified: extraction + modulation paths
│   │   └── model.py                 # Modified: NUMINA state propagation
│   └── text2video.py                # Modified: two-phase pipeline
└── generate.py                      # Modified: --numina CLI arguments
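As a rough illustration of what a head-selection module like numina/head_selection.py might compute (an assumption on our part; the repository does not document its exact criterion), a common heuristic is to prefer heads whose cross-attention to the counted noun is spatially peaked, i.e. has low spatial entropy:

```python
import numpy as np

def head_scores(cross_attn, noun_token_idx):
    """Hypothetical head-scoring sketch: rank attention heads by how
    spatially concentrated their attention to the noun token is.

    cross_attn: (heads, spatial_tokens, text_tokens) attention weights.
    Returns one score per head; lower spatial entropy -> higher score."""
    # Attention mass each spatial location gives to the noun token.
    maps = cross_attn[:, :, noun_token_idx]               # (heads, spatial)
    maps = maps / maps.sum(axis=1, keepdims=True)         # normalize per head
    entropy = -(maps * np.log(maps + 1e-12)).sum(axis=1)  # spatial entropy
    return -entropy                                       # peaked heads score high

# Toy example: head 0 attends sharply to one location, head 1 is diffuse.
peaked = np.full(64, 1e-3); peaked[10] = 1.0
diffuse = np.full(64, 1.0 / 64)
attn = np.stack([peaked, diffuse])[:, :, None]            # (2, 64, 1)
scores = head_scores(attn, noun_token_idx=0)
best_head = int(np.argmax(scores))
```

Under this criterion the peaked head wins, since diffuse heads carry little localizable instance information.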

Modified Wan2.1 files (4 files):

  • wan/modules/attention.py — Added manual attention for extraction + SDPA for modulation
  • wan/modules/model.py — Added NUMINA state management and routing
  • wan/text2video.py — Added generate_numina() two-phase pipeline with EasyCache integrated
  • generate.py — Added NUMINA CLI arguments
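The modulation path can be illustrated with a minimal attention-bias sketch (a generic reconstruction, not the repository's code): an additive bias on the attention logits strengthens the noun token's attention inside the target layout regions and suppresses it elsewhere. This is the same role the additive `attn_mask` plays when passed to PyTorch's scaled_dot_product_attention; here we use NumPy for self-containment, and the bias magnitude is an arbitrary illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def biased_cross_attention(q, k, v, bias):
    """Scaled dot-product attention with an additive logit bias
    (the role SDPA's attn_mask plays in the modulation path)."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + bias   # bias: (spatial, text_tokens)
    w = softmax(logits, axis=-1)
    return w @ v, w

rng = np.random.default_rng(0)
spatial, text, d = 6, 4, 8
q = rng.standard_normal((spatial, d))
k = rng.standard_normal((text, d))
v = rng.standard_normal((text, d))

noun = 2                       # index of the counted noun's text token
inside = np.array([0, 1, 2])   # spatial positions inside the target layout

bias = np.zeros((spatial, text))
bias[inside, noun] += 4.0                   # boost the noun inside the layout
outside = np.ones(spatial, bool); outside[inside] = False
bias[outside, noun] -= 4.0                  # suppress it everywhere else

out, weights = biased_cross_attention(q, k, v, bias)
```

Because softmax is monotone in each logit, the boosted positions always gain attention mass on the noun token relative to the unbiased pass, which is what lets a layout steer where instances appear.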

Quick Start

Basic usage

python generate.py \
  --task t2v-1.3B \
  --ckpt_dir /path/to/Wan2.1-T2V-1.3B \
  --prompt "Three men are walking in the park." \
  --numina \
  --numina_noun_counts '{"men": 3}' \
  --size 832*480

NUMINA-specific arguments

Argument                   Default      Description
--numina                   False        Enable NUMINA numerical alignment
--numina_noun_counts       (required)   JSON dict, e.g. '{"cats": 3, "dogs": 2}'
--numina_reference_step    20           Denoising steps for the pre-generation pass
--numina_reference_layer   15           DiT layer used for attention extraction

For all other arguments (--offload_model, --t5_cpu, --sample_guide_scale, --base_seed, etc.), please refer to the Wan2.1 documentation.
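The noun-counts argument is a plain JSON object mapping each noun to its desired count. A minimal sketch of how it might be parsed and validated (our illustration, not the repository's generate.py code):

```python
import argparse
import json

def parse_noun_counts(raw):
    """Parse and validate a --numina_noun_counts JSON string (illustrative)."""
    counts = json.loads(raw)
    if not isinstance(counts, dict):
        raise ValueError("--numina_noun_counts must be a JSON object")
    for noun, n in counts.items():
        if not isinstance(n, int) or isinstance(n, bool) or n < 1:
            raise ValueError(f"count for {noun!r} must be a positive integer")
    return counts

parser = argparse.ArgumentParser()
parser.add_argument("--numina", action="store_true")
parser.add_argument("--numina_noun_counts", type=parse_noun_counts)
args = parser.parse_args(["--numina", "--numina_noun_counts",
                          '{"cats": 3, "dogs": 2}'])
```

Note the single quotes around the JSON in the shell example above: they keep the inner double quotes intact when the string reaches the parser.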


Acknowledgements

This project is built on top of Wan2.1 by the Alibaba Wan Team.
Phase 1 pre-generation acceleration uses the EasyCache runtime-adaptive caching strategy.


Citation

If you find this repository useful in your research, please consider giving us a star ⭐ and a citation.

@inproceedings{sun2026numina,
      title={When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models},
      author={Sun, Zhengyang and Chen, Yu and Zhou, Xin and Li, Xiaofan and Chen, Xiwu and Liang, Dingkang and Bai, Xiang},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
      year={2026}
}
