Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection
Paper | Checkpoint | Data (w/o training images) | Training Images
This repo contains the official implementation of Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection.
📢 Update: This paper will be presented as a poster at ICCV 2025. See you in Honolulu!
- Install Sana following the instructions from the official repo. We use commit id `32f495b5b2a464ad3dbbccdc763ec276e881dc05` in our experiments. Future versions may also work, but are untested.
- Install the additional dependencies:
```shell
pip install -r requirements.txt
```
Please download the checkpoints from our Hugging Face repo. If you want to run evaluations on the GenEval benchmark, please also download the object detector using the official script. Additionally, please download the Sana 1.0 base model from Hugging Face.
After downloading all checkpoints, place them in the following structure:

```
<repo root>
└── ckpts
    ├── geneval-det                               // GenEval detector
    │   └── mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.pth
    ├── release_weights                           // our released weights
    │   ├── dit
    │   └── vlm
    └── Sana_1600M_1024px_MultiLing_diffusers     // official Sana weights
```
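The layout above can be created up front so each download has a place to land; a minimal sketch (directory names follow the tree above, the download commands themselves are omitted):

```shell
# Create the expected checkpoint layout before moving the downloads into place.
# Adjust the paths if your local layout differs.
mkdir -p ckpts/geneval-det
mkdir -p ckpts/release_weights/dit
mkdir -p ckpts/release_weights/vlm
mkdir -p ckpts/Sana_1600M_1024px_MultiLing_diffusers
```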
Please download our evaluation data from this repo.
Additionally, to reproduce the training runs, please download the images from this separate repo.
After downloading all the data, organize it in the following structure:
```
<repo root>
└── data
    ├── custom
    │   └── example.csv
    ├── dpg
    │   └── ...                                   // DPG-Bench prompts, plus the subsets used in Appendix Table 6
    ├── dit
    │   ├── gen_eval_sana_train                   // created by untarring the training images
    │   └── object_self_correct_cleaned.csv
    ├── vlm
    │   └── vllm.json
    ├── geneval
    │   ├── evaluation_metadata.jsonl
    │   └── object_names.txt
    ├── Sana_1600M_1024px_MultiLing_diffusers     // official Sana weights
    └── gen_eval_sana-join-4                      // created by untarring gen_eval_sana-join-4.tar;
                                                  // contains our generated images to reproduce paper results
```
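Before running anything, it may help to sanity-check that the layout is in place. A minimal sketch (the paths follow the tree above; `check_layout` is a hypothetical helper, not part of this repo):

```python
import os

# Paths expected by the inference/training scripts, per the tree above.
REQUIRED = [
    "data/custom/example.csv",
    "data/dpg",
    "data/dit",
    "data/vlm/vllm.json",
    "data/geneval/evaluation_metadata.jsonl",
    "data/geneval/object_names.txt",
]

def check_layout(root="."):
    """Return the list of expected paths missing under `root`."""
    return [p for p in REQUIRED if not os.path.exists(os.path.join(root, p))]

missing = check_layout()
if missing:
    print("Missing:", *missing, sep="\n  ")
else:
    print("Data layout looks good.")
```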
If you have downloaded the training data, there is nothing more to do in this step, as the data folder is already set up properly. If you just want to run inference, please create a data folder with the following structure:
```
<repo root>
└── data
    └── custom
        └── example.csv
```
where example.csv can be found in the custom_data folder of this repo.
Run the command:

```shell
bash inference_scripts/eval_custom.sh
```
You can add custom prompts to example.csv.
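The column schema of example.csv is defined by the file shipped with this repo, so rather than guess the column names here, a small sketch to inspect the header before appending rows (`csv_header` is a hypothetical helper; new rows must match the columns it reports):

```python
import csv
import os

def csv_header(path):
    """Return the header row of a CSV file as a list of column names."""
    with open(path, newline="") as f:
        return next(csv.reader(f))

# Inspect the schema of the provided example file before adding prompt rows.
path = "data/custom/example.csv"
if os.path.exists(path):
    print(csv_header(path))
```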
First, run one of the following commands to generate images:

```shell
bash inference_scripts/eval_geneval.sh
```

or

```shell
bash inference_scripts/eval_dpg.sh
```
After that, you can run `python inference_scripts/analysis.py path/to/output` to get the metrics.
As an example, to reproduce the main results of the paper, please run the following command after downloading the evaluation data:

```shell
python inference_scripts/analysis.py data/gen_eval_sana-join-4
```
Expected output:

```
data/gen_eval_sana-join-4
+------------+--------------------+----------+----------+---------------+--------------------+--------------------+
| color_attr | colors             | counting | position | single_object | two_object         | overall            |
+------------+--------------------+----------+----------+---------------+--------------------+--------------------+
| 0.6        | 0.8829787234042553 | 0.8      | 0.66     | 0.975         | 0.9595959595959596 | 0.8065099457504521 |
+------------+--------------------+----------+----------+---------------+--------------------+--------------------+
+--------------------+--------------------+--------------------+--------------------+--------------------+
| N=4                | N=8                | N=12               | N=16               | N=20               |
+--------------------+--------------------+--------------------+--------------------+--------------------+
| 0.7757685352622061 | 0.7920433996383364 | 0.7992766726943942 | 0.7992766726943942 | 0.8065099457504521 |
+--------------------+--------------------+--------------------+--------------------+--------------------+
```
Assuming the data and environment are set up properly as described above, you can reproduce the training by running:

```shell
bash train_scripts/train_reflect_dit.sh
```
We provide the training data in data/vlm in LLaVA format. The training images are the same as for the DiT and are placed at data/dit/gen_eval_sana_train.
You can use any codebase that supports this data format to train the VLM, such as this repo.
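For reference, a single record in the widely used LLaVA conversation format looks roughly like the sketch below. The exact fields in data/vlm/vllm.json may differ, so check that file for the actual schema; the image path and prompt text here are made up:

```python
import json

# A hypothetical record in LLaVA-style conversation format.
# Field names follow the common LLaVA convention ("id", "image",
# "conversations" with alternating human/gpt turns); the values are invented.
record = {
    "id": "example-0",
    "image": "gen_eval_sana_train/example-0.jpg",  # made-up path
    "conversations": [
        {"from": "human", "value": "<image>\nDoes this image match the prompt?"},
        {"from": "gpt", "value": "No, the second object is missing."},
    ],
}

print(json.dumps(record, indent=2))
```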
- I see a `libpng error` for GenEval.
  Sometimes mmcv, which is used in the GenEval evaluation, does not load PNG files properly. To fix this, add `--fmt jpg` to your inference script.
- Where should I place MPLUG, which is used in the DPG-Bench eval?
  It is downloaded automatically by the modelscope package, so we do not list it explicitly here. Please install modelscope from the official repo for the DPG-Bench experiments.
- Where are the Hard subsets of DPG-Bench mentioned in the paper?
  They are in data/dpg/subsets.
```bibtex
@article{li2025reflect,
  title={Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection},
  author={Li, Shufan and Kallidromitis, Konstantinos and Gokul, Akash and Koneru, Arsh and Kato, Yusuke and Kozuka, Kazuki and Grover, Aditya},
  journal={arXiv preprint arXiv:2503.12271},
  year={2025}
}
```
