This repository contains the official implementation (including data, scripts and model weights) of HermesFlow.
HermesFlow is a general alignment framework for multimodal LLMs: it curates homologous preference data itself and uses self-play iterative optimization with Pair-DPO to seamlessly close the gap between multimodal understanding and generation.
[2025.2] The HermesFlow checkpoint is publicly available in our HuggingFace repo.
[2025.2] The main code of HermesFlow is released.
git clone https://github.com/Gen-Verse/HermesFlow
cd HermesFlow
conda create -n HermesFlow python==3.8.10
conda activate HermesFlow
pip install -r requirements.txt
We randomly select 5,000 image-caption pairs from JourneyDB as the homologous input data and store the detailed information in datasets/journeydb/initial_data.json in the following format:
[
    {
        "id": 238,
        "img_path": "datasets/journeydb/initial_images/238.jpg",
        "prompt": "raccoon wearing a hat made of orange roses wallpaper pattern",
        "caption": "A raccoon wearing a hat made of an orange roses wallpaper pattern."
    },
    ...
]
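For reference, a minimal sketch of loading this file (standard library only; paths are relative to the repository root):

import json

# Load the 5,000 homologous image-caption pairs.
with open("datasets/journeydb/initial_data.json") as f:
    initial_data = json.load(f)

for item in initial_data:
    # Each entry pairs a source image with its T2I prompt and ground-truth caption.
    print(item["id"], item["img_path"], item["prompt"])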
For the curation of understanding preference data:
python3 inference_mmu_caption.py config=configs/hermesflow_demo_512x512.yaml
The understanding results are saved to datasets/journeydb/understanding_caption_results.json.
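The bert_score_win / bert_score_lose fields in the preference data below suggest that candidate captions are ranked by their BERTScore similarity to the ground-truth caption. A minimal sketch of such a ranking with the bert-score package (illustrative, not the exact logic of the repository scripts):

# Rank MLLM-generated captions by BERTScore F1 against the GT caption.
# Illustrative sketch; see datasets/journeydb/get_dpo_data.py for the
# actual selection logic.
from bert_score import score

gt_caption = "A raccoon wearing a hat made of an orange roses wallpaper pattern."
candidates = [
    "A raccoon wearing a hat and standing in front of a floral wallpaper.",
    "The image features a raccoon with an orange hat on, sitting on a table in front of a vase with flowers.",
]

_, _, f1 = score(candidates, [gt_caption] * len(candidates), lang="en")
caption_win = candidates[f1.argmax().item()]
caption_lose = candidates[f1.argmin().item()]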
For the curation of generation preference data, first generate images from the input prompts:
python3 inference_t2i.py config=configs/hermesflow_demo_512x512.yaml batch_size=1 guidance_scale=5 generation_timesteps=50 mode='t2i'
We recommend using TIFA to generate complementary VQA data for a more comprehensive evaluation of the generated images:
python3 get_vqa_tifa.py
Then, use the MLLM itself to conduct VQA evaluation on these generated images:
python3 inference_mmu_vqa.py config=configs/hermesflow_demo_512x512.yaml
The generation results are saved to datasets/journeydb/generation_vqa_results.json.
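Each image's VQA score is, presumably, the fraction of TIFA questions the MLLM answers correctly (a score of 0.667 in the data below would correspond to, e.g., 2 of 3 questions). A minimal sketch under that assumption, with hypothetical field names:

from typing import Dict, List

def vqa_score(qa_results: List[Dict]) -> float:
    # One dict per question; "answer" (ground truth) and "prediction"
    # are hypothetical keys, not the repository's exact schema.
    correct = sum(r["prediction"] == r["answer"] for r in qa_results)
    return round(correct / len(qa_results), 3)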
Finally, get the homologous preference data for Pair-DPO using:
python3 datasets/journeydb/get_dpo_data.py
The final homologous preference data is saved to datasets/journeydb/pair_dpo_data.json in the following format:
[
    {
        "id": 238,
        "img_path": "datasets/journeydb/initial_images/238.jpg",
        "prompt": "raccoon wearing a hat made of orange roses wallpaper pattern",
        "caption": "A raccoon wearing a hat made of an orange roses wallpaper pattern.",
        "caption_win": " A raccoon wearing a hat and standing in front of a floral wallpaper.",
        "caption_lose": " The image features a raccoon with an orange hat on, sitting on a table in front of a vase with flowers.",
        "bert_score_win": 0.9526261687278748,
        "bert_score_lose": 0.5964741706848145,
        "image_win": "datasets/journeydb/generated_images/238/5.png",
        "image_lose": "datasets/journeydb/generated_images/238/0.png",
        "vqa_score_win": 0.667,
        "vqa_score_lose": 0.5
    },
    ...
]
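Conceptually, the win/lose pairs above are formed by keeping the best- and worst-scoring candidates on each side (BERTScore for captions, VQA score for images). A sketch of that selection (illustrative; the actual logic lives in datasets/journeydb/get_dpo_data.py):

def pick_pair(scored_candidates):
    # scored_candidates: list of (item, score) tuples, where item is a
    # caption or a generated-image path and score is its BERTScore F1
    # or VQA score. Returns the highest scorer as winner, lowest as loser.
    ranked = sorted(scored_candidates, key=lambda c: c[1], reverse=True)
    return ranked[0], ranked[-1]

For example, applying pick_pair to the scored candidate captions yields (caption_win, bert_score_win) and (caption_lose, bert_score_lose).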
Use Pair-DPO to optimize the base MLLM:
accelerate launch --config_file accelerate_configs/1_gpu.yaml --main_process_port=8888 training/train_pairdpo.py config=configs/hermesflow_pairdpo.yaml
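For intuition, Pair-DPO applies a DPO-style objective to both sides of each homologous sample: the caption pair (understanding) and the image pair (generation). A conceptual sketch of the standard DPO loss at its core (not the actual code in training/train_pairdpo.py; beta=0.1 is an assumed default):

import torch.nn.functional as F

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    # logp_* are the policy's log-likelihoods of the preferred/rejected
    # sample; ref_logp_* come from the frozen reference model.
    margin = (logp_win - logp_lose) - (ref_logp_win - ref_logp_lose)
    return -F.logsigmoid(beta * margin).mean()

Roughly speaking, this loss is computed on the caption pair and the image pair of the same sample, so understanding and generation are optimized together.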
Once trained, the checkpoint folder is structured as follows:
├── hermesflow-training-pairdpo_vqa_iteration1/
│   ├── ...
│   ├── checkpoint-3000
│   └── config.yaml
First, follow the same steps as above to curate the understanding and generation preference data. Then use this script to update the homologous preference data:
python3 datasets/journeydb/get_dpo_data_iterative.py
The updated homologous preference data is saved to datasets/journeydb/pair_dpo_data.json in the following format:
[
    {
        "id": 238,
        "img_path": "datasets/journeydb/initial_images/238.jpg",
        "prompt": "raccoon wearing a hat made of orange roses wallpaper pattern",
        "caption": "A raccoon wearing a hat made of an orange roses wallpaper pattern.",
        "caption_win": " A raccoon wearing a hat and standing next to a vase of flowers.",
        "caption_lose": " The image features a raccoon with an orange hat on, sitting on a table in front of a vase with flowers.",
        "bert_score_win": 0.8783621191978455,
        "bert_score_lose": 0.5964741706848145,
        "image_win": "datasets/journeydb/generated_images_iter2/238/2.png",
        "image_lose": "datasets/journeydb/generated_images/238/5.png",
        "vqa_score_win": 0.833,
        "vqa_score_lose": 0.667
    },
    ...
]
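Note that in the example above, iteration 2's image_lose (generated_images/238/5.png) is iteration 1's image_win: on the generation side, newly generated samples are scored against the previous round's winners. One plausible form of that update (illustrative; see datasets/journeydb/get_dpo_data_iterative.py for the actual logic):

def update_pair(prev_win, prev_score, new_item, new_score):
    # Keep the higher-scoring sample as the new winner; the other
    # becomes the loser. Illustrative sketch only.
    if new_score > prev_score:
        return (new_item, new_score), (prev_win, prev_score)
    return (prev_win, prev_score), (new_item, new_score)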
Finally, use the same training script to optimize the MLLM with Pair-DPO:
accelerate launch --config_file accelerate_configs/1_gpu.yaml --main_process_port=8888 training/train_pairdpo.py config=configs/hermesflow_pairdpo.yaml
@article{yang2025hermesflow,
  title={HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation},
  author={Yang, Ling and Zhang, Xinchen and Tian, Ye and Shang, Chenming and Xu, Minghao and Zhang, Wentao and Cui, Bin},
  journal={arXiv preprint arXiv:2502.12148},
  year={2025}
}
Our HermesFlow is a general alignment framework for multimodal LLMs, built upon several solid works. Thanks to Show-o and TIFA for their wonderful work and codebase!