
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation

This repository contains the official implementation of HermesFlow, including data, scripts, and model weights.

Introduction


HermesFlow is a general alignment framework for multimodal LLMs: it curates homologous preference data itself and uses self-play iterative optimization with Pair-DPO to seamlessly close the gap between multimodal understanding and generation.

New Updates

[2025.2] The checkpoint of HermesFlow is publicly available in the HuggingFace Repo.

[2025.2] The main code of HermesFlow is released.

Installation

git clone https://github.com/Gen-Verse/HermesFlow
cd HermesFlow
conda create -n HermesFlow python==3.8.10
conda activate HermesFlow
pip install -r requirements.txt

Curate Homologous Preference Data

We randomly select 5,000 image-caption pairs from JourneyDB as the homologous input data, and store the detailed information in datasets/journeydb/initial_data.json in the following format:

[
    {
        "id": 238,
        "img_path": "datasets/journeydb/initial_images/238.jpg",
        "prompt": "raccoon wearing a hat made of orange roses wallpaper pattern",
        "caption": "A raccoon wearing a hat made of an orange roses wallpaper pattern."
    },
]
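
Each entry pairs a JourneyDB image with its generation prompt and ground-truth caption. A minimal sketch of loading and iterating over this file (standard library only; the comment on the caption's later role is an inference from the preference-data fields shown further below):

import json

# Load the homologous input data: image-caption pairs sampled from JourneyDB.
with open("datasets/journeydb/initial_data.json") as f:
    initial_data = json.load(f)

for item in initial_data:
    # item["prompt"] seeds image generation; item["caption"] presumably serves
    # later as the reference when scoring the MLLM's own captions.
    print(item["id"], item["img_path"], item["prompt"])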

For the curation of understanding preference data:

python3 inference_mmu_caption.py config=configs/hermesflow_demo_512x512.yaml

The understanding results are saved at datasets/journeydb/understanding_caption_results.json.
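
The bert_score_win/lose fields in the final preference data (shown below) suggest the MLLM's candidate captions are ranked by BERTScore against the ground-truth caption. A minimal sketch of that ranking using the bert-score package (pip install bert-score); the exact scoring in the repo's scripts may differ:

from bert_score import score  # pip install bert-score

# Candidate captions the MLLM produced for one image, plus the ground-truth
# JourneyDB caption as the reference.
candidates = [
    "A raccoon wearing a hat and standing in front of a floral wallpaper.",
    "The image features a raccoon with an orange hat on, sitting on a table.",
]
reference = "A raccoon wearing a hat made of an orange roses wallpaper pattern."

# score() returns precision, recall, and F1 tensors; rank candidates by F1.
_, _, f1 = score(candidates, [reference] * len(candidates), lang="en")
ranked = sorted(zip(candidates, f1.tolist()), key=lambda p: p[1], reverse=True)
caption_win, caption_lose = ranked[0][0], ranked[-1][0]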

For the curation of generation preference data, first generate images from the input prompts:

python3 inference_t2i.py config=configs/hermesflow_demo_512x512.yaml batch_size=1 guidance_scale=5 generation_timesteps=50 mode='t2i'

We recommend using TIFA to generate complementary VQA data for a more comprehensive evaluation of the generated images:

python3 get_vqa_tifa.py

Then, use the MLLM itself to conduct VQA evaluation on the generated images:

python3 inference_mmu_vqa.py config=configs/hermesflow_demo_512x512.yaml

The generation results are saved at datasets/journeydb/generation_vqa_results.json.
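
The vqa_score values lie in [0, 1], consistent with scoring each generated image by the fraction of TIFA-style questions the MLLM answers correctly. A hypothetical sketch of that scoring rule (the argument structures are assumptions for illustration, not the repo's actual API):

def vqa_score(expected_answers, mllm_answers):
    # expected_answers: gold answers to the TIFA-style questions for one prompt.
    # mllm_answers: the MLLM's answers for one generated image.
    correct = sum(
        pred.strip().lower() == gold.strip().lower()
        for gold, pred in zip(expected_answers, mllm_answers)
    )
    return correct / len(expected_answers)

# e.g. answering 2 of 3 questions correctly gives 0.667, matching the
# vqa_score_win value in the sample entry below.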

Finally, get the homologous preference data for Pair-DPO using:

python3 datasets/journeydb/get_dpo_data.py

The final homologous preference data is saved at datasets/journeydb/pair_dpo_data.json in the following format:

[
    {
        "id": 238,
        "img_path": "datasets/journeydb/initial_images/238.jpg",
        "prompt": "raccoon wearing a hat made of orange roses wallpaper pattern",
        "caption": "A raccoon wearing a hat made of an orange roses wallpaper pattern.",
        "caption_win": " A raccoon wearing a hat and standing in front of a floral wallpaper.",
        "caption_lose": " The image features a raccoon with an orange hat on, sitting on a table in front of a vase with flowers.",
        "bert_score_win": 0.9526261687278748,
        "bert_score_lose": 0.5964741706848145,
        "image_win": "datasets/journeydb/generated_images/238/5.png",
        "image_lose": "datasets/journeydb/generated_images/238/0.png",
        "vqa_score_win": 0.667,
        "vqa_score_lose": 0.5
    },
]
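
Given these per-candidate scores, pair construction reduces to picking the best and worst caption by BERTScore and the best and worst image by VQA score. A minimal sketch of that selection rule (the input lists are hypothetical; get_dpo_data.py may differ in detail):

def build_pair(item, scored_captions, scored_images):
    # scored_captions: list of (caption, bert_f1) pairs for one sample;
    # scored_images: list of (image_path, vqa_score) pairs for the same sample.
    cap_win = max(scored_captions, key=lambda p: p[1])
    cap_lose = min(scored_captions, key=lambda p: p[1])
    img_win = max(scored_images, key=lambda p: p[1])
    img_lose = min(scored_images, key=lambda p: p[1])
    item.update(
        caption_win=cap_win[0], bert_score_win=cap_win[1],
        caption_lose=cap_lose[0], bert_score_lose=cap_lose[1],
        image_win=img_win[0], vqa_score_win=img_win[1],
        image_lose=img_lose[0], vqa_score_lose=img_lose[1],
    )
    return item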

Pair-DPO Training

Use Pair-DPO to optimize the base MLLM:

accelerate launch --config_file accelerate_configs/1_gpu.yaml --main_process_port=8888 training/train_pairdpo.py config=configs/hermesflow_pairdpo.yaml
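
As the pair_dpo_data format suggests, Pair-DPO applies a DPO objective to both modalities of each homologous sample: the caption pair for understanding and the image pair for generation. A minimal PyTorch sketch of that objective on sequence log-probabilities; the equal weighting of the two terms is an assumption, not necessarily the repo's exact loss:

import torch.nn.functional as F

def dpo_loss(pi_win, pi_lose, ref_win, ref_lose, beta=0.1):
    # Standard DPO on one preference pair, given summed log-probs of the
    # win/lose response under the policy (pi_*) and a frozen reference (ref_*).
    margin = beta * ((pi_win - ref_win) - (pi_lose - ref_lose))
    return -F.logsigmoid(margin).mean()

def pair_dpo_loss(und_logps, gen_logps, beta=0.1):
    # und_logps / gen_logps: (pi_win, pi_lose, ref_win, ref_lose) tensors for
    # the caption pair and the image-token pair of one homologous sample.
    # Equal weighting of the two modalities is assumed here.
    return 0.5 * dpo_loss(*und_logps, beta=beta) + 0.5 * dpo_loss(*gen_logps, beta=beta)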

Once trained, the checkpoint folder is structured as follows:

├── hermesflow-training-pairdpo_vqa_iteration1/ 
|   ├── ...
|   ├── checkpoint-3000
|   └── config.yaml

Iterative Optimization

First, follow the same steps as above to curate understanding and generation preference data. Then use this script to update the homologous preference data:

python3 datasets/journeydb/get_dpo_data_iterative.py

The updated homologous preference data is saved at datasets/journeydb/pair_dpo_data.json in the following format:

[
    {
        "id": 238,
        "img_path": "datasets/journeydb/initial_images/238.jpg",
        "prompt": "raccoon wearing a hat made of orange roses wallpaper pattern",
        "caption": "A raccoon wearing a hat made of an orange roses wallpaper pattern.",
        "caption_win": " A raccoon wearing a hat and standing next to a vase of flowers.",
        "caption_lose": " The image features a raccoon with an orange hat on, sitting on a table in front of a vase with flowers.",
        "bert_score_win": 0.8783621191978455,
        "bert_score_lose": 0.5964741706848145,
        "image_win": "datasets/journeydb/generated_images_iter2/238/2.png",
        "image_lose": "datasets/journeydb/generated_images/238/5.png",
        "vqa_score_win": 0.833,
        "vqa_score_lose": 0.667
    },
]
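
Note that image_lose above (datasets/journeydb/generated_images/238/5.png) is iteration 1's image_win: the iterative script appears to pit each round's best new candidate against the previous round's winner, so the pairs keep tightening as the model improves. A sketch of that assumed update rule for the image pair:

def update_pair(item, new_best_image, new_vqa_score):
    # Iteration k: compare the best new candidate against iteration k-1's
    # winner (assumed rule; see get_dpo_data_iterative.py for the real one).
    if new_vqa_score > item["vqa_score_win"]:
        item["image_lose"], item["vqa_score_lose"] = item["image_win"], item["vqa_score_win"]
        item["image_win"], item["vqa_score_win"] = new_best_image, new_vqa_score
    return item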

Finally, use the same training script to optimize the MLLM with Pair-DPO:

accelerate launch --config_file accelerate_configs/1_gpu.yaml --main_process_port=8888 training/train_pairdpo.py config=configs/hermesflow_pairdpo.yaml

Citation

@article{yang2025hermesflow,
  title={HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation},
  author={Yang, Ling and Zhang, Xinchen and Tian, Ye and Shang, Chenming and Xu, Minghao and Zhang, Wentao and Cui, Bin},
  journal={arXiv preprint arXiv:2502.12148},
  year={2025}
}

Acknowledgements

HermesFlow is a general alignment framework for multimodal LLMs built upon several solid works. Thanks to Show-o and TIFA for their wonderful work and codebases!
