Zhenxin Lei, Zhangwei Gao, Changyao Tian, Erfei Cui, Guanzhou Chen, Danni Yang, Yuchen Duan, Zhaokai Wang, Wenhao Li, Weiyun Wang, Xiangyu Zhao, Jiayi Ji, Yu Qiao, Wenhai Wang, Gen Luo
- [2025/10/17] MetaCaptioner-8B is released!
- [2025/10/16] The paper and project page are released!
Generalist visual captioning goes beyond simple appearance description: it requires integrating a variety of visual cues into a single caption and handling diverse visual domains.
On this task, current open-source models show a large performance gap relative to commercial ones, which limits applications such as data synthesis. To bridge this gap, we propose CapFlow, a novel multi-agent collaboration workflow. CapFlow demonstrates for the first time that, building on open-source models alone, it is possible to match GPT-4.1's caption quality across various domains at an 89.5% reduction in cost. Using CapFlow as the data synthesizer, we produce high-quality image and video captions at scale and fine-tune a generalist visual captioner on them, named MetaCaptioner.
Through extensive experiments, we show that MetaCaptioner not only achieves captioning capabilities comparable to commercial models but also reaches top-tier multimodal performance in the open-source community. We hope CapFlow and MetaCaptioner can benefit future multimodal research by providing a strong and cost-effective visual captioning solution.
| Model | MMMU | MMVet | MathVerse | MathVista | ChartQA | InfoVQA | AI2D | VideoMME | Cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4.1 | 55.7 | 61.7 | 56.8 | 65.0 | 62.3 | 63.2 | 75.5 | 26.8 | $1.47 |
| CapFlow with Qwen2.5-VL-72B | 55.1 | 57.8 | 53.1 | 62.5 | 59.2 | 50.2 | 74.2 | 27.6 | $0.14 |
| Captioner | LLM | MMB-Video | VideoMME | MathVista | MathVerse | MathVision | SEED2-Plus | InfoVQA | MMStar | MMMU | MMB | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2-VL-7B | DS-Qwen-7B | 0.57 | 20.8 | 47.7 | 40.5 | 31.6 | 56.6 | 46.0 | 43.8 | 42.4 | 54.7 | 37.4 |
| InternVL3.5-8B | DS-Qwen-7B | 0.88 | 26.2 | 60.6 | 44.7 | 34.8 | 61.8 | 48.6 | 52.7 | 52.8 | 55.4 | 38.7 |
| OmniCaptioner-8B | DS-Qwen-7B | 0.62 | 22.9 | 51.7 | 38.6 | 32.2 | 53.1 | 41.2 | 51.4 | 47.5 | 53.1 | 42.5 |
| MetaCaptioner-8B | DS-Qwen-7B | 1.23 | 27.2 | 61.5 | 47.8 | 37.2 | 62.7 | 49.0 | 53.3 | 54.8 | 57.8 | 49.4 |
| OmniCaptioner-8B | DS-Qwen-32B | 0.64 | 24.7 | 56.0 | 39.3 | 33.1 | 57.4 | 48.0 | 55.3 | 59.2 | 66.6 | 48.3 |
| MetaCaptioner-8B | DS-Qwen-32B | 1.49 | 26.7 | 65.1 | 49.9 | 38.5 | 66.5 | 57.0 | 57.5 | 66.8 | 74.4 | 55.3 |
| Model | MMB-Video | VideoMME | MathVista | MathVerse | MathVision | DocVQA | ChartQA | InfoVQA | MMStar | MMMU | MMB | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GLM4.1V-9B | 1.63 | 68.2 | 80.7 | 68.4 | 54.4 | 93.3 | 70.0 | 80.3 | 72.9 | 68.0 | 85.8 | 72.4 |
| Keye-VL-8B | - | 67.7 | 80.7 | 54.8 | 50.8 | 87.0 | 72.5 | 63.0 | 72.8 | 71.4 | 76.3 | - |
| Qwen2.5-VL-7B | 1.79 | 65.1 | 67.8 | 41.1 | 25.4 | 95.3 | 87.3 | 82.6 | 63.9 | 55.0 | 82.6 | 67.1 |
| MiniCPM-V2.6-8B | 1.70 | 60.9 | 73.3 | 35.0 | 21.7 | 90.8 | 82.4 | - | 57.5 | 50.9 | 78.0 | 60.9 |
| InternVL3-8B | 1.69 | 66.3 | 71.6 | 39.8 | 29.3 | 92.7 | 86.6 | 76.8 | 68.2 | 62.7 | 81.7 | 66.6 |
| InternVL3.5-8B-Instruct | 1.67 | 64.2 | 74.2 | 55.8 | 46.4 | 92.0 | 86.2 | 76.2 | 66.5 | 68.1 | 79.5 | 69.1 |
| MetaCaptioner-8B | 1.76 | 64.2 | 75.8 | 56.5 | 52.6 | 93.0 | 86.8 | 76.6 | 66.7 | 69.5 | 80.8 | 71.1 |
- Local demo of CapFlow
- Local demo of MetaCaptioner
- Checkpoints of MetaCaptioner-8B
- Evaluation demo of the captioner
We release the captioning pipeline CapFlow and the caption model MetaCaptioner-8B.
CapFlow is mainly built on vLLM, while MetaCaptioner is mainly built on LMDeploy and Transformers. Use the following commands to set up the development environment:
```bash
# System environment: cuda==12.1, python==3.10, and torch==2.5.1
# CapFlow is mainly built upon vllm==0.7.2 with flash_attn==2.7.3
# MetaCaptioner is mainly built upon lmdeploy==0.10.1 and transformers==4.52.1
conda create -n MetaCap python=3.10 -y
conda activate MetaCap
pip install --upgrade pip
pip install -r requirements.txt
```
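After installation, you can optionally sanity-check the environment against the versions pinned above (a minimal sketch; it only inspects the local torch build):

```python
# Quick sanity check for the pinned environment above.
import torch

print(torch.__version__)          # expect 2.5.1 (a +cu121 build)
print(torch.version.cuda)         # expect 12.1
print(torch.cuda.is_available())  # expect True on a GPU machine
```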
We provide the visual captioning pipeline CapFlow, which consists of three main steps: dynamic domain routing, the visual caption pipeline, and post-filtering.
Once your data is prepared, run the following commands:
```bash
cd Capflow
# Step 1: infer the domain of each sample (optional)
bash script/run_domain.sh
# Step 2: run the caption pipeline
bash script/run_caption.sh
# Step 3: post-filtering
bash script/run_filter.sh
```
See Readme.md for more details on data preprocessing and parameter configuration.
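The exact input schema is defined in the CapFlow Readme.md; purely as an illustration (every field name below is hypothetical, not the pipeline's actual schema), a prepared sample can be thought of as an image path plus slots that the three steps fill in:

```python
# Hypothetical sketch of one record flowing through CapFlow's three steps;
# the real field names and layout are defined in Capflow/Readme.md.
import json

record = {
    "image": "data/images/000001.jpg",  # hypothetical input field
    "domain": None,   # filled by run_domain.sh (optional routing step)
    "caption": None,  # filled by run_caption.sh
    "keep": None,     # decided by run_filter.sh (post-filtering)
}
print(json.dumps(record, indent=2))
```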
You can find the MetaCaptioner checkpoint at MetaCaptioner-8B. After downloading, use the following commands to run inference:
```bash
cd MetaCaptioner
bash run.sh image 1
```
See more details in Readme.md.
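Since MetaCaptioner is mainly built on LMDeploy (see the environment notes above), programmatic inference along the following lines should also work. This is a minimal sketch, not the official entry point (which remains run.sh); the checkpoint and image paths are placeholders to point at your own files:

```python
# Minimal sketch using LMDeploy's vision-language pipeline (lmdeploy==0.10.1).
# 'path/to/MetaCaptioner-8B' and the image path below are placeholders.
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('path/to/MetaCaptioner-8B')          # downloaded checkpoint dir
image = load_image('path/to/example.jpg')            # any local or remote image
response = pipe(('Describe this image in detail.', image))
print(response.text)
```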
Use the following commands to evaluate the captioner's performance:
```bash
cd VLMEvalkit
bash run.sh
```
For more examples, please refer to the appendix in our paper.
This project is released under the MIT license.
If you find our work helpful, please consider giving us a ⭐ and citing our paper:
```bibtex
@misc{lei2025metacaptionergeneralistvisualcaptioning,
      title={MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites},
      author={Zhenxin Lei and Zhangwei Gao and Changyao Tian and Erfei Cui and Guanzhou Chen and Danni Yang and Yuchen Duan and Zhaokai Wang and Wenhao Li and Weiyun Wang and Xiangyu Zhao and Jiayi Ji and Yu Qiao and Wenhai Wang and Gen Luo},
      year={2025},
      eprint={2510.12126},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.12126},
}
```