Zhenxin Lei, Zhangwei Gao, Changyao Tian, Erfei Cui, Guanzhou Chen, Danni Yang, Yuchen Duan, Zhaokai Wang, Wenhao Li, Weiyun Wang, Xiangyu Zhao, Jiayi Ji, Yu Qiao, Wenhai Wang, Gen Luo
- [2025/10/17] MetaCaptioner-8B is released!
- [2025/10/16] The paper and project page are released!
Generalist visual captioning goes beyond simple appearance description: it requires integrating a variety of visual cues into a single caption and handling diverse visual domains.
On this task, current open-source models show a large performance gap relative to commercial ones, which limits applications such as data synthesis. To bridge this gap, we propose CapFlow, a novel multi-agent collaboration workflow. CapFlow demonstrates for the first time that, building on open-source models alone, it is possible to match GPT-4.1's caption quality across various domains at an 89.5% reduction in cost. Using CapFlow as the data synthesizer, we produce high-quality image and video captions at scale and fine-tune a generalist visual captioner on them, named MetaCaptioner.
Through extensive experiments, we show that MetaCaptioner not only achieves captioning capabilities comparable to commercial models but also reaches top-tier multimodal performance in the open-source community. We hope CapFlow and MetaCaptioner can benefit future multimodal research by providing a strong and cost-effective visual captioning solution.
| Model | MMMU | MMVet | MathVerse | MathVista | ChartQA | InfoVQA | AI2D | VideoMME | Cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4.1 | 55.7 | 61.7 | 56.8 | 65.0 | 62.3 | 63.2 | 75.5 | 26.8 | $1.47 |
| CapFlow with Qwen2.5-VL-72B | 55.1 | 57.8 | 53.1 | 62.5 | 59.2 | 50.2 | 74.2 | 27.6 | $0.14 |
| Captioner | LLM | MMB-Video | VideoMME | MathVista | MathVerse | MathVision | SEED2-Plus | InfoVQA | MMStar | MMMU | MMB | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2-VL-7B | DS-Qwen-7B | 0.57 | 20.8 | 47.7 | 40.5 | 31.6 | 56.6 | 46.0 | 43.8 | 42.4 | 54.7 | 37.4 |
| InternVL3.5-8B | DS-Qwen-7B | 0.88 | 26.2 | 60.6 | 44.7 | 34.8 | 61.8 | 48.6 | 52.7 | 52.8 | 55.4 | 38.7 |
| OmniCaptioner-8B | DS-Qwen-7B | 0.62 | 22.9 | 51.7 | 38.6 | 32.2 | 53.1 | 41.2 | 51.4 | 47.5 | 53.1 | 42.5 |
| MetaCaptioner-8B | DS-Qwen-7B | 1.23 | 27.2 | 61.5 | 47.8 | 37.2 | 62.7 | 49.0 | 53.3 | 54.8 | 57.8 | 49.4 |
| OmniCaptioner-8B | DS-Qwen-32B | 0.64 | 24.7 | 56.0 | 39.3 | 33.1 | 57.4 | 48.0 | 55.3 | 59.2 | 66.6 | 48.3 |
| MetaCaptioner-8B | DS-Qwen-32B | 1.49 | 26.7 | 65.1 | 49.9 | 38.5 | 66.5 | 57.0 | 57.5 | 66.8 | 74.4 | 55.3 |
| Model | MMB-Video | VideoMME | MathVista | MathVerse | MathVision | DocVQA | ChartQA | InfoVQA | MMStar | MMMU | MMB | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GLM4.1V-9B | 1.63 | 68.2 | 80.7 | 68.4 | 54.4 | 93.3 | 70.0 | 80.3 | 72.9 | 68.0 | 85.8 | 72.4 |
| Keye-VL-8B | - | 67.7 | 80.7 | 54.8 | 50.8 | 87.0 | 72.5 | 63.0 | 72.8 | 71.4 | 76.3 | - |
| Qwen2.5-VL-7B | 1.79 | 65.1 | 67.8 | 41.1 | 25.4 | 95.3 | 87.3 | 82.6 | 63.9 | 55.0 | 82.6 | 67.1 |
| MiniCPM-V2.6-8B | 1.70 | 60.9 | 73.3 | 35.0 | 21.7 | 90.8 | 82.4 | - | 57.5 | 50.9 | 78.0 | 60.9 |
| InternVL3-8B | 1.69 | 66.3 | 71.6 | 39.8 | 29.3 | 92.7 | 86.6 | 76.8 | 68.2 | 62.7 | 81.7 | 66.6 |
| InternVL3.5-8B-Instruct | 1.67 | 64.2 | 74.2 | 55.8 | 46.4 | 92.0 | 86.2 | 76.2 | 66.5 | 68.1 | 79.5 | 69.1 |
| MetaCaptioner-8B | 1.76 | 64.2 | 75.8 | 56.5 | 52.6 | 93.0 | 86.8 | 76.6 | 66.7 | 69.5 | 80.8 | 71.1 |
- Local demo of CapFlow
- Local demo of MetaCaptioner
- Checkpoints of MetaCaptioner-8B
- Evaluation demo of the captioner
We release the captioning pipeline CapFlow and the caption model MetaCaptioner-8B.
CapFlow is mainly built on vLLM, while MetaCaptioner is mainly built on LMDeploy and Transformers. Use the following commands to set up the development environment:
```bash
# System environment: cuda==12.1, python==3.10, and torch==2.5.1
# CapFlow is mainly built upon vllm==0.7.2 with flash_attn==2.7.3
# MetaCaptioner is mainly built upon lmdeploy==0.10.1 and transformers==4.52.1
conda create -n MetaCap python=3.10 -y
conda activate MetaCap
pip install --upgrade pip
pip install -r requirements.txt
```
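After installation, you can optionally sanity-check the environment against the versions pinned above (a minimal sketch; it only inspects the local torch build):

```python
# Quick sanity check for the pinned environment above.
import torch

print(torch.__version__)          # expect 2.5.1 (a +cu121 build)
print(torch.version.cuda)         # expect 12.1
print(torch.cuda.is_available())  # expect True on a GPU machine
```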
We provide the visual captioning pipeline CapFlow, which consists of three main steps: dynamic domain routing, the visual caption pipeline, and post-filtering.
Once your data is prepared, run the following commands:
```bash
cd Capflow
# Step 1: infer the domain of each sample (optional)
bash script/run_domain.sh
# Step 2: run the caption pipeline
bash script/run_caption.sh
# Step 3: post-filtering
bash script/run_filter.sh
```
See Readme.md for more details on data preprocessing and parameter configuration.
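The exact input schema is defined in the CapFlow Readme.md; purely as an illustration (every field name below is hypothetical, not the pipeline's actual schema), a prepared sample can be thought of as an image path plus slots that the three steps fill in:

```python
# Hypothetical sketch of one record flowing through CapFlow's three steps;
# the real field names and layout are defined in Capflow/Readme.md.
import json

record = {
    "image": "data/images/000001.jpg",  # hypothetical input field
    "domain": None,   # filled by run_domain.sh (optional routing step)
    "caption": None,  # filled by run_caption.sh
    "keep": None,     # decided by run_filter.sh (post-filtering)
}
print(json.dumps(record, indent=2))
```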
You can find the MetaCaptioner checkpoint at MetaCaptioner-8B. After downloading, use the following commands to run inference:
```bash
cd MetaCaptioner
bash run.sh image 1
```
See more details in Readme.md.
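Since MetaCaptioner is mainly built on LMDeploy (see the environment notes above), programmatic inference along the following lines should also work. This is a minimal sketch, not the official entry point (which remains run.sh); the checkpoint and image paths are placeholders to point at your own files:

```python
# Minimal sketch using LMDeploy's vision-language pipeline (lmdeploy==0.10.1).
# 'path/to/MetaCaptioner-8B' and the image path below are placeholders.
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('path/to/MetaCaptioner-8B')          # downloaded checkpoint dir
image = load_image('path/to/example.jpg')            # any local or remote image
response = pipe(('Describe this image in detail.', image))
print(response.text)
```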
Use the following commands to evaluate the captioner's performance:
```bash
cd VLMEvalkit
bash run.sh
```
For more examples, please refer to the appendix in our paper.
This project is released under the MIT license.
If you find our work helpful, please consider giving us a ⭐ and citing our paper:
```bibtex
@misc{lei2025metacaptionergeneralistvisualcaptioning,
      title={MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites},
      author={Zhenxin Lei and Zhangwei Gao and Changyao Tian and Erfei Cui and Guanzhou Chen and Danni Yang and Yuchen Duan and Zhaokai Wang and Wenhao Li and Weiyun Wang and Xiangyu Zhao and Jiayi Ji and Yu Qiao and Wenhai Wang and Gen Luo},
      year={2025},
      eprint={2510.12126},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.12126},
}
```