
MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites




⭐️ News

📖 Introduction

Generalist visual captioning goes beyond simple appearance description: it requires integrating a range of visual cues into a single caption and handling diverse visual domains.

On this task, current open-source models show a large performance gap relative to commercial ones, which limits applications such as data synthesis. To bridge this gap, this work proposes CapFlow, a novel multi-agent collaboration workflow. CapFlow demonstrates for the first time that, by capitalizing on open-source models, it is possible to achieve caption quality on par with GPT-4.1 across various domains at 89.5% lower cost. Using CapFlow as a data synthesizer, we produce high-quality visual captions from image and video domains at scale, and obtain a generalist visual captioner via fine-tuning, namely MetaCaptioner.

Through extensive experiments, we show that MetaCaptioner not only achieves captioning capability comparable to commercial models but also reaches top-tier multimodal performance in the open-source community. We hope CapFlow and MetaCaptioner can benefit future multimodal research by providing a strong and cost-effective visual captioning solution.

*Overview of the CapFlow framework.*

🚀 Performance

Comparison between our CapFlow and GPT-4.1

| Model | MMMU | MMVet | MathVerse | MathVista | ChartQA | InfoVQA | AI2D | VideoMME | Cost |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 55.7 | 61.7 | 56.8 | 65.0 | 62.3 | 63.2 | 75.5 | 26.8 | $1.47 |
| CapFlow with Qwen2.5-VL-72B | 55.1 | 57.8 | 53.1 | 62.5 | 59.2 | 50.2 | 74.2 | 27.6 | $0.14 |

Comparison of MetaCaptioner and existing captioners under the setting of visual reasoning with LLMs

| Captioner | LLM | MMB-Video | VideoMME | MathVista | MathVerse | MathVision | SEED2-Plus | InfoVQA | MMStar | MMMU | MMB | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2-VL-7B | DS-Qwen-7B | 0.57 | 20.8 | 47.7 | 40.5 | 31.6 | 56.6 | 46.0 | 43.8 | 42.4 | 54.7 | 37.4 |
| InternVL3.5-8B | DS-Qwen-7B | 0.88 | 26.2 | 60.6 | 44.7 | 34.8 | 61.8 | 48.6 | 52.7 | 52.8 | 55.4 | 38.7 |
| OmniCaptioner-8B | DS-Qwen-7B | 0.62 | 22.9 | 51.7 | 38.6 | 32.2 | 53.1 | 41.2 | 51.4 | 47.5 | 53.1 | 42.5 |
| MetaCaptioner-8B | DS-Qwen-7B | 1.23 | 27.2 | 61.5 | 47.8 | 37.2 | 62.7 | 49.0 | 53.3 | 54.8 | 57.8 | 49.4 |
| OmniCaptioner-8B | DS-Qwen-32B | 0.64 | 24.7 | 56.0 | 39.3 | 33.1 | 57.4 | 48.0 | 55.3 | 59.2 | 66.6 | 48.3 |
| MetaCaptioner-8B | DS-Qwen-32B | 1.49 | 26.7 | 65.1 | 49.9 | 38.5 | 66.5 | 57.0 | 57.5 | 66.8 | 74.4 | 55.3 |

Direct performance comparison between MetaCaptioner and existing MLLMs.

| Model | MMB-Video | VideoMME | MathVista | MathVerse | MathVision | DocVQA | ChartQA | InfoVQA | MMStar | MMMU | MMB | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GLM4.1V-9B | 1.63 | 68.2 | 80.7 | 68.4 | 54.4 | 93.3 | 70.0 | 80.3 | 72.9 | 68.0 | 85.8 | 72.4 |
| Keye-VL-8B | - | 67.7 | 80.7 | 54.8 | 50.8 | 87.0 | 72.5 | 63.0 | 72.8 | 71.4 | 76.3 | - |
| Qwen2.5-VL-7B | 1.79 | 65.1 | 67.8 | 41.1 | 25.4 | 95.3 | 87.3 | 82.6 | 63.9 | 55.0 | 82.6 | 67.1 |
| MiniCPM-V2.6-8B | 1.70 | 60.9 | 73.3 | 35.0 | 21.7 | 90.8 | 82.4 | - | 57.5 | 50.9 | 78.0 | 60.9 |
| InternVL3-8B | 1.69 | 66.3 | 71.6 | 39.8 | 29.3 | 92.7 | 86.6 | 76.8 | 68.2 | 62.7 | 81.7 | 66.6 |
| InternVL3.5-8B-Instruct | 1.67 | 64.2 | 74.2 | 55.8 | 46.4 | 92.0 | 86.2 | 76.2 | 66.5 | 68.1 | 79.5 | 69.1 |
| MetaCaptioner-8B | 1.76 | 64.2 | 75.8 | 56.5 | 52.6 | 93.0 | 86.8 | 76.6 | 66.7 | 69.5 | 80.8 | 71.1 |

👨‍💻 Todo

- local demo of CapFlow
- local demo of MetaCaptioner
- checkpoints of MetaCaptioner-8B
- evaluation demo of the captioner

🛠️ Usage

We release the captioning pipeline CapFlow and the caption model MetaCaptioner-8B.

1. Prerequisites

CapFlow is mainly built on vLLM, while MetaCaptioner is mainly built on LMDeploy and Transformers. Use the following commands to set up the development environment.

```shell
# System environment: cuda==12.1, python==3.10, and torch==2.5.1
# CapFlow builds upon vllm==0.7.2 with flash_attn==2.7.3
# MetaCaptioner builds upon lmdeploy==0.10.1 and transformers==4.52.1

conda create -n MetaCap python=3.10 -y
conda activate MetaCap

pip install --upgrade pip
pip install -r requirements.txt
```
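For reference, the version pins called out in the comments above correspond to a requirements file along these lines (a sketch only; the repository's actual `requirements.txt` is authoritative and may include additional dependencies):

```text
torch==2.5.1
vllm==0.7.2
flash_attn==2.7.3
lmdeploy==0.10.1
transformers==4.52.1
```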

2. CapFlow: General Visual Captioning Pipeline

We provide CapFlow, a visual captioning pipeline with three main steps: dynamic domain routing, visual caption generation, and post-filtering.

Run the following commands once the data is prepared:

```shell
cd Capflow
# Infer the domain of each sample (optional)
bash script/run_domain.sh
# Run the caption pipeline
bash ./script/run_caption.sh
# Post-filtering
bash ./script/run_filter.sh
```

See Readme.md for more details on data preprocessing and parameter configuration.
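Conceptually, the three stages chain together as sketched below. This is a hypothetical illustration of the data flow only: the function names, file-extension routing rules, and length/duplicate filtering heuristic are our own inventions, not the repository's API, and `generate_caption` stands in for the actual multi-agent captioning step.

```python
# Illustrative sketch of a three-stage flow: domain routing,
# caption generation, and post-filtering (names are hypothetical).
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}
VIDEO_EXTS = {".mp4", ".avi", ".mkv", ".webm"}

def route_domain(path: str) -> str:
    """Stage 1: assign each sample to a visual domain."""
    ext = Path(path).suffix.lower()
    if ext in IMAGE_EXTS:
        return "image"
    if ext in VIDEO_EXTS:
        return "video"
    return "unknown"

def generate_caption(path: str, domain: str) -> str:
    """Stage 2: placeholder for the multi-agent captioning call."""
    return f"[{domain}] caption for {Path(path).name}"

def post_filter(captions: list[str], min_len: int = 10) -> list[str]:
    """Stage 3: drop duplicates and overly short captions."""
    seen, kept = set(), []
    for cap in captions:
        if len(cap) >= min_len and cap not in seen:
            seen.add(cap)
            kept.append(cap)
    return kept

samples = ["a.jpg", "b.mp4", "a.jpg"]
captions = [generate_caption(p, route_domain(p)) for p in samples]
print(post_filter(captions))
```

In the real pipeline each stage is driven by its own script (`run_domain.sh`, `run_caption.sh`, `run_filter.sh`), so intermediate outputs can be inspected between stages.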

3. MetaCaptioner

You can find the checkpoint of MetaCaptioner at MetaCaptioner-8B. After downloading it, use the following commands to run inference:

```shell
cd MetaCaptioner
bash run.sh image 1
```

See more details in Readme.md.

4. Evaluation

Use the following commands to evaluate the captioner's performance:

```shell
cd VLMEvalkit
bash run.sh
```

🖼 Examples of Capflow and MetaCaptioner

For more examples, please refer to the appendix in our paper.

Visualization Result

📃 License

This project is released under the MIT license.

🖊️ Citation

If you find our work helpful, please consider giving us a ⭐ and citing our paper:

@misc{lei2025metacaptionergeneralistvisualcaptioning,
      title={MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites}, 
      author={Zhenxin Lei and Zhangwei Gao and Changyao Tian and Erfei Cui and Guanzhou Chen and Danni Yang and Yuchen Duan and Zhaokai Wang and Wenhao Li and Weiyun Wang and Xiangyu Zhao and Jiayi Ji and Yu Qiao and Wenhai Wang and Gen Luo},
      year={2025},
      eprint={2510.12126},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.12126}, 
}
