README.md (+1, -1)

@@ -63,7 +63,7 @@ Your browser does not support the video tag.
 - 2024/2/27 [MVBench](./video_chat2) is accepted by CVPR2024.
 - 2023/11/29 VideoChat2 and MVBench are released.
 - [VideoChat2](./video_chat2/) is a robust baseline built on [UMT](https://github.com/OpenGVLab/unmasked_teacher) and [Vicuna-v0](https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md).
-- **1.9M** diverse [instruction data](./video_chat2/DATA.md) are released for effective tuning.
+- **2M** diverse [instruction data](./video_chat2/DATA.md) are released for effective tuning.
 - [MVBench](./video_chat2/MVBENCH.md) is a comprehensive benchmark for video understanding.

 - 2023/05/11 End-to-end VideoChat and its technical report.

video_chat/README.md (+1, -1)

@@ -7,7 +7,7 @@ In this study, we initiate an exploration into video understanding by introducing...
 # :fire: Updates
 - **2023/11/29** VideoChat2 and MVBench are released.
 - [VideoChat2](../video_chat2/) is a strong baseline built on [UMT](https://github.com/OpenGVLab/unmasked_teacher) and [Vicuna-v0](https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md).
-- **1.9M** diverse [instruction data](../video_chat2/DATA.md) are released for effective tuning.
+- **2M** diverse [instruction data](../video_chat2/DATA.md) are released for effective tuning.
 - [MVBench](../video_chat2/MVBENCH.md) is a comprehensive benchmark for video understanding.

 - **2023/06/09**: Release code and scripts for pre-training and instruction tuning:

video_chat2/DATA.md (+1, -1)

@@ -5,7 +5,7 @@

 ## Annotations
-A comprehensive dataset of **1.9M** data annotations is available in [JSON](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT) format. Due to the extensive size of the full data, we provide only JSON files [here](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT). For corresponding images and videos, please follow our instructions.
+A comprehensive dataset of **2M** data annotations is available in [JSON](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT) format. Due to the extensive size of the full data, we provide only JSON files [here](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT). For corresponding images and videos, please follow our instructions.

video_chat2/README.md

 With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in static image tasks, while overlooking temporal understanding in dynamic video tasks. To alleviate this issue, we introduce a comprehensive **M**ulti-modal **V**ideo understanding **Bench**mark, namely **MVBench**, which covers **20** challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporally related tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On the one hand, such a distinct paradigm allows us to build MVBench efficiently, without much manual intervention. On the other hand, it guarantees evaluation fairness with ground-truth video annotations, avoiding the biased scoring of LLMs. Moreover, we further develop a robust video MLLM baseline, i.e., **VideoChat2**, by progressive multi-modal training with diverse instruction-tuning data. The extensive results on our MVBench reveal that existing MLLMs are far from satisfactory in temporal understanding, while our **VideoChat2** largely surpasses these leading models by over **15%** on MVBench.

 ## :fire: Updates
-- **2024/05/22**: :loudspeaker: We release **VideoChat2_mistral**, which shows better capacity on diverse tasks (**60.4% on MVBench, 78.6% on NExT-QA, 63.8% on STAR, 46.4% on TVQA, 54.4% on EgoSchema-full and 80.5% on IntentQA**). More details will be updated in the paper. Have a try! 🏃🏻♀️🏃🏻
+- **2024/05/22**: :loudspeaker: We release **VideoChat2_mistral**, which shows better capacity on diverse tasks (**60.4% on MVBench, 78.6% on NExT-QA, 63.8% on STAR, 46.4% on TVQA, 54.4% on EgoSchema-full and 80.5% on IntentQA**). More details have been updated in the paper. Have a try! 🏃🏻♀️🏃🏻
 - **2024/04/05**: MVBench is selected as Poster (**Highlight**)! 🎉🎉
 - **2024/02/27**: MVBench is accepted by CVPR2024! 🎉🎉
 - **2023/12/17**: Online Leaderboard:
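
The abstract above says that, guided by each task definition, public video annotations are automatically converted into multiple-choice QA. A minimal illustrative sketch of that conversion follows; it is not the MVBench code, and the annotation fields (`video`, `question`, `answer`, a shared `label_pool`) and the four-option layout are assumptions made only for the example.

```python
# Illustrative only: turn one ground-truth annotation into a multiple-choice QA item.
# The field names and option format are assumptions, not the actual MVBench schema.
import random


def make_mcq(video_id: str, question: str, answer: str, label_pool: list,
             num_options: int = 4, seed: int = 0) -> dict:
    """Build a single multiple-choice item from a ground-truth label."""
    rng = random.Random(seed)
    # Distractors are other ground-truth labels drawn from the same pool.
    distractors = rng.sample([lab for lab in label_pool if lab != answer],
                             num_options - 1)
    options = distractors + [answer]
    rng.shuffle(options)
    letters = "ABCDEFGH"[:num_options]
    return {
        "video": video_id,
        "question": question,
        "options": dict(zip(letters, options)),
        "answer": letters[options.index(answer)],  # ground truth fixes the correct letter
    }


if __name__ == "__main__":
    pool = ["open the door", "close the door", "pick up a cup", "put down a cup"]
    print(make_mcq("v_0001.mp4", "What does the person do in the video?",
                   "pick up a cup", pool))
```

Because the correct option and its distractors come straight from ground-truth annotations, scoring reduces to matching a predicted letter, which is the evaluation-fairness property the abstract emphasizes.
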
@@ -47,7 +46,7 @@ With the rapid development of Multi-modal Large Language Models (MLLMs), a number...
 - **2023/11/29**: Release **VideoChat2** and **MVBench**:
 - [VideoChat2](https://arxiv.org/abs/2311.17005) is a robust baseline built on [UMT](https://github.com/OpenGVLab/unmasked_teacher) and [Vicuna-v0](https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md).
-- **1.9M** diverse [instruction data](./DATA.md) are released for effective tuning.
+- **2M** diverse [instruction data](./DATA.md) are released for effective tuning.
 - [MVBench](./MVBENCH.md) is a comprehensive benchmark for video understanding.

@@ -59,6 +58,10 @@ With the rapid development of Multi-modal Large Language Models (MLLMs), a number...
 **Stage1** aligns UMT-L, the visual encoder, with QFormer to efficiently compress extensive visual inputs. **Stage2** extends this connection to incorporate LLM, while **Stage3** focuses on effective instruction tuning to enhance model performance.

+#### [Instruction Data](./DATA.md)
+
+We build diverse instruction data with **2M** samples from 34 distinct sources. Check [DATA](./DATA.md) for more details.
+
 #### Model

 || ViT | QFormer | LLM | LoRA | shell (Vicuna) | Model (Vicuna) | shell (Mistral) | Model (Mistral) |
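
The stage description above (UMT-L as the visual encoder, QFormer compressing visual tokens, then an LLM, trained progressively) can be pictured as a per-stage freezing schedule. The sketch below is a toy stand-in rather than the repository's training code: the module names, the tiny linear layers, and the exact sets of trainable parameters per stage are assumptions; the table above also indicates LoRA is used for the LLM, which the sketch simplifies to plain unfreezing.

```python
# Toy sketch of progressive multi-stage training; the modules and the freezing
# schedule are illustrative assumptions, not the repository's implementation.
import torch.nn as nn


class ToyVideoChat(nn.Module):
    def __init__(self, dim: int = 32):
        super().__init__()
        self.visual_encoder = nn.Linear(dim, dim)  # stand-in for UMT-L
        self.qformer = nn.Linear(dim, dim)         # stand-in for QFormer
        self.proj = nn.Linear(dim, dim)            # visual-token projection into the LLM
        self.llm = nn.Linear(dim, dim)             # stand-in for Vicuna/Mistral

    def forward(self, x):
        return self.llm(self.proj(self.qformer(self.visual_encoder(x))))


def configure_stage(model: ToyVideoChat, stage: int) -> None:
    """Freeze everything, then unfreeze the parts trained in the given stage."""
    for p in model.parameters():
        p.requires_grad_(False)
    trainable = {
        1: [model.visual_encoder, model.qformer],   # align vision with QFormer
        2: [model.qformer, model.proj],             # connect compressed tokens to the LLM
        3: [model.qformer, model.proj, model.llm],  # instruction tuning (LoRA in practice)
    }[stage]
    for module in trainable:
        for p in module.parameters():
            p.requires_grad_(True)


model = ToyVideoChat()
for stage in (1, 2, 3):
    configure_stage(model, stage)
    names = [n for n, p in model.named_parameters() if p.requires_grad]
    print(f"stage {stage}: trainable -> {sorted(set(n.split('.')[0] for n in names))}")
```

Running the loop prints which submodules are trainable at each stage, mirroring the progressive schedule described in the paragraph above.
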
@@ -113,10 +116,6 @@ With the rapid development of Multi-modal Large Language Models (MLLMs), a number...
 > - For **IntentQA**, we report the result on validation split, and the result on testing is slightly better (81.9\%).

-#### [Instruction Data](./DATA.md)

 #### Usage

 - Prepare the environment:
 ```shell

@@ -167,8 +166,8 @@ With the rapid development of Multi-modal Large Language Models (MLLMs), a number...

 We propose a comprehensive video understanding benchmark with **20** challenging video tasks, where our **VideoChat2** secures the top ranking on **15** tasks. More details can be found [here](./MVBENCH.md).

+**The online leaderboard is held in :hugs: [Hugging Face](https://huggingface.co/spaces/OpenGVLab/MVBench_Leaderboard).**