README.md (+1, -1)

@@ -63,7 +63,7 @@ Your browser does not support the video tag.
 - 2024/2/27 [MVBench](./video_chat2) is accepted by CVPR2024.
 - 2023/11/29 VideoChat2 and MVBench are released.
 - [VideoChat2](./video_chat2/) is a robust baseline built on [UMT](https://github.com/OpenGVLab/unmasked_teacher) and [Vicuna-v0](https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md).
-- **1.9M** diverse [instruction data](./video_chat2/DATA.md) are released for effective tuning.
+- **2M** diverse [instruction data](./video_chat2/DATA.md) are released for effective tuning.
 - [MVBench](./video_chat2/MVBENCH.md) is a comprehensive benchmark for video understanding.

 - 2023/05/11 End-to-end VideoChat and its technical report.

video_chat/README.md (+1, -1)

@@ -7,7 +7,7 @@ In this study, we initiate an exploration into video understanding by introducing...
 # :fire: Updates
 - **2023/11/29** VideoChat2 and MVBench are released.
 - [VideoChat2](../video_chat2/) is a strong baseline built on [UMT](https://github.com/OpenGVLab/unmasked_teacher) and [Vicuna-v0](https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md).
-- **1.9M** diverse [instruction data](../video_chat2/DATA.md) are released for effective tuning.
+- **2M** diverse [instruction data](../video_chat2/DATA.md) are released for effective tuning.
 - [MVBench](../video_chat2/MVBENCH.md) is a comprehensive benchmark for video understanding.

 - **2023/06/09**: Release code and scripts for pre-training and instruction tuning:

video_chat2/DATA.md (+1, -1)

@@ -5,7 +5,7 @@

 ## Annotations
-A comprehensive dataset of **1.9M** data annotations is available in [JSON](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT) format. Due to the extensive size of the full data, we provide only JSON files [here](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT). For corresponding images and videos, please follow our instructions.
+A comprehensive dataset of **2M** data annotations is available in [JSON](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT) format. Due to the extensive size of the full data, we provide only JSON files [here](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT). For corresponding images and videos, please follow our instructions.

video_chat2/README.md

 With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in static image tasks, while overlooking temporal understanding in dynamic video tasks. To alleviate this issue, we introduce a comprehensive **M**ulti-modal **V**ideo understanding **Bench**mark, namely **MVBench**, which covers **20** challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporally related tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On the one hand, such a distinct paradigm allows us to build MVBench efficiently, without much manual intervention. On the other hand, it guarantees evaluation fairness with ground-truth video annotations, avoiding the biased scoring of LLMs. Moreover, we further develop a robust video MLLM baseline, i.e., **VideoChat2**, by progressive multi-modal training with diverse instruction-tuning data. The extensive results on our MVBench reveal that existing MLLMs are far from satisfactory in temporal understanding, while our **VideoChat2** largely surpasses these leading models by over **15%** on MVBench.

 ## :fire: Updates
-- **2024/05/22**: :loudspeaker: We release **VideoChat2_mistral**, which shows better capacity on diverse tasks (**60.4% on MVBench, 78.6% on NExT-QA, 63.8% on STAR, 46.4% on TVQA, 54.4% on EgoSchema-full and 80.5% on IntentQA**). More details will be updated in the paper. Have a try! 🏃🏻♀️🏃🏻
+- **2024/05/22**: :loudspeaker: We release **VideoChat2_mistral**, which shows better capacity on diverse tasks (**60.4% on MVBench, 78.6% on NExT-QA, 63.8% on STAR, 46.4% on TVQA, 54.4% on EgoSchema-full and 80.5% on IntentQA**). More details have been updated in the paper. Have a try! 🏃🏻♀️🏃🏻
 - **2024/04/05**: MVBench is selected as Poster (**Highlight**)! 🎉🎉
 - **2024/02/27**: MVBench is accepted by CVPR2024! 🎉🎉
 - **2023/12/17**: Online Leaderboard:
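
The abstract above says that, guided by each task definition, public video annotations are automatically converted into multiple-choice QA. A minimal illustrative sketch of that conversion follows; it is not the MVBench code, and the annotation fields (`video`, `question`, `answer`, a shared `label_pool`) and the four-option layout are assumptions made only for the example.

```python
# Illustrative only: turn one ground-truth annotation into a multiple-choice QA item.
# The field names and option format are assumptions, not the actual MVBench schema.
import random


def make_mcq(video_id: str, question: str, answer: str, label_pool: list,
             num_options: int = 4, seed: int = 0) -> dict:
    """Build a single multiple-choice item from a ground-truth label."""
    rng = random.Random(seed)
    # Distractors are other ground-truth labels drawn from the same pool.
    distractors = rng.sample([lab for lab in label_pool if lab != answer],
                             num_options - 1)
    options = distractors + [answer]
    rng.shuffle(options)
    letters = "ABCDEFGH"[:num_options]
    return {
        "video": video_id,
        "question": question,
        "options": dict(zip(letters, options)),
        "answer": letters[options.index(answer)],  # ground truth fixes the correct letter
    }


if __name__ == "__main__":
    pool = ["open the door", "close the door", "pick up a cup", "put down a cup"]
    print(make_mcq("v_0001.mp4", "What does the person do in the video?",
                   "pick up a cup", pool))
```

Because the correct option and its distractors come straight from ground-truth annotations, scoring reduces to matching a predicted letter, which is the evaluation-fairness property the abstract emphasizes.
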
@@ -47,7 +46,7 @@ With the rapid development of Multi-modal Large Language Models (MLLMs), a number...
 - **2023/11/29**: Release **VideoChat2** and **MVBench**:
 - [VideoChat2](https://arxiv.org/abs/2311.17005) is a robust baseline built on [UMT](https://github.com/OpenGVLab/unmasked_teacher) and [Vicuna-v0](https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md).
-- **1.9M** diverse [instruction data](./DATA.md) are released for effective tuning.
+- **2M** diverse [instruction data](./DATA.md) are released for effective tuning.
 - [MVBench](./MVBENCH.md) is a comprehensive benchmark for video understanding.

@@ -59,6 +58,10 @@ With the rapid development of Multi-modal Large Language Models (MLLMs), a number...
 **Stage1** aligns UMT-L, the visual encoder, with QFormer to efficiently compress extensive visual inputs. **Stage2** extends this connection to incorporate LLM, while **Stage3** focuses on effective instruction tuning to enhance model performance.

+#### [Instruction Data](./DATA.md)
+
+We build diverse instruction data with **2M** samples from 34 distinct sources. Check [DATA](./DATA.md) for more details.
+
 #### Model

 || ViT | QFormer | LLM | LoRA | shell (Vicuna) | Model (Vicuna) | shell (Mistral) | Model (Mistral) |
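
The stage description above (UMT-L as the visual encoder, QFormer compressing visual tokens, then an LLM, trained progressively) can be pictured as a per-stage freezing schedule. The sketch below is a toy stand-in rather than the repository's training code: the module names, the tiny linear layers, and the exact sets of trainable parameters per stage are assumptions; the table above also indicates LoRA is used for the LLM, which the sketch simplifies to plain unfreezing.

```python
# Toy sketch of progressive multi-stage training; the modules and the freezing
# schedule are illustrative assumptions, not the repository's implementation.
import torch.nn as nn


class ToyVideoChat(nn.Module):
    def __init__(self, dim: int = 32):
        super().__init__()
        self.visual_encoder = nn.Linear(dim, dim)  # stand-in for UMT-L
        self.qformer = nn.Linear(dim, dim)         # stand-in for QFormer
        self.proj = nn.Linear(dim, dim)            # visual-token projection into the LLM
        self.llm = nn.Linear(dim, dim)             # stand-in for Vicuna/Mistral

    def forward(self, x):
        return self.llm(self.proj(self.qformer(self.visual_encoder(x))))


def configure_stage(model: ToyVideoChat, stage: int) -> None:
    """Freeze everything, then unfreeze the parts trained in the given stage."""
    for p in model.parameters():
        p.requires_grad_(False)
    trainable = {
        1: [model.visual_encoder, model.qformer],   # align vision with QFormer
        2: [model.qformer, model.proj],             # connect compressed tokens to the LLM
        3: [model.qformer, model.proj, model.llm],  # instruction tuning (LoRA in practice)
    }[stage]
    for module in trainable:
        for p in module.parameters():
            p.requires_grad_(True)


model = ToyVideoChat()
for stage in (1, 2, 3):
    configure_stage(model, stage)
    names = [n for n, p in model.named_parameters() if p.requires_grad]
    print(f"stage {stage}: trainable -> {sorted(set(n.split('.')[0] for n in names))}")
```

Running the loop prints which submodules are trainable at each stage, mirroring the progressive schedule described in the paragraph above.
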
@@ -113,10 +116,6 @@ With the rapid development of Multi-modal Large Language Models (MLLMs), a number...
 > - For **IntentQA**, we report the result on validation split, and the result on testing is slightly better (81.9\%).

-#### [Instruction Data](./DATA.md)

 #### Usage

 - Prepare the environment:
 ```shell

@@ -167,8 +166,8 @@ With the rapid development of Multi-modal Large Language Models (MLLMs), a number...

 We propose a comprehensive video understanding benchmark with **20** challenging video tasks, where our **VideoChat2** secures the top ranking on **15** tasks. More details can be found [here](./MVBENCH.md).

+**The online leaderboard is held in :hugs: [Hugging Face](https://huggingface.co/spaces/OpenGVLab/MVBench_Leaderboard).**