
Commit cd1f925

update README
1 parent 3925143 commit cd1f925

7 files changed: +13 −14 lines changed


README.md

+1 −1

@@ -63,7 +63,7 @@ Your browser does not support the video tag.
 - 2024/2/27 [MVBench](./video_chat2) is accepted by CVPR2024.
 - 2023/11/29 VideoChat2 and MVBench are released.
 - [VideoChat2](./video_chat2/) is a robust baseline built on [UMT](https://github.com/OpenGVLab/unmasked_teacher) and [Vicuna-v0](https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md).
-- **1.9M** diverse [instruction data](./video_chat2/DATA.md) are released for effective tuning.
+- **2M** diverse [instruction data](./video_chat2/DATA.md) are released for effective tuning.
 - [MVBench](./video_chat2/MVBENCH.md) is a comprehensive benchmark for video understanding.
 
 - 2023/05/11 End-to-end VideoChat and its technical report.

README_cn.md

+1 −1

@@ -27,7 +27,7 @@
 # :fire: Updates
 - 2023/11/29 VideoChat2 and MVBench released
 - [VideoChat2](./video_chat2/) is a strong baseline built on [UMT](https://github.com/OpenGVLab/unmasked_teacher) and [Vicuna-v0](https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md)
-- **1.9M** diverse [instruction data](./video_chat2/data.md) for effective tuning
+- **2M** diverse [instruction data](./video_chat2/data.md) for effective tuning
 - [MVBench](./video_chat2/MVBench.md) is a comprehensive benchmark for video understanding
 
 - 2023/05/11 End-to-end VideoChat

video_chat/README.md

+1 −1

@@ -7,7 +7,7 @@ In this study, we initiate an exploration into video understanding by introducin
 # :fire: Updates
 - **2023/11/29** VideoChat2 and MVBench are released.
 - [VideoChat2](../video_chat2/) is a strong baseline built on [UMT](https://github.com/OpenGVLab/unmasked_teacher) and [Vicuna-v0](https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md).
-- **1.9M** diverse [instruction data](../video_chat2/DATA.md) are released for effective tuning.
+- **2M** diverse [instruction data](../video_chat2/DATA.md) are released for effective tuning.
 - [MVBench](../video_chat2/MVBENCH.md) is a comprehensive benchmark for video understanding.
 
 - **2023/06/09**: Release code and scripts for pre-training and instruction tuning:

video_chat/README_CN.md

+1 −1

@@ -11,7 +11,7 @@
 ## 🔥 Updates
 - **2023/11/29** VideoChat2 and MVBench released:
 - [VideoChat2](./video_chat2/) is a strong baseline built on [UMT](https://github.com/OpenGVLab/unmasked_teacher) and [Vicuna-v0](https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md)
-- **1.9M** diverse [instruction data](./video_chat2/data.md) for effective tuning
+- **2M** diverse [instruction data](./video_chat2/data.md) for effective tuning
 - [MVBench](./video_chat2/MVBench.md) is a comprehensive benchmark for video understanding
 - **2023/06/09**: Release code and training/fine-tuning scripts:
 - Directly run the [scripts](./scripts), e.g. `bash ./exp/run_7b_stage1.sh`.

video_chat2/DATA.md

+1 −1

@@ -5,7 +5,7 @@
 ![images](./assert/data.png)
 
 ## Annotations
-A comprehensive dataset of **1.9M** data annotations is available in [JSON](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT) format. Due to the extensive size of the full data, we provide only JSON files [here](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT). For corresponding images and videos, please follow our instructions.
+A comprehensive dataset of **2M** data annotations is available in [JSON](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT) format. Due to the extensive size of the full data, we provide only JSON files [here](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT). For corresponding images and videos, please follow our instructions.
 
 ## Source data
 
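Since the annotations above ship purely as JSON on the Hugging Face Hub, a minimal Python sketch of pulling one file down and inspecting its structure may help; the `filename` value below is a placeholder, so substitute a real JSON path from the dataset repo's file listing.

```python
import json

from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Download one annotation file from the VideoChat2-IT dataset repo.
# NOTE: the filename below is a placeholder, not a real path in the repo.
ann_path = hf_hub_download(
    repo_id="OpenGVLab/VideoChat2-IT",
    repo_type="dataset",
    filename="some_subset/annotations.json",  # replace with an actual JSON path
)

with open(ann_path, "r", encoding="utf-8") as f:
    annotations = json.load(f)

# Peek at the structure before wiring up a dataloader; depending on the subset,
# the top level may be a list of records or a dict keyed by sample id.
print(type(annotations), len(annotations))
sample = annotations[0] if isinstance(annotations, list) else next(iter(annotations.values()))
print(json.dumps(sample, indent=2, ensure_ascii=False)[:500])
```

As the DATA.md text notes, the images and videos themselves are not in the repo; the JSON records only reference them, so fetch the source media separately per the dataset instructions.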

video_chat2/README.md

+8 −9

@@ -32,11 +32,10 @@
 [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mvbench-a-comprehensive-multi-modal-video/video-based-generative-performance-5)](https://paperswithcode.com/sota/video-based-generative-performance-5?p=mvbench-a-comprehensive-multi-modal-video)
 [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mvbench-a-comprehensive-multi-modal-video/video-based-generative-performance-4)](https://paperswithcode.com/sota/video-based-generative-performance-4?p=mvbench-a-comprehensive-multi-modal-video)
 
-![images](./assert/overview.png)
-With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in the static image tasks, while overlooking temporal understanding in the dynamic video tasks. To alleviate this issue, we introduce a comprehensive **M**ulti-modal **V**ideo understanding **Bench**mark, namely **MVBench**, which covers **20** challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On one hand, such a distinct paradigm allows us to build MVBench efficiently, without much manual intervention. On the other hand, it guarantees evaluation fairness with ground-truth video annotations, avoiding the biased scoring of LLMs. Moreover, we further develop a robust video MLLM baseline, i.e., **VideoChat2**, by progressive multi-modal training with diverse instruction-tuning data. The extensive results on our MVBench reveal that, the existing MLLMs are far from satisfactory in temporal understanding, while our **VideoChat2** largely surpasses these leading models by over **15%** on MVBench.
+![images](./assert/mvbench_poster.jpg)
 
 ## :fire: Updates
-- **2024/05/22**: :loudspeaker: We release **VideoChat2_mistral**, which shows better capacity on diverse tasks (**60.4% on MVBench, 78.6% on NExT-QA, 63.8% on STAR, 46.4% on TVQA, 54.4% on EgoSchema-full and 80.5% on IntentQA**). More details will be updated in the paper. Have a try! 🏃🏻‍♀️🏃🏻
+- **2024/05/22**: :loudspeaker: We release **VideoChat2_mistral**, which shows better capacity on diverse tasks (**60.4% on MVBench, 78.6% on NExT-QA, 63.8% on STAR, 46.4% on TVQA, 54.4% on EgoSchema-full and 80.5% on IntentQA**). More details have been updated in the paper. Have a try! 🏃🏻‍♀️🏃🏻
 - **2024/04/05**: MVBench is selected as Poster (**Highlight**)! 🎉🎉
 - **2024/02/27**: MVBench is accepted by CVPR2024! 🎉🎉
 - **2023/12/17**: Online Leaderboard:
@@ -47,7 +46,7 @@ With the rapid development of Multi-modal Large Language Models (MLLMs), a numbe
 - :film_projector: [YouTube Video](https://www.youtube.com/watch?v=OMXlbt7A2OU&t=6s), [BiliBili Video](https://www.bilibili.com/video/BV1Qc411Q7Ud/)
 - **2023/11/29**: Release **VideoChat2** and **MVBench**:
 - [VideoChat2](https://arxiv.org/abs/2311.17005) is a robust baseline built on [UMT](https://github.com/OpenGVLab/unmasked_teacher) and [Vicuna-v0](https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md).
-- **1.9M** diverse [instruction data](./DATA.md) are released for effective tuning.
+- **2M** diverse [instruction data](./DATA.md) are released for effective tuning.
 - [MVBench](./MVBENCH.md) is a comprehensive benchmark for video understanding.
 
 
@@ -59,6 +58,10 @@ With the rapid development of Multi-modal Large Language Models (MLLMs), a numbe
 ![images](./assert/training.png)
 **Stage1** aligns UMT-L, the visual encoder, with QFormer to efficiently compress extensive visual inputs. **Stage2** extends this connection to incorporate the LLM, while **Stage3** focuses on effective instruction tuning to enhance model performance.
 
+#### [Instruction Data](./DATA.md)
+
+We build a diverse instruction dataset with **2M** samples from 34 distinct sources. Check [DATA](./DATA.md) for more details.
+
 #### Model
 
 | | ViT | QFormer | LLM | LoRA | shell (Vicuna) | Model (Vicuna) | shell (Mistral) | Model (Mistral) |
@@ -113,10 +116,6 @@ With the rapid development of Multi-modal Large Language Models (MLLMs), a numbe
 > - For **IntentQA**, we report the result on the validation split, and the result on testing is slightly better (81.9\%).
 
 
-#### [Instruction Data](./DATA.md)
-
-![images](./assert/data.png)
-
 #### Usage
 - Prepare the environment:
 ```shell
@@ -167,8 +166,8 @@ With the rapid development of Multi-modal Large Language Models (MLLMs), a numbe
 
 We propose a comprehensive video understanding benchmark with **20** challenging video tasks, where our **VideoChat2** secures the top ranking on **15** tasks. More details can be found [here](./MVBENCH.md).
 
+**The online leaderboard is hosted on :hugs: [Hugging Face](https://huggingface.co/spaces/OpenGVLab/MVBench_Leaderboard).**
 
-![images](./assert/leaderboard.png)
 
 # :page_facing_up: Citation
 
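The Stage1/Stage2/Stage3 description in the video_chat2/README.md diff above amounts to a progressive freeze/unfreeze schedule over the visual encoder, the QFormer, and the LLM. Below is a minimal PyTorch sketch of that idea; the module names, placeholder layers, and per-stage trainable sets are illustrative assumptions, not the repository's actual training configuration.

```python
import torch.nn as nn


class VideoChatLikeModel(nn.Module):
    """Stand-in for visual encoder (UMT-L) + QFormer + LLM; layers are placeholders."""

    def __init__(self):
        super().__init__()
        self.visual_encoder = nn.Linear(1024, 768)  # placeholder for UMT-L
        self.qformer = nn.Linear(768, 768)          # placeholder for the QFormer
        self.llm = nn.Linear(768, 32000)            # placeholder for the LLM head

    def set_trainable(self, names):
        """Freeze everything, then unfreeze the submodules listed in `names`."""
        for p in self.parameters():
            p.requires_grad = False
        for name in names:
            for p in getattr(self, name).parameters():
                p.requires_grad = True


# Stage 1: align the visual encoder with the QFormer (train the QFormer).
# Stage 2: connect the compressed visual tokens to the (frozen) LLM.
# Stage 3: instruction tuning on the 2M instruction data (e.g. LoRA on the LLM side).
STAGES = {
    "stage1": ["qformer"],
    "stage2": ["qformer"],
    "stage3": ["qformer", "llm"],
}

model = VideoChatLikeModel()
for stage, trainable in STAGES.items():
    model.set_trainable(trainable)
    n = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"{stage}: {n} trainable parameters")
    # ... run this stage's training loop on the corresponding data ...
```

In the actual repository, the per-stage shell scripts referenced in the diffs (e.g. `bash ./exp/run_7b_stage1.sh`) are presumably the entry points that drive this kind of schedule.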

video_chat2/assert/mvbench_poster.jpg

691 KB
