
Commit 2e83d6e
add videochat
1 parent 1da7753 commit 2e83d6e

File tree: 668 files changed, +4435 −72 lines changed


README.md
+7 −4
@@ -33,17 +33,20 @@ Your browser does not support the video tag.
 
 # :fire: Updates
+- 2023/05/11 End-to-end VideoChat
+  - [VideoChat](./video_chat/): Instruction tuning for image & video chatting.
+
 - 2023/04/25 Watch videos longer than one minute with ChatGPT
-  - [VideoChat_LongVideo](https://github.com/OpenGVLab/Ask-Anything/tree/long_video_support/): Update langchain and whisper to the latest version.
+  - [VideoChat LongVideo](https://github.com/OpenGVLab/Ask-Anything/tree/long_video_support/): Incorporates langchain and whisper into VideoChat.
 
 - 2023/04/21 Chat with MOSS
-  - [video_chat_with_MOSS](./video_chat_with_MOSS/): Explicit communication with MOSS.
+  - [VideoChat with MOSS](./video_chat_with_MOSS/): Explicit communication with MOSS.
 
 - 2023/04/20: Chat with StableLM
-  - [video_chat_with_StableLM](./video_chat_with_StableLM/): Explicit communication with StableLM.
+  - [VideoChat with StableLM](./video_chat_with_StableLM/): Explicit communication with StableLM.
 
 - 2023/04/19: Code release & Online Demo
-  - [VideoChat](./video_chat/): Explicit communication with ChatGPT. Sensitive with time. [demo is avaliable!](https://ask.opengvlab.com)
+  - [VideoChat with ChatGPT](./video_chat_with_ChatGPT): Explicit communication with ChatGPT. Sensitive to time. [Demo is available!](https://ask.opengvlab.com)
   - [MiniGPT-4 for video](./video_miniGPT4/): Implicit communication with Vicuna. Not sensitive to time. (A simple extension of [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), which will be improved in the future.)

README_cn.md
+8 −3
@@ -21,16 +21,21 @@ https://user-images.githubusercontent.com/43169235/233814633-200df34b-7402-49b8-
 
 # :fire: Updates
+- 2023/05/11 End-to-end VideoChat
+  - [VideoChat](./video_chat/): An image and video chatbot built with instruction tuning
+
+- 2023/04/25 Watch videos longer than one minute with ChatGPT
+  - [VideoChat LongVideo](https://github.com/OpenGVLab/Ask-Anything/tree/long_video_support/): Uses langchain and whisper to handle long-form video information
 
 - 2023/04/21 Watch videos with MOSS
   - [video_chat_with_MOSS](./video_chat_with_MOSS/): Explicitly encodes videos for MOSS
 
 - 2023/04/20: Watch videos with StableLM
-  - [video_chat_with_StableLM](./video_chat_with_StableLM/): Explicitly encodes videos for StableLM
+  - [VideoChat with StableLM](./video_chat_with_StableLM/): Explicitly encodes videos for StableLM
 
 - 2023/04/19: Code release & online demo
-  - [VideoChat](./video_chat/): Explicitly encodes videos for ChatGPT; sensitive to temporal information. [Demo is available!](https://ask.opengvlab.com)
-  - [MiniGPT-4 for video](./video_miniGPT4/): Implicitly encodes videos for Vicuna; not sensitive to temporal information. (A simple extension of [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), to be improved in the future.)
+  - [VideoChat with ChatGPT](./video_chat_with_ChatGPT): Explicitly encodes videos for ChatGPT; sensitive to temporal information. [Demo is available!](https://ask.opengvlab.com)
+  - [MiniGPT-4 for video](./video_miniGPT4/): Implicitly encodes videos for Vicuna; not sensitive to temporal information. (A simple extension of [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), to be improved in the future.)
 
 
 # :speech_balloon: Example

video_chat/README.md
+116 −31
@@ -1,49 +1,134 @@
-# VideoChat
+# 🦜 VideoChat [[paper]()]
+
+![images](assert/framework.png)
+In this study, we initiate an exploration into video understanding by introducing VideoChat, an **end-to-end chat-centric video understanding system**. It integrates video foundation models and large language models via a learnable neural interface, excelling in **spatiotemporal reasoning, event localization, and causal relationship inference**. To instruction-tune this system, we propose a **video-centric instruction dataset** composed of thousands of videos paired with detailed descriptions and conversations. The dataset emphasizes **spatiotemporal reasoning and causal relationships**, providing a valuable asset for training chat-centric video understanding systems. Preliminary qualitative experiments reveal our system's potential across a broad spectrum of video applications and set a standard for future research.
 
-VideoChat is a multifunctional video question answering tool that combines the functions of Action Recognition, Visual Captioning and ChatGPT. Our solution generates dense, descriptive captions for any object and action in a video, offering a range of language styles to suit different user preferences. It supports conversations of different lengths, emotions, and degrees of authenticity.
-- Video-Text Generation
-- Chat about uploaded video
-- Interactive demo
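The "learnable neural interface" in the new abstract is worth picturing concretely: a frozen visual encoder feeds a trainable Q-Former-style module, and the resulting query tokens are projected into the frozen LLM's embedding space. Below is a minimal PyTorch-style sketch of that forward pass, assuming hypothetical module names and dimensions inferred from the abstract, the "BLIP+UniFormerV2+Vicuna" schedule item, and the config.json added later in this commit; it is not the repo's actual code.

```python
# Minimal sketch (illustrative, not the repo's actual code) of a chat-centric
# video model: frozen ViT -> trainable Q-Former interface -> linear projection
# into a frozen LLM's embedding space.
import torch
import torch.nn as nn

class VideoChatSketch(nn.Module):
    def __init__(self, vit, qformer, llm, q_dim=768, llm_dim=5120):
        super().__init__()
        self.vit = vit          # frozen visual encoder (e.g. an EVA ViT-g)
        self.qformer = qformer  # BLIP-2-style neural interface
        self.proj = nn.Linear(q_dim, llm_dim)  # query tokens -> LLM embeddings
        self.llm = llm          # frozen language model (e.g. a Vicuna variant)

    def forward(self, frames, prompt_embeds):
        # frames: (batch, time, 3, H, W); prompt_embeds: the embedded chat prompt
        b, t = frames.shape[:2]
        with torch.no_grad():                        # the backbone stays frozen
            feats = self.vit(frames.flatten(0, 1))   # (b*t, patches, vit_dim)
        feats = feats.view(b, t * feats.size(1), feats.size(2))
        queries = self.qformer(feats)                # (b, num_query_token, q_dim)
        visual_tokens = self.proj(queries)           # align with the LLM space
        inputs = torch.cat([visual_tokens, prompt_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)        # decode the answer
```

Exactly which pieces are trained is governed by flags such as `freeze_vit`, `freeze_qformer`, and `freeze_mhra` in the config.json added by this commit.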

 # :fire: Updates
+- **2023/05/11**: Released 🦜**VideoChat V1**, which can **handle both image and video understanding!**
+  - [Model](https://drive.google.com/file/d/1BqmWHWCZBPkhTNWDAq0IfGpbkKLz9C0V/view?usp=share_link) and [Data](https://github.com/OpenGVLab/InternVideo/blob/main/Data/instruction_data.md).
+  - 🧑‍💻 *Online demo is in preparation.*
+  - 🧑‍🔧 *Tuning scripts are being cleaned up.*
+
+# :hourglass_flowing_sand: Schedule
 
-- **2023/04/19**: Code Release
+- [x] Small-scale video instruction data and tuning
+- [x] Instruction tuning on BLIP+UniFormerV2+Vicuna
+- [ ] Large-scale and complex video instruction data
+- [ ] Instruction tuning on a strong video foundation model
+- [ ] User-friendly interactions with longer videos
+- [ ] ...

 # :speech_balloon: Example
 
-![images](assert/hugging.png)
-![images](assert/dancing.png)
-![images](assert/dancing2.png)
+<div align="center">
+<b>
+<font size="4">Comparison with ChatGPT, MiniGPT-4, LLaVA and mPLUG-Owl.</font>
+<br>
+<font size="4" color="red">Our VideoChat handles both image and video understanding well!</font>
+</b>
+</div>
+<div align="center">
+<img src="assert/comparison.png" width="90%">
+</div>
 
-# :running: Usage
+<div align="center">
+<font size="4">
+<a href="https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/papers/media/jesse_dance.mp4">[Video]</a> <b>Why is the video funny?</b>
+</font>
+</div>
+<div align="center">
+<img src="assert/humor.png" width="50%">
+</div>
+
+<div align="center">
+<font size="4">
+<a href="https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/papers/media/jp_dance.mp4">[Video]</a> <b>Spatial perception</b>
+</font>
+</div>
+<div align="center">
+<img src="assert/spatial.png" width="50%">
+</div>
+
+<div align="center">
+<font size="4">
+<a href="https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/papers/media/car_accident.mp4">[Video]</a> <b>Temporal perception</b>
+</font>
+</div>
+<div align="center">
+<img src="assert/temporal.png" width="50%">
+</div>
+
+<div align="center">
+<font size="4">
+<a href="https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/papers/media/idol_dancing.mp4">[Video]</a> <b>Multi-turn conversation</b>
+</font>
+</div>
+<div align="center">
+<img src="assert/multi_turn.png" width="50%">
+</div>
+
+<div align="center">
+<font size="4">
+<b>Image understanding</b>
+</font>
+</div>
+<div align="center">
+<img src="assert/image.png" width="100%">
+</div>
+
+# :running: Usage
 
-# Clone the repository:
-git clone https://github.com/OpenGVLab/Ask-Anything.git
-cd ask-anything/video_chat
+- Prepare the environment:
+  ```shell
+  pip install -r requirements.txt
+  ```
+
+- Download the [BLIP2](https://huggingface.co/docs/transformers/main/model_doc/blip-2) model:
+  - ViT: `wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth`
+  - QFormer: `wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth`
+  - Set `vit_model_path` and `q_former_model_path` in [config.json](./configs/config.json).
+
+- Download the [StableVicuna](https://huggingface.co/CarperAI/stable-vicuna-13b-delta) model:
+  - LLAMA: download it from the [original repo](https://github.com/facebookresearch/llama) or [Hugging Face](https://huggingface.co/decapoda-research/llama-13b-hf).
+  - If you download LLAMA from the original repo, convert it to the Hugging Face format first:
+  ```shell
+  # convert_llama_weights_to_hf is copied from transformers
+  python src/transformers/models/llama/convert_llama_weights_to_hf.py \
+      --input_dir /path/to/downloaded/llama/weights \
+      --model_size 7B --output_dir /output/path
+  ```
+  - Download [StableVicuna-13b-delta](https://huggingface.co/CarperAI/stable-vicuna-13b-delta) and apply the delta (see the conceptual sketch after this section):
+  ```shell
+  # fastchat v0.1.10
+  python3 apply_delta.py \
+      --base /path/to/model_weights/llama-13b \
+      --target stable-vicuna-13b \
+      --delta CarperAI/stable-vicuna-13b-delta
+  ```
+  - Set `llama_model_path` in [config.json](./configs/config.json).
+
+- Download the [VideoChat](https://drive.google.com/file/d/1BqmWHWCZBPkhTNWDAq0IfGpbkKLz9C0V/view?usp=share_link) model:
+  - Set `videochat_model_path` in [config.json](./configs/config.json).
+
+- Run the demo with Gradio:
+  ```shell
+  python demo.py
+  ```
+
+- Another demo in a Jupyter Notebook can be found in [demo.ipynb](demo.ipynb).
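The StableVicuna step above ships only a delta against LLaMA-13B (the base weights cannot be redistributed directly), and `apply_delta.py` reconstructs the full checkpoint. Conceptually, the merge is element-wise addition over the state dict. A hedged sketch of the idea, not FastChat's actual implementation:

```python
# Conceptual sketch of a delta-weight merge: the published "delta" checkpoint
# stores (target - base) for each tensor, so adding it back onto the base
# LLaMA weights recovers the StableVicuna weights. Illustrative only; use
# FastChat's apply_delta.py for the real merge.
import torch

def merge_delta(base_state: dict, delta_state: dict) -> dict:
    merged = {}
    for name, delta in delta_state.items():
        base = base_state.get(name)
        if base is not None and base.shape == delta.shape:
            merged[name] = base + delta  # element-wise: base + (target - base)
        else:
            # Real tools also handle shape changes (e.g. an embedding matrix
            # resized for added chat tokens); this sketch just copies them.
            merged[name] = delta
    return merged
```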
 
-# Install dependencies:
-pip install -r requirements.txt
-pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz
-python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
 
-# Download the checkpoints
-wget https://huggingface.co/spaces/xinyu1205/Tag2Text/resolve/main/tag2text_swin_14m.pth ./pretrained_models/tag2text_swin_14m.pth
-wget https://datarelease.blob.core.windows.net/grit/models/grit_b_densecap_objectdet.pth ./pretrained_models/grit_b_densecap_objectdet.pth
-git clone https://huggingface.co/mrm8488/flan-t5-large-finetuned-openai-summarize_from_feedback ./pretrained_models/flan-t5-large-finetuned-openai-summarize_from_feedback
+# :page_facing_up: Citation
 
-# Configure the necessary ChatGPT APIs
-export OPENAI_API_KEY={Your_Private_Openai_Key}
+If you find this project useful in your research, please consider citing:
+```BibTeX
 
-# Run the VideoChat gradio demo.
-python app.py
 ```

-# Acknowledgement
+# :thumbsup: Acknowledgement
 
-The project is based on [InternVideo](https://github.com/OpenGVLab/InternVideo), [Tag2Text](https://github.com/xinyu1205/Tag2Text), [GRiT](https://github.com/JialianW/GRiT), [mrm8488](https://huggingface.co/mrm8488/flan-t5-large-finetuned-openai-summarize_from_feedback) and [ChatGPT](https://openai.com/blog/chatgpt). Thanks to the authors for their efforts.
+Thanks to the following open-source projects:
 
+[InternVideo](https://github.com/OpenGVLab/InternVideo), [UniFormerV2](https://github.com/OpenGVLab/UniFormerV2), [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLaVA](https://github.com/haotian-liu/LLaVA), [BLIP2](https://huggingface.co/docs/transformers/main/model_doc/blip-2), [StableLM](https://github.com/Stability-AI/StableLM).
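As a companion to the `python demo.py` step in the Usage section above, here is a hedged sketch of what a minimal Gradio video-chat loop looks like. The `answer` callback is a placeholder; the repo's actual demo.py drives the VideoChat model instead.

```python
# Hypothetical miniature of a Gradio video-chat demo (placeholder logic,
# not the repo's demo.py): upload a video, type a question, get a reply.
import gradio as gr

def answer(video_path, question, history):
    # A real demo would embed the video and query the VideoChat model here.
    reply = f"(stub) You asked about {video_path!r}: {question}"
    return history + [(question, reply)], ""

with gr.Blocks() as demo:
    video = gr.Video(label="Upload a video")
    chatbot = gr.Chatbot(label="VideoChat")
    question = gr.Textbox(label="Ask about the video")
    question.submit(answer, [video, question, chatbot], [chatbot, question])

if __name__ == "__main__":
    demo.launch()
```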

video_chat/assert/comparison.png (153 KB)
video_chat/assert/framework.png (47.9 KB)
video_chat/assert/humor.png (94.3 KB)
video_chat/assert/image.png (127 KB)
video_chat/assert/multi_turn.png (69.2 KB)
video_chat/assert/spatial.png (39.1 KB)
video_chat/assert/temporal.png (41.1 KB)

video_chat/configs/config.json
+28

@@ -0,0 +1,28 @@
+{
+  "model": {
+    "vit_model": "eva_clip_g",
+    "vit_model_path": "model/eva_vit_g.pth",
+    "q_former_model_path": "model/blip2_pretrained_flant5xxl.pth",
+    "llama_model_path": "model/stable-vicuna-13b",
+    "videochat_model_path": "model/videochat.pth",
+    "img_size": 224,
+    "num_query_token": 32,
+    "drop_path_rate": 0.0,
+    "use_grad_checkpoint": false,
+    "vit_precision": "fp32",
+    "freeze_vit": true,
+    "freeze_mhra": false,
+    "freeze_qformer": true,
+    "low_resource": false,
+    "max_txt_len": 320,
+    "temporal_downsample": false,
+    "no_lmhra": true,
+    "double_lmhra": false,
+    "lmhra_reduction": 2.0,
+    "gmhra_layers": 8,
+    "gmhra_drop_path_rate": 0.0,
+    "gmhra_dropout": 0.5,
+    "extra_num_query_token": 64
+  },
+  "device": "cuda"
+}
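Several Usage steps in the README above amount to "download a checkpoint and point a config key at it", so failing fast on bad paths saves confusing model-load errors later. A minimal sketch of reading this config and sanity-checking it (a hypothetical helper, not the repo's actual loader):

```python
# Hypothetical startup helper (not the repo's loader): read config.json and
# verify that every weight path it references actually exists on disk.
import json
from pathlib import Path

def load_config(path="configs/config.json"):
    with open(path) as f:
        cfg = json.load(f)
    model_cfg = cfg["model"]
    for key in ("vit_model_path", "q_former_model_path",
                "llama_model_path", "videochat_model_path"):
        if not Path(model_cfg[key]).exists():
            raise FileNotFoundError(f"{key} -> {model_cfg[key]} is missing")
    return cfg

if __name__ == "__main__":
    cfg = load_config()
    print(cfg["device"], cfg["model"]["num_query_token"])
```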
