# 🦜 VideoChat [[paper]()]

In this study, we initiate an exploration into video understanding by introducing VideoChat, an **end-to-end chat-centric video understanding system**. It integrates video foundation models and large language models via a learnable neural interface, excelling in **spatiotemporal reasoning, event localization, and causal relationship inference**. To instruction-tune this system, we propose a **video-centric instruction dataset** composed of thousands of videos paired with detailed descriptions and conversations. The dataset emphasizes **spatiotemporal reasoning and causal relationships**, providing a valuable asset for training chat-centric video understanding systems. Preliminary qualitative experiments reveal our system's potential across a broad spectrum of video applications and set a standard for future research.
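
At a high level, the learnable interface condenses frozen video features into a handful of tokens the language model can read alongside the user's prompt. The snippet below is only a conceptual sketch of that data flow, assuming a Q-Former-style cross-attention interface; the class name, dimensions, and layer choices are illustrative assumptions rather than the code in this repository.

```python
import torch
import torch.nn as nn

class VideoLanguageInterface(nn.Module):
    """Illustrative sketch: learnable query tokens cross-attend to frozen video
    features, then get projected into the language model's embedding space."""

    def __init__(self, vid_dim=1408, llm_dim=5120, num_query_tokens=32):
        super().__init__()
        self.query_tokens = nn.Parameter(torch.randn(1, num_query_tokens, vid_dim))
        self.cross_attn = nn.MultiheadAttention(vid_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vid_dim, llm_dim)  # map interface output to LLM token embeddings

    def forward(self, video_features):
        # video_features: (batch, num_frame_patches, vid_dim) from a frozen video encoder
        queries = self.query_tokens.expand(video_features.size(0), -1, -1)
        attended, _ = self.cross_attn(queries, video_features, video_features)
        return self.proj(attended)  # (batch, num_query_tokens, llm_dim), prepended to the chat prompt

# Toy usage with random features standing in for real encoder output.
if __name__ == "__main__":
    fake_video = torch.randn(1, 256, 1408)
    print(VideoLanguageInterface()(fake_video).shape)  # torch.Size([1, 32, 5120])
```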

# :fire: Updates

- **2023/05/11**: Release 🦜**VideoChat V1**, which can **handle both image and video understanding!**
  - [Model](https://drive.google.com/file/d/1BqmWHWCZBPkhTNWDAq0IfGpbkKLz9C0V/view?usp=share_link) and [Data](https://github.com/OpenGVLab/InternVideo/blob/main/Data/instruction_data.md).
  - 🧑‍💻 *Online demo is in preparation*.
  - 🧑‍🔧 *Tuning scripts are being cleaned up*.

# :hourglass_flowing_sand: Schedule

- [x] Small-scale video instruction data and tuning
- [x] Instruction tuning on BLIP+UniFormerV2+Vicuna
- [ ] Large-scale and complex video instruction data
- [ ] Instruction tuning on a strong video foundation model
- [ ] User-friendly interactions with longer videos
- [ ] ...

# :speech_balloon: Example

<div align="center">
<b>
  <font size="4">Comparison with ChatGPT, MiniGPT-4, LLaVA and mPLUG-Owl. </font>
  <br>
  <font size="4" color="red">Our VideoChat can handle both image and video understanding well!</font>
</b>
</div>
<div align="center">
<img src="assert/comparison.png" width="90%">
</div>

<div align="center">
  <font size="4">
    <a href="https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/papers/media/jesse_dance.mp4">[Video]</a> <b>Why is the video funny?</b>
  </font>
</div>
<div align="center">
<img src="assert/humor.png" width="50%">
</div>

<div align="center">
  <font size="4">
    <a href="https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/papers/media/jp_dance.mp4">[Video]</a> <b>Spatial perception</b>
  </font>
</div>
<div align="center">
<img src="assert/spatial.png" width="50%">
</div>

<div align="center">
  <font size="4">
    <a href="https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/papers/media/car_accident.mp4">[Video]</a> <b>Temporal perception</b>
  </font>
</div>
<div align="center">
<img src="assert/temporal.png" width="50%">
</div>

<div align="center">
  <font size="4">
    <a href="https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/papers/media/idol_dancing.mp4">[Video]</a> <b>Multi-turn conversation</b>
  </font>
</div>
<div align="center">
<img src="assert/multi_turn.png" width="50%">
</div>

<div align="center">
  <font size="4">
    <b>Image understanding</b>
  </font>
</div>
<div align="center">
<img src="assert/image.png" width="100%">
</div>

# :running: Usage

- Prepare the environment:
  ```shell
  pip install -r requirements.txt
  ```

- Download the [BLIP2](https://huggingface.co/docs/transformers/main/model_doc/blip-2) models:
  - ViT: `wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth`
  - QFormer: `wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth`
  - Change `vit_model_path` and `q_former_model_path` in [config.json](./configs/config.json) (a config sanity-check sketch follows this list).

- Download the [StableVicuna](https://huggingface.co/CarperAI/stable-vicuna-13b-delta) model:
  - LLaMA: download it from the [original repo](https://github.com/facebookresearch/llama) or [Hugging Face](https://huggingface.co/decapoda-research/llama-13b-hf).
    - If you download LLaMA from the original repo, convert it to the Hugging Face format with the following command:
      ```shell
      # convert_llama_weights_to_hf.py is copied from transformers
      python src/transformers/models/llama/convert_llama_weights_to_hf.py \
          --input_dir /path/to/downloaded/llama/weights \
          --model_size 13B --output_dir /output/path
      ```
  - Download [stable-vicuna-13b-delta](https://huggingface.co/CarperAI/stable-vicuna-13b-delta) and apply the delta:
    ```shell
    # fastchat v0.1.10
    python3 apply_delta.py \
        --base /path/to/model_weights/llama-13b \
        --target stable-vicuna-13b \
        --delta CarperAI/stable-vicuna-13b-delta
    ```
  - Change `llama_model_path` in [config.json](./configs/config.json).

- Download the [VideoChat](https://drive.google.com/file/d/1BqmWHWCZBPkhTNWDAq0IfGpbkKLz9C0V/view?usp=share_link) model:
  - Change `ckpt` in [config.json](./configs/config.json).

- Run the demo with Gradio:
  ```shell
  python demo.py
  ```

- Another demo is available as a Jupyter Notebook in [demo.ipynb](demo.ipynb).
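
Since three of the steps above edit [config.json](./configs/config.json), a quick sanity check before launching the demo can save a failed model load. The snippet below is a minimal sketch rather than part of the repository: it assumes only the field names mentioned above (`vit_model_path`, `q_former_model_path`, `llama_model_path`, `ckpt`) and reports whether each referenced path exists, searching nested sections in case your config groups them differently.

```python
import json
from pathlib import Path

CONFIG = "configs/config.json"
# Field names taken from the steps above; the values may be files or directories.
FIELDS = ["vit_model_path", "q_former_model_path", "llama_model_path", "ckpt"]

def find(node, key):
    """Depth-first search for `key` in nested dicts/lists; returns None if absent."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        value = find(child, key)
        if value is not None:
            return value
    return None

with open(CONFIG) as f:
    cfg = json.load(f)

for key in FIELDS:
    path = find(cfg, key)
    if path is None:
        print(f"[warn] {key} is not set in {CONFIG}")
    elif not Path(str(path)).exists():
        print(f"[warn] {key} points to a missing path: {path}")
    else:
        print(f"[ok]   {key}: {path}")
```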

# :page_facing_up: Citation

If you find this project useful in your research, please consider citing:
```BibTeX

```

# :thumbsup: Acknowledgement

Thanks to the open-source work of the following projects:

[InternVideo](https://github.com/OpenGVLab/InternVideo), [UniFormerV2](https://github.com/OpenGVLab/UniFormerV2), [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLaVA](https://github.com/haotian-liu/LLaVA), [BLIP2](https://huggingface.co/docs/transformers/main/model_doc/blip-2), [StableLM](https://github.com/Stability-AI/StableLM).