# 🦜 VideoChat [[paper]()]

In this study, we initiate an exploration into video understanding by introducing VideoChat, an **end-to-end chat-centric video understanding system**. It integrates video foundation models and large language models via a learnable neural interface, excelling in **spatiotemporal reasoning, event localization, and causal relationship inference**. To instruction-tune this system, we propose a **video-centric instruction dataset** composed of thousands of videos paired with detailed descriptions and conversations. The dataset emphasizes **spatiotemporal reasoning and causal relationships**, providing a valuable asset for training chat-centric video understanding systems. Preliminary qualitative experiments reveal our system's potential across a broad spectrum of video applications and set a standard for future research.
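
At a high level, the learnable interface condenses frozen video features into a handful of tokens the language model can read alongside the user's prompt. The snippet below is only a conceptual sketch of that data flow, assuming a Q-Former-style cross-attention interface; the class name, dimensions, and layer choices are illustrative assumptions rather than the code in this repository.

```python
import torch
import torch.nn as nn

class VideoLanguageInterface(nn.Module):
    """Illustrative sketch: learnable query tokens cross-attend to frozen video
    features, then get projected into the language model's embedding space."""

    def __init__(self, vid_dim=1408, llm_dim=5120, num_query_tokens=32):
        super().__init__()
        self.query_tokens = nn.Parameter(torch.randn(1, num_query_tokens, vid_dim))
        self.cross_attn = nn.MultiheadAttention(vid_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vid_dim, llm_dim)  # map interface output to LLM token embeddings

    def forward(self, video_features):
        # video_features: (batch, num_frame_patches, vid_dim) from a frozen video encoder
        queries = self.query_tokens.expand(video_features.size(0), -1, -1)
        attended, _ = self.cross_attn(queries, video_features, video_features)
        return self.proj(attended)  # (batch, num_query_tokens, llm_dim), prepended to the chat prompt

# Toy usage with random features standing in for real encoder output.
if __name__ == "__main__":
    fake_video = torch.randn(1, 256, 1408)
    print(VideoLanguageInterface()(fake_video).shape)  # torch.Size([1, 32, 5120])
```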

# :fire: Updates

- **2023/05/11**: Release 🦜**VideoChat V1**, which can **handle both image and video understanding!**
  - [Model](https://drive.google.com/file/d/1BqmWHWCZBPkhTNWDAq0IfGpbkKLz9C0V/view?usp=share_link) and [Data](https://github.com/OpenGVLab/InternVideo/blob/main/Data/instruction_data.md).
  - 🧑‍💻 *Online demo is in preparation*.
  - 🧑‍🔧 *Tuning scripts are being cleaned up*.

# :hourglass_flowing_sand: Schedule

- [x] Small-scale video instruction data and tuning
- [x] Instruction tuning on BLIP+UniFormerV2+Vicuna
- [ ] Large-scale and complex video instruction data
- [ ] Instruction tuning on a strong video foundation model
- [ ] User-friendly interactions with longer videos
- [ ] ...

# :speech_balloon: Example

<div align="center">
<b>
  <font size="4">Comparison with ChatGPT, MiniGPT-4, LLaVA and mPLUG-Owl. </font>
  <br>
  <font size="4" color="red">Our VideoChat can handle both image and video understanding well!</font>
</b>
</div>
<div align="center">
<img src="assert/comparison.png" width="90%">
</div>

<div align="center">
  <font size="4">
    <a href="https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/papers/media/jesse_dance.mp4">[Video]</a> <b>Why is the video funny?</b>
  </font>
</div>
<div align="center">
<img src="assert/humor.png" width="50%">
</div>

<div align="center">
  <font size="4">
    <a href="https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/papers/media/jp_dance.mp4">[Video]</a> <b>Spatial perception</b>
  </font>
</div>
<div align="center">
<img src="assert/spatial.png" width="50%">
</div>

<div align="center">
  <font size="4">
    <a href="https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/papers/media/car_accident.mp4">[Video]</a> <b>Temporal perception</b>
  </font>
</div>
<div align="center">
<img src="assert/temporal.png" width="50%">
</div>

<div align="center">
  <font size="4">
    <a href="https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/papers/media/idol_dancing.mp4">[Video]</a> <b>Multi-turn conversation</b>
  </font>
</div>
<div align="center">
<img src="assert/multi_turn.png" width="50%">
</div>

<div align="center">
  <font size="4">
    <b>Image understanding</b>
  </font>
</div>
<div align="center">
<img src="assert/image.png" width="100%">
</div>

# :running: Usage

- Prepare the environment:
  ```shell
  pip install -r requirements.txt
  ```

- Download the [BLIP2](https://huggingface.co/docs/transformers/main/model_doc/blip-2) models:
  - ViT: `wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth`
  - QFormer: `wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth`
  - Change `vit_model_path` and `q_former_model_path` in [config.json](./configs/config.json) (a config sanity-check sketch follows this list).

- Download the [StableVicuna](https://huggingface.co/CarperAI/stable-vicuna-13b-delta) model:
  - LLaMA: download it from the [original repo](https://github.com/facebookresearch/llama) or [Hugging Face](https://huggingface.co/decapoda-research/llama-13b-hf).
    - If you download LLaMA from the original repo, convert it to the Hugging Face format with the following command:
      ```shell
      # convert_llama_weights_to_hf.py is copied from transformers
      python src/transformers/models/llama/convert_llama_weights_to_hf.py \
          --input_dir /path/to/downloaded/llama/weights \
          --model_size 13B --output_dir /output/path
      ```
  - Download [stable-vicuna-13b-delta](https://huggingface.co/CarperAI/stable-vicuna-13b-delta) and apply the delta:
    ```shell
    # fastchat v0.1.10
    python3 apply_delta.py \
        --base /path/to/model_weights/llama-13b \
        --target stable-vicuna-13b \
        --delta CarperAI/stable-vicuna-13b-delta
    ```
  - Change `llama_model_path` in [config.json](./configs/config.json).

- Download the [VideoChat](https://drive.google.com/file/d/1BqmWHWCZBPkhTNWDAq0IfGpbkKLz9C0V/view?usp=share_link) model:
  - Change `ckpt` in [config.json](./configs/config.json).

- Run the demo with Gradio:
  ```shell
  python demo.py
  ```

- Another demo is available as a Jupyter Notebook in [demo.ipynb](demo.ipynb).
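
Since three of the steps above edit [config.json](./configs/config.json), a quick sanity check before launching the demo can save a failed model load. The snippet below is a minimal sketch rather than part of the repository: it assumes only the field names mentioned above (`vit_model_path`, `q_former_model_path`, `llama_model_path`, `ckpt`) and reports whether each referenced path exists, searching nested sections in case your config groups them differently.

```python
import json
from pathlib import Path

CONFIG = "configs/config.json"
# Field names taken from the steps above; the values may be files or directories.
FIELDS = ["vit_model_path", "q_former_model_path", "llama_model_path", "ckpt"]

def find(node, key):
    """Depth-first search for `key` in nested dicts/lists; returns None if absent."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        value = find(child, key)
        if value is not None:
            return value
    return None

with open(CONFIG) as f:
    cfg = json.load(f)

for key in FIELDS:
    path = find(cfg, key)
    if path is None:
        print(f"[warn] {key} is not set in {CONFIG}")
    elif not Path(str(path)).exists():
        print(f"[warn] {key} points to a missing path: {path}")
    else:
        print(f"[ok]   {key}: {path}")
```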

# :page_facing_up: Citation

If you find this project useful in your research, please consider citing:
```BibTeX

```

# :thumbsup: Acknowledgement

Thanks to the open-source work of the following projects:

[InternVideo](https://github.com/OpenGVLab/InternVideo), [UniFormerV2](https://github.com/OpenGVLab/UniFormerV2), [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLaVA](https://github.com/haotian-liu/LLaVA), [BLIP2](https://huggingface.co/docs/transformers/main/model_doc/blip-2), [StableLM](https://github.com/Stability-AI/StableLM).