
Commit 75e1a02

committed: init
1 parent 641d4e4 commit 75e1a02

File tree: 637 files changed (+86748 −1 lines)


.gitignore

+6
@@ -127,3 +127,9 @@ dmypy.json
 
 # Pyre type checker
 .pyre/
+
+# macOS
+.DS_Store
+
+# personal
+demo.ipynb

README.md

+29 −1
@@ -1,2 +1,30 @@
 # Ask-Anything
-a simple yet interesting tool for chatting with video
+
+Currently, Ask-Anything is a simple yet interesting tool for chatting with video.
+Our team is now trying to build a smart and robust chatbot for video understanding.
+
+# :fire: Updates
+- 2023/04/19: Code release
+  - [VideoChat](./video_chat/): Explicit communication with ChatGPT. Sensitive to time.
+  - [MiniGPT-4 for video](./video_miniGPT4/): Implicit communication with Vicuna. Not sensitive to time. (A simple extension of [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), which will be improved in the future.)
+
+# :speech_balloon: Example
+
+# :hourglass_flowing_sand: Ongoing
+
+Our team mainly focuses on general video understanding and long-term video reasoning:
+
+- [ ] Strong video foundation model.
+- [ ] Large-scale and high-quality video-text dataset.
+- [ ] Large-scale long-term video reasoning benchmark.
+- [ ] Short-term video-language system with LLMs.
+- [ ] Long-term video-language system with LLMs.
+- [ ] Artificial Intelligence Generated Content (AIGC) for video.
+- [ ] ...
+
+We are hiring researchers, engineers and interns in **General Vision Group, Shanghai AI Lab**. If you are interested in working with us, please contact [Yi Wang](https://shepnerd.github.io/) (`[email protected]`).

example/hitting_baseball.mp4

671 KB
Binary file not shown.

example/yoga.mp4

758 KB
Binary file not shown.

video_chat/README.md

+43
@@ -0,0 +1,43 @@

# VideoChat

VideoChat is a multifunctional video question answering tool that combines the functions of Action Recognition, Visual Captioning and ChatGPT. Our solution generates dense, descriptive captions for any object and action in a video, offering a range of language styles to suit different user preferences. It supports conversations that vary in length, emotion, and authenticity of language.
- Video-Text Generation
- Chat about an uploaded video
- Interactive demo

# :fire: Updates

- **2023/04/19**: Code Release

# :speech_balloon: Example

![images](assert/hugging.png)
![images](assert/dancing.png)
![images](assert/dancing2.png)

# :running: Usage

```shell
# Clone the repository:
git clone ask-anything.git
cd ask-anything/video_chat

# Install dependencies:
pip install -r requirements.txt

# Download the checkpoints into ./pretrained_models/
wget -P ./pretrained_models/ https://huggingface.co/spaces/xinyu1205/Tag2Text/resolve/main/tag2text_swin_14m.pth
wget -P ./pretrained_models/ https://datarelease.blob.core.windows.net/grit/models/grit_b_densecap_objectdet.pth
git clone https://huggingface.co/mrm8488/flan-t5-large-finetuned-openai-summarize_from_feedback pretrained_models/flan-t5-large-finetuned-openai-summarize_from_feedback

# Configure the necessary ChatGPT API key
export OPENAI_API_KEY={Your_Private_Openai_Key}

# Run the VideoChat gradio demo.
python app.py
```

# Acknowledgement

The project is based on [InternVideo](https://github.com/OpenGVLab/InternVideo), [Tag2Text](https://github.com/xinyu1205/Tag2Text), [GRiT](https://github.com/JialianW/GRiT), [mrm8488](https://huggingface.co/mrm8488/flan-t5-large-finetuned-openai-summarize_from_feedback) and [ChatGPT](https://openai.com/blog/chatgpt). Thanks to the authors for their efforts.
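The download step above fetches two PyTorch checkpoints plus a Hugging Face model directory. A minimal sanity check before launching the demo might look like the sketch below; it only assumes the files deserialize with torch, not any particular key layout.

```python
import os
import torch

# Hedged sketch: confirm the downloaded checkpoints load on CPU before app.py
# builds the full models (run from within video_chat/).
for name in ["tag2text_swin_14m.pth", "grit_b_densecap_objectdet.pth"]:
    path = os.path.join("pretrained_models", name)
    ckpt = torch.load(path, map_location="cpu")
    print(name, type(ckpt), f"{os.path.getsize(path) / 1e6:.0f} MB")
```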

video_chat/app.py

+160
@@ -0,0 +1,160 @@

import os
import numpy as np
import random
import torch
import torchvision.transforms as transforms
from PIL import Image
from models.tag2text import tag2text_caption
from util import *
import gradio as gr
from chatbot import *
from load_internvideo import *
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
from simplet5 import SimpleT5
from models.grit_model import DenseCaptioning

bot = ConversationBot()
image_size = 384
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
transform = transforms.Compose([transforms.ToPILImage(), transforms.Resize((image_size, image_size)), transforms.ToTensor(), normalize])


# define model
model = tag2text_caption(pretrained="pretrained_models/tag2text_swin_14m.pth", image_size=image_size, vit='swin_b')
model.eval()
model = model.to(device)
print("[INFO] initialize caption model success!")

model_T5 = SimpleT5()
if torch.cuda.is_available():
    model_T5.load_model(
        "t5", "./pretrained_models/flan-t5-large-finetuned-openai-summarize_from_feedback", use_gpu=True)
else:
    model_T5.load_model(
        "t5", "./pretrained_models/flan-t5-large-finetuned-openai-summarize_from_feedback", use_gpu=False)
print("[INFO] initialize summarize model success!")

# action recognition
intern_action = load_intern_action(device)
trans_action = transform_action()
topil = T.ToPILImage()
print("[INFO] initialize InternVideo model success!")

dense_caption_model = DenseCaptioning(device)
dense_caption_model.initialize_model()
print("[INFO] initialize dense caption model success!")


def inference(video_path, input_tag, progress=gr.Progress()):
    data = loadvideo_decord_origin(video_path)
    progress(0.2, desc="Loading Videos")

    # InternVideo action recognition on 8 evenly spaced frames
    action_index = np.linspace(0, len(data) - 1, 8).astype(int)
    tmp, tmpa = [], []
    for i, img in enumerate(data):
        tmp.append(transform(img).to(device).unsqueeze(0))
        if i in action_index:
            tmpa.append(topil(img))
    action_tensor = trans_action(tmpa)
    TC, H, W = action_tensor.shape
    action_tensor = action_tensor.reshape(1, TC // 3, 3, H, W).permute(0, 2, 1, 3, 4).to(device)
    prediction = intern_action(action_tensor)
    prediction = F.softmax(prediction, dim=1).flatten()
    prediction = kinetics_classnames[str(int(prediction.argmax()))]

    # dense caption on every 5th frame
    dense_caption = []
    dense_index = np.arange(0, len(data) - 1, 5)
    original_images = data[dense_index, :, :, ::-1]
    for original_image in original_images:
        dense_caption.append(dense_caption_model.run_caption_tensor(original_image))
    dense_caption = ' '.join([f"Second {i+1} : {j}.\n" for i, j in zip(dense_index, dense_caption)])

    # Video Caption
    image = torch.cat(tmp).to(device)

    model.threshold = 0.68
    if input_tag == '' or input_tag == 'none' or input_tag == 'None':
        input_tag_list = None
    else:
        input_tag_list = []
        input_tag_list.append(input_tag.replace(',', ' | '))
    with torch.no_grad():
        caption, tag_predict = model.generate(image, tag_input=input_tag_list, max_length=50, return_tag_predict=True)
        progress(0.6, desc="Watching Videos")
        frame_caption = ' '.join([f"Second {i+1}:{j}.\n" for i, j in enumerate(caption)])
        if input_tag_list == None:
            tag_1 = set(tag_predict)
            tag_2 = ['none']
        else:
            _, tag_1 = model.generate(image, tag_input=None, max_length=50, return_tag_predict=True)
            tag_2 = set(tag_predict)
        progress(0.8, desc="Understanding Videos")
    synth_caption = model_T5.predict('. '.join(caption))
    print(frame_caption, dense_caption, synth_caption)
    # Tuple order matches the outputs of caption.click below: model tags, user tags,
    # per-second captions, dense captions, T5 summary, re-enabled chat button, action label.
    return ' | '.join(tag_1), ' | '.join(tag_2), frame_caption, dense_caption, synth_caption[0], gr.update(interactive=True), prediction


with gr.Blocks(css="#chatbot {overflow:auto; height:500px;}") as demo:
    gr.Markdown("<h1><center>Ask Anything with GPT</center></h1>")
    gr.Markdown(
        """
        Ask-Anything is a multifunctional video question answering tool that combines the functions of Action Recognition, Visual Captioning and ChatGPT. Our solution generates dense, descriptive captions for any object and action in a video, offering a range of language styles to suit different user preferences. It supports conversations that vary in length, emotion, and authenticity of language.<br>
        """
    )

    with gr.Row():
        with gr.Column():
            input_video_path = gr.inputs.Video(label="Input Video")
            input_tag = gr.Textbox(lines=1, label="User Prompt (Optional, Enter with commas)", visible=False)

            with gr.Row():
                with gr.Column(scale=0.3, min_width=0):
                    caption = gr.Button("✍ Upload")
                    chat_video = gr.Button(" 🎥 Let's Chat! ", interactive=False)
                with gr.Column(scale=0.7, min_width=0):
                    loadinglabel = gr.Label(label="State")
        with gr.Column():
            openai_api_key_textbox = gr.Textbox(
                value=os.environ["OPENAI_API_KEY"],
                placeholder="Paste your OpenAI API key here to start (sk-...)",
                show_label=False,
                lines=1,
                type="password",
            )
            chatbot = gr.Chatbot(elem_id="chatbot", label="gpt")
            state = gr.State([])
            user_tag_output = gr.State("")
            image_caption_output = gr.State("")
            video_caption_output = gr.State("")
            model_tag_output = gr.State("")
            dense_caption_output = gr.State("")
            with gr.Row(visible=False) as input_raws:
                with gr.Column(scale=0.8):
                    txt = gr.Textbox(show_label=False, placeholder="Enter text and press enter").style(container=False)
                with gr.Column(scale=0.10, min_width=0):
                    run = gr.Button("🏃‍♂️Run")
                with gr.Column(scale=0.10, min_width=0):
                    clear = gr.Button("🔄Clear️")

    caption.click(bot.memory.clear)
    caption.click(lambda: gr.update(interactive=False), None, chat_video)
    caption.click(lambda: [], None, chatbot)
    caption.click(lambda: [], None, state)
    caption.click(inference, [input_video_path, input_tag], [model_tag_output, user_tag_output, image_caption_output, dense_caption_output, video_caption_output, chat_video, loadinglabel])

    chat_video.click(bot.init_agent, [openai_api_key_textbox, image_caption_output, dense_caption_output, video_caption_output, model_tag_output, state], [input_raws, chatbot, state])

    txt.submit(bot.run_text, [txt, state], [chatbot, state])
    txt.submit(lambda: "", None, txt)
    run.click(bot.run_text, [txt, state], [chatbot, state])
    run.click(lambda: "", None, txt)

    clear.click(bot.memory.clear)
    clear.click(lambda: [], None, chatbot)
    clear.click(lambda: [], None, state)

demo.launch(server_name="0.0.0.0", enable_queue=True)  # share=True
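To make the per-second wording in the captions concrete, below is a small sketch of the two sampling schedules used in `inference()`. The 32-frame clip length is an arbitrary example, and treating one decoded frame as one second follows the prompt's wording rather than anything verified inside `loadvideo_decord_origin`.

```python
import numpy as np

num_frames = 32  # example clip length (assumption)
action_index = np.linspace(0, num_frames - 1, 8).astype(int)  # 8 evenly spaced frames for InternVideo
dense_index = np.arange(0, num_frames - 1, 5)                 # every 5th frame for dense captioning
print(action_index)  # [ 0  4  8 13 17 22 26 31]
print(dense_index)   # [ 0  5 10 15 20 25 30]
```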

video_chat/assert/dancing.png

502 KB

video_chat/assert/dancing2.png

503 KB

video_chat/assert/hugging.png

1.13 MB

video_chat/chatbot.py

+84
@@ -0,0 +1,84 @@

from langchain.agents.initialize import initialize_agent
from langchain.agents.tools import Tool
from langchain.chains.conversation.memory import ConversationBufferMemory
from langchain.llms.openai import OpenAI
import re
import gradio as gr
import openai


def cut_dialogue_history(history_memory, keep_last_n_words=400):
    # Trim the chat memory to roughly the last `keep_last_n_words` words,
    # dropping whole lines from the front of the buffer.
    if history_memory is None or len(history_memory) == 0:
        return history_memory
    tokens = history_memory.split()
    n_tokens = len(tokens)
    print(f"history_memory:{history_memory}, n_tokens: {n_tokens}")
    if n_tokens < keep_last_n_words:
        return history_memory
    paragraphs = history_memory.split('\n')
    last_n_tokens = n_tokens
    while last_n_tokens >= keep_last_n_words:
        last_n_tokens -= len(paragraphs[0].split(' '))
        paragraphs = paragraphs[1:]
    return '\n' + '\n'.join(paragraphs)


class ConversationBot:
    def __init__(self):
        self.memory = ConversationBufferMemory(memory_key="chat_history", output_key='output')
        self.tools = []

    def run_text(self, text, state):
        self.agent.memory.buffer = cut_dialogue_history(self.agent.memory.buffer, keep_last_n_words=500)
        res = self.agent({"input": text.strip()})
        res['output'] = res['output'].replace("\\", "/")
        response = res['output']
        state = state + [(text, response)]
        print(f"\nProcessed run_text, Input text: {text}\nCurrent state: {state}\n"
              f"Current Memory: {self.agent.memory.buffer}")
        return state, state

    def init_agent(self, openai_api_key, image_caption, dense_caption, video_caption, tags, state):
        # Build a conversational agent whose prompt embeds the per-second captions,
        # dense captions, tags and summary produced by app.py.
        chat_history = ''
        PREFIX = "ChatVideo is a chatbot that chats with you based on video descriptions."
        FORMAT_INSTRUCTIONS = """
When you have a response to say to the Human, you MUST use the format:
```
{ai_prefix}: [your response here]
```
"""
        SUFFIX = f"""You are a chatbot that conducts conversations based on video descriptions. You mainly answer based on the given description, and you can also modify the content according to the tag information, and you can also answer the relevant knowledge of the person or object contained in the video. The second description is a description for one second, so that you can convert it into time. When describing, please mainly refer to the second description. Dense caption is to give content every five seconds, you can disambiguate them in timing. But you don't create a video plot out of nothing.

Begin!

Video tags are: {tags}

The second description of the video is: {image_caption}

The dense caption of the video is: {dense_caption}

The general description of the video is: {video_caption}""" + """Previous conversation history {chat_history}

New input: {input}

{agent_scratchpad}
"""
        self.memory.clear()

        self.llm = OpenAI(temperature=0, openai_api_key=openai_api_key)

        self.agent = initialize_agent(
            self.tools,
            self.llm,
            agent="conversational-react-description",
            verbose=True,
            memory=self.memory,
            return_intermediate_steps=True,
            agent_kwargs={'prefix': PREFIX, 'format_instructions': FORMAT_INSTRUCTIONS, 'suffix': SUFFIX}, )
        state = state + [("I upload a video, please watch it first!", "I have watched this video. Let's chat!")]
        return gr.update(visible=True), state, state


if __name__ == "__main__":
    import pdb
    pdb.set_trace()
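`cut_dialogue_history` above drops whole lines from the front of the memory buffer until fewer than `keep_last_n_words` words remain (`run_text` calls it with 500). A small illustration with a synthetic buffer (the example text below is made up):

```python
from chatbot import cut_dialogue_history  # run from within video_chat/

# Synthetic 1800-word buffer: 300 question/answer line pairs.
buffer = "\n".join(f"Human: question {i}\nAI: answer {i}" for i in range(300))
trimmed = cut_dialogue_history(buffer, keep_last_n_words=500)
print(len(buffer.split()), len(trimmed.split()))  # 1800 -> just under 500 words kept
```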

video_chat/configs/med_config.json

+21
@@ -0,0 +1,21 @@
{
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 30524,
  "encoder_width": 768,
  "add_cross_attention": true
}
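This is a BERT-style decoder configuration in the format used by Tag2Text/BLIP-derived code; exactly how `models/tag2text.py` consumes it is not shown in this commit, so the loader below is only a sketch using Hugging Face `transformers`.

```python
from transformers import BertConfig

# Hedged sketch: inspect the shipped decoder config. Non-standard keys such as
# "encoder_width" are kept as extra attributes on the config object.
cfg = BertConfig.from_json_file("video_chat/configs/med_config.json")
print(cfg.hidden_size, cfg.num_hidden_layers, cfg.vocab_size)  # 768 12 30524
```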

video_chat/configs/q2l_config.json

+23
@@ -0,0 +1,23 @@
{
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 4,
  "num_hidden_layers": 2,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 30522,
  "encoder_width": 768,
  "add_cross_attention": true,
  "add_tag_cross_attention": false
}
@@ -0,0 +1,10 @@
{
  "ckpt": "pretrain_model/swin_base_patch4_window7_224_22k.pth",
  "vision_width": 1024,
  "image_res": 224,
  "window_size": 7,
  "embed_dim": 128,
  "depths": [ 2, 2, 18, 2 ],
  "num_heads": [ 4, 8, 16, 32 ]
}
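As a quick consistency check (not code from the repo): a Swin backbone's channel width doubles at each of its four stages, so an `embed_dim` of 128 yields the 1024-dimensional features declared as `vision_width` above.

```python
# Swin-B: four stages, channel width doubles at each stage.
embed_dim, num_stages = 128, 4
vision_width = embed_dim * 2 ** (num_stages - 1)
print(vision_width)  # 1024, matching "vision_width" in the config above
```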
