
Video Question Answering with Prior Knowledge and Object-sensitive Learning

Paper | TIP 2022

[Figure 1: Overview of the proposed PKOL architecture for video question answering.]

Table of Contents

  • Setups
  • Data Preparation
  • Experiments
  • Results
  • Reference
  • Citation
  • Acknowledgements

Setups

  • Ubuntu 20.04
  • CUDA 11.5
  • Python 3.7
  • PyTorch 1.7.0 + cu110
  1. Clone this repository:

```bash
git clone https://github.com/zchoi/PKOL.git
```

  2. Install dependencies:

```bash
conda create -n vqa python=3.7
conda activate vqa
pip install -r requirements.txt
```
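
Optionally, sanity-check the environment before moving on (a minimal sketch; the expected versions are the ones from the Setups list above):

```python
# Verify that Python / PyTorch / CUDA match the Setups list.
import sys
import torch

print(f"Python:  {sys.version.split()[0]}")   # expect 3.7.x
print(f"PyTorch: {torch.__version__}")        # expect 1.7.0+cu110
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```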

Data Preparation

  • Text Features

Download the pre-extracted text features from here (code: zca5) and place them into data/{dataset}-qa/ for MSVD-QA and MSRVTT-QA, or into data/tgif-qa/{question_type}/ for TGIF-QA.

  • Visual Features

    • For appearance and motion features, we used this repo [1].

    • For object features, we used a Faster R-CNN [2] pre-trained on Visual Genome [3].

    Download the pre-extracted visual features from here (code: zca5) and place them into data/{dataset}-qa/ for MSVD-QA and MSRVTT-QA, or into data/tgif-qa/{question_type}/ for TGIF-QA; a quick layout check is sketched below the note.

Important

The object features are very large (~700 GB for TGIF-QA alone), so make sure you have enough free disk space before downloading.
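
To confirm everything landed in the right place, you can run a quick layout check (a sketch; it assumes the dataset folders are literally named msvd-qa and msrvtt-qa, and that the TGIF-QA question-type folders match the four task names used below):

```python
# List the expected feature directories and flag any that are missing.
from pathlib import Path

expected = [Path("data/msvd-qa"), Path("data/msrvtt-qa")]
expected += [Path("data/tgif-qa") / t for t in ("action", "transition", "count", "frameqa")]

for d in expected:
    print(f"{'ok' if d.is_dir() else 'MISSING':7s} {d}")
```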

Experiments

For MSVD-QA and MSRVTT-QA:

Training:

```bash
python train_iterative.py --cfg configs/msvd_qa.yml
```

Evaluation:

```bash
python validate_iterative.py --cfg configs/msvd_qa.yml
```
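
Both scripts are driven entirely by the YAML config passed via --cfg. To inspect what a config contains before launching a run (a sketch; it assumes PyYAML is installed, and the key names inside are project-specific):

```python
# Load and print a training config to inspect its options.
import yaml

with open("configs/msvd_qa.yml") as f:
    cfg = yaml.safe_load(f)

for key, value in sorted(cfg.items()):
    print(f"{key}: {value}")
```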

For TGIF-QA:

Choose the config file configs/tgif_qa_{task}.yml for one of the four tasks: action, transition, count, or frameqa, to train/validate the model. For example, to train on the action task, run:

Training:

```bash
python train_iterative.py --cfg configs/tgif_qa_action.yml
```

Evaluation:

```bash
python validate_iterative.py --cfg configs/tgif_qa_action.yml
```
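
To sweep all four TGIF-QA tasks in one go, a small driver script works (a sketch; it assumes a configs/tgif_qa_{task}.yml file exists for every task, following the naming shown above):

```python
# Train, then evaluate, each TGIF-QA task back to back.
import subprocess

for task in ("action", "transition", "count", "frameqa"):
    cfg = f"configs/tgif_qa_{task}.yml"
    subprocess.run(["python", "train_iterative.py", "--cfg", cfg], check=True)
    subprocess.run(["python", "validate_iterative.py", "--cfg", cfg], check=True)
```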

Results

Performance on the MSVD-QA and MSRVTT-QA datasets (accuracy, %):

| Model | MSVD-QA | MSRVTT-QA |
| ----- | ------- | --------- |
| PKOL  | 41.1    | 36.9      |

Performance on the TGIF-QA dataset (Count is mean squared error, lower is better; the other tasks report accuracy, %):

| Model | Count ↓ | FrameQA ↑ | Trans. ↑ | Action ↑ |
| ----- | ------- | --------- | -------- | -------- |
| PKOL  | 3.67    | 61.8      | 82.8     | 74.6     |

Reference

[1] Le, Thao Minh, et al. "Hierarchical conditional relation networks for video question answering." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[2] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems 28 (2015).

[3] Krishna, Ranjay, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." International Journal of Computer Vision 123.1 (2017): 32-73.

Citation

```bibtex
@article{PKOL,
  title   = {Video Question Answering with Prior Knowledge and Object-sensitive Learning},
  author  = {Pengpeng Zeng and
             Haonan Zhang and
             Lianli Gao and
             Jingkuan Song and
             Heng Tao Shen},
  journal = {IEEE Transactions on Image Processing},
  doi     = {10.1109/TIP.2022.3205212},
  pages   = {5936--5948},
  year    = {2022}
}
```

Acknowledgements

Our code implementation is based on this repo.