
VueICL: Entity-Aware Video Question Answering through In-Context Learning of Visual Annotations


Introduction

VueICL is a framework for entity-aware video question answering that leverages in-context learning of visual annotations. By integrating personalized visual markers with state-of-the-art vision-language models (VLMs), VueICL answers entity-specific queries in long videos efficiently and accurately, without extensive model fine-tuning.

Features

  • Two-Stage Pipeline: Annotate video frames with visual markers and integrate these annotations into VLMs.
  • Entity-Aware Reasoning: Enhanced capability to reference and differentiate multiple entities within a single conversation context (see the prompt sketch after this list).
  • Scalable and Efficient: Avoids computational overhead associated with traditional fine-tuning methods.
  • Benchmark and Evaluation Scripts: Provides a curated benchmark for entity-aware video question answering.
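
Below is a minimal, illustrative sketch of how an entity-aware prompt can be phrased once frames carry visual markers. The exact "long" and "short" prompt templates referenced in the Results table live in the repository; the wording and the build_prompt helper here are assumptions, not the project's actual API.

    # Illustrative prompt construction (hypothetical helper; the real
    # "long" and "short" templates are defined in the repository).
    def build_prompt(entities, question):
        # Tell the VLM how to read the visual annotations, then pose
        # the entity-specific question.
        legend = ", ".join(f"{name} (red box labeled '{name}')" for name in entities)
        return (
            f"The video frames show these annotated people: {legend}. "
            "Use the red bounding boxes and their labels to tell them apart. "
            f"{question}"
        )

    print(build_prompt(["Alice", "Bob"], "Who opens the door first?"))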

Installation

Prerequisites

  • Python 3.8+
  • Conda (for environment management)

Steps

  1. Clone the Repository

    git clone https://github.com/ram-elgov/VueICL.git
    cd VueICL
  2. Create a Conda Environment

    conda create -n vueicl_env python=3.8
    conda activate vueicl_env
  3. Install Dependencies

    pip install -r requirements.txt

Data Preparation

Folder Structure

Ensure that your data directory follows the structure below:

raw_videos_characters/
├── video1/
│   ├── 1.jpg
│   ├── 2.jpg
│   ├── video1.mp4
│   └── ...
├── video2/
│   ├── 1.jpg
│   ├── 2.jpg
│   ├── 3.jpg
│   ├── 4.png
│   ├── video2.mp4
│   └── ...
└── ...

Details

  • raw_videos_characters/: Root directory containing one subdirectory per raw video.
  • videoX/: One subdirectory per video, holding the annotated reference frames (1.jpg, 2.jpg, …) and the raw video file (videoX.mp4); a sanity-check sketch follows this list.
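
As a sanity check, the following sketch (not part of the repository) walks raw_videos_characters/ and confirms that each video directory contains reference frames and the video file:

    # Verify the expected layout: each videoX/ directory should hold
    # annotated reference frames (.jpg/.png) plus the raw .mp4 file.
    from pathlib import Path

    root = Path("raw_videos_characters")
    for video_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        frames = [f for f in video_dir.iterdir()
                  if f.suffix.lower() in {".jpg", ".png"}]
        videos = list(video_dir.glob("*.mp4"))
        status = "OK" if frames and videos else "MISSING FILES"
        print(f"{video_dir.name}: {len(frames)} frame(s), "
              f"{len(videos)} video file(s) [{status}]")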

Data and Evaluation Scripts

We provide all necessary data and evaluation scripts in our Google Drive. Please download and place the raw_videos_characters folder in the root directory of the repository.

Models

LongVU

Download the LongVU model from the official repository:

Qwen2-VL

Download the Qwen2-VL model from the official repository:

Note: Ensure that both models are correctly placed in the specified directories as per the instructions in their respective repositories.
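
As a rough orientation, Qwen2-VL can be loaded through Hugging Face Transformers as sketched below. This assumes the Qwen/Qwen2-VL-7B-Instruct checkpoint and is not taken from this repository's code; adjust the identifier to match the weights you downloaded.

    # Minimal Qwen2-VL loading sketch via Hugging Face Transformers
    # (assumed checkpoint name; see the Qwen2-VL repository for details).
    from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")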

Usage

Running the Framework

  1. Annotate Videos: Ensure that your videos are annotated with red bounding boxes and corresponding entity labels, following the folder structure above (see the annotation sketch after these steps).

  2. Execute the Inference Script

    python inference.py --model LongVU --model_path path_to_longvu_model --data_path raw_videos_characters

    Replace path_to_longvu_model with the actual path to your LongVU model.

  3. Evaluate Performance

    python evaluate.py --results results.json --benchmark_path path_to_benchmark

    Replace path_to_benchmark with the actual path to your benchmark dataset.
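
For step 1, a minimal OpenCV annotation sketch is shown below. It is illustrative only (the box coordinates, label, and file names are examples), not the repository's annotation tooling; note that (0, 0, 255) is red in OpenCV's BGR color order.

    # Draw a red bounding box and an entity label on a single frame.
    import cv2

    frame = cv2.imread("raw_videos_characters/video1/1.jpg")
    x, y, w, h = 50, 40, 120, 160  # example entity location
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 2)
    cv2.putText(frame, "Alice", (x, y - 8),
                cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)
    cv2.imwrite("raw_videos_characters/video1/1_annotated.jpg", frame)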

Evaluation

Our evaluation is based on a curated benchmark specifically designed for entity-aware video question answering. The benchmark includes a set of closed-ended questions tailored to assess the model's ability to identify and reason about specific entities within video content.

Metrics

  • Questions: Number of the benchmark's 100 questions answered correctly.
  • Videos: Number of the 22 benchmark videos for which every question was answered correctly (a scoring sketch follows this list).
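
A small scoring sketch for these two metrics is given below, assuming a results.json of the form [{"video": "video1", "correct": true}, ...] with one entry per question; the actual schema written by inference.py may differ.

    # Compute question-level and video-level accuracy from results.json
    # (assumed schema: a list of {"video": str, "correct": bool} entries).
    import json
    from collections import defaultdict

    with open("results.json") as f:
        results = json.load(f)

    per_video = defaultdict(list)
    for r in results:
        per_video[r["video"]].append(r["correct"])

    questions_correct = sum(r["correct"] for r in results)
    videos_solved = sum(all(v) for v in per_video.values())
    print(f"Questions: {questions_correct}/{len(results)}")
    print(f"Videos:    {videos_solved}/{len(per_video)}")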

Results

Refer to the Results section in the paper for detailed performance metrics. Below is a summary of our findings:

Method                   Questions   Videos
LongVU (empty video)     44/100      1/22
LongVU (no annotation)   54/100      3/22
Qwen2-VL                 49/100      4/22
VueICL (long prompt)     67/100      6/22
VueICL (short prompt)    68/100      6/22

Results Graph
Figure 4: Performance comparison of methods on the entity-aware video understanding benchmark. VueICL methods significantly outperform existing baselines and state-of-the-art models in both question-solving accuracy and video-level comprehension.

Contributing

Contributions are welcome! Please follow these steps to contribute:

  1. Fork the Repository
  2. Create a Feature Branch
    git checkout -b feature/YourFeature
  3. Commit Your Changes
    git commit -m "Add Your Feature"
  4. Push to the Branch
    git push origin feature/YourFeature
  5. Open a Pull Request

Please ensure that your contributions adhere to the project's coding standards and include appropriate tests.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this work in your research, please cite it as follows:

@inproceedings{your2025vueicl,
  title={VueICL: Entity-Aware Video Question Answering through In-Context Learning of Visual Annotations},
  author={Levinson, Shahaf and Elgov, Ram and Benizri, Yonatan and Schwartz, Idan},
  booktitle={Proceedings of the 42nd International Conference on Machine Learning (ICML)},
  year={2025},
}

Contact

For any questions or suggestions, please contact the authors.
