- Introduction
- Features
- Installation
- Data Preparation
- Models
- Usage
- Evaluation
- Results
- Contributing
- License
- Citation
- Contact
VueICL is a framework designed for Entity-Aware Video Question Answering leveraging In-Context Learning of visual annotations. By integrating personalized visual markers with state-of-the-art Vision-Language Models (VLMs), VueICL enables efficient and accurate answering of entity-specific queries in long videos without the need for extensive model fine-tuning.
- Two-Stage Pipeline: Annotate video frames with visual markers and integrate these annotations into VLMs.
- Entity-Aware Reasoning: Enhanced capability to reference and differentiate multiple entities within a single conversation context.
- Scalable and Efficient: Avoids computational overhead associated with traditional fine-tuning methods.
- Benchmark and Evaluation Scripts: Provides a curated benchmark for entity-aware video question answering.
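As a concrete illustration of the first stage (visual-marker annotation), the sketch below draws a red bounding box and an entity label on a single reference frame using OpenCV. The file paths, coordinates, and label are placeholders, not the repository's own annotation tooling.

```python
# Minimal sketch of stage 1: drawing a visual marker on a reference frame.
# Assumes OpenCV (pip install opencv-python); paths, coordinates, and the
# label "Alice" are illustrative placeholders.
import cv2

frame = cv2.imread("raw_videos_characters/video1/1.jpg")  # a reference frame
x, y, w, h = 40, 60, 120, 180                             # hypothetical entity region

# Red bounding box (OpenCV uses BGR, so red is (0, 0, 255)).
cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 2)

# Entity label placed just above the box.
cv2.putText(frame, "Alice", (x, y - 8),
            cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)

cv2.imwrite("raw_videos_characters/video1/1_annotated.jpg", frame)
```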
- Python 3.8+
- Conda (for environment management)
- Clone the Repository

  ```bash
  git clone https://github.com/yourusername/VueICL.git
  cd VueICL
  ```

- Create a Conda Environment

  ```bash
  conda create -n vueicl_env python=3.8
  conda activate vueicl_env
  ```

- Install Dependencies

  ```bash
  pip install -r requirements.txt
  ```
Ensure that your data directory follows the structure below:
```
raw_videos_characters/
├── video1/
│   ├── 1.jpg
│   ├── 2.jpg
│   ├── video1.mp4
│   └── ...
├── video2/
│   ├── 1.jpg
│   ├── 2.jpg
│   ├── 3.jpg
│   ├── 4.png
│   ├── video2.mp4
│   └── ...
└── ...
```
- `raw_videos_characters/`: Root directory containing all raw videos.
- `videoX/`: One subdirectory per video, containing the video file and its annotated reference frames.
We provide all necessary data and evaluation scripts in our Google Drive. Please download and place the raw_videos_characters folder in the root directory of the repository.
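Before running inference, a quick layout check can save debugging time. The helper below is a hypothetical sketch (not part of the repository) that verifies each video subdirectory contains an .mp4 file and at least one image frame:

```python
# Hypothetical sanity check for the expected raw_videos_characters/ layout.
from pathlib import Path

root = Path("raw_videos_characters")
for video_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    has_video = any(video_dir.glob("*.mp4"))
    frames = [f for f in video_dir.iterdir() if f.suffix.lower() in {".jpg", ".png"}]
    if not has_video:
        print(f"[warn] {video_dir.name}: no .mp4 file found")
    if not frames:
        print(f"[warn] {video_dir.name}: no annotated frames (.jpg/.png) found")
    else:
        print(f"[ok]   {video_dir.name}: {len(frames)} frame(s), video={'yes' if has_video else 'no'}")
```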
Download the LongVU model from the official repository:
- GitHub Repository: LongVU GitHub
Download the Qwen2-VL model from the official repository:
- GitHub Repository: Qwen2-VL GitHub
Note: Ensure that both models are correctly placed in the specified directories as per the instructions in their respective repositories.
- Annotate Videos

  Ensure that your videos are annotated with red bounding boxes and corresponding entity labels, following the folder structure above.

- Execute the Inference Script

  ```bash
  python inference.py --model LongVU --model_path path_to_longvu_model --data_path raw_videos_characters
  ```

  Replace `path_to_longvu_model` with the actual path to your LongVU model.

- Evaluate Performance

  ```bash
  python evaluate.py --results results.json --benchmark_path path_to_benchmark
  ```

  Replace `path_to_benchmark` with the actual path to your benchmark dataset. A small driver that chains the inference and evaluation commands is sketched below.
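If you prefer running both steps from one place, the sketch below simply shells out to inference.py and evaluate.py with the flags documented above. It assumes inference.py writes results.json (as used in the evaluation command); the model and benchmark paths are placeholders to fill in.

```python
# Minimal driver that runs inference and then evaluation, using the
# command-line flags documented above. Paths are placeholders.
import subprocess

MODEL_PATH = "path_to_longvu_model"    # replace with your LongVU checkpoint path
BENCHMARK_PATH = "path_to_benchmark"   # replace with your benchmark path

subprocess.run(
    ["python", "inference.py",
     "--model", "LongVU",
     "--model_path", MODEL_PATH,
     "--data_path", "raw_videos_characters"],
    check=True,
)

subprocess.run(
    ["python", "evaluate.py",
     "--results", "results.json",
     "--benchmark_path", BENCHMARK_PATH],
    check=True,
)
```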
Our evaluation is based on a curated benchmark specifically designed for entity-aware video question answering. The benchmark includes a set of closed-ended questions tailored to assess the model's ability to identify and reason about specific entities within video content.
- Questions: Number of questions answered correctly out of 100.
- Videos: Number of videos where all questions were answered correctly out of 22.
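The exact schema of results.json is not documented here, so the sketch below assumes a hypothetical format: a list of per-question records, each carrying a video identifier and a correctness flag. It only illustrates how the two metrics above would be derived.

```python
# Sketch of the two benchmark metrics, assuming a hypothetical results format:
# a list of records like {"video": "video1", "question": "...", "correct": true}.
import json
from collections import defaultdict

with open("results.json") as f:
    records = json.load(f)

# Metric 1: questions answered correctly (out of all questions, e.g. 100).
questions_correct = sum(1 for r in records if r["correct"])

# Metric 2: videos where every question was answered correctly (out of e.g. 22).
per_video = defaultdict(list)
for r in records:
    per_video[r["video"]].append(r["correct"])
videos_all_correct = sum(1 for answers in per_video.values() if all(answers))

print(f"Questions: {questions_correct}/{len(records)}")
print(f"Videos:    {videos_all_correct}/{len(per_video)}")
```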
Refer to the Results section in the paper for detailed performance metrics. Below is a summary of our findings:
| Method | Questions (correct/100) | Videos (all correct/22) |
|---|---|---|
| LongVU (empty video) | 44/100 | 1/22 |
| LongVU (no annotation) | 54/100 | 3/22 |
| Qwen2-VL | 49/100 | 4/22 |
| VueICL (long prompt) | 67/100 | 6/22 |
| VueICL (short prompt) | 68/100 | 6/22 |
Figure 4: Performance comparison graph of different methods on the entity-aware video understanding benchmark. VueICL methods significantly outperform existing baselines and state-of-the-art models in both question-solving accuracy and video-level comprehension.
Contributions are welcome! Please follow these steps to contribute:
- Fork the Repository
- Create a Feature Branch

  ```bash
  git checkout -b feature/YourFeature
  ```

- Commit Your Changes

  ```bash
  git commit -m "Add Your Feature"
  ```

- Push to the Branch

  ```bash
  git push origin feature/YourFeature
  ```

- Open a Pull Request
Please ensure that your contributions adhere to the project's coding standards and include appropriate tests.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this work in your research, please cite it as follows:
```bibtex
@inproceedings{your2025vueicl,
  title={VueICL: Entity-Aware Video Question Answering through In-Context Learning of Visual Annotations},
  author={Levinson, Shahaf and Elgov, Ram and Benizri, Yonatan and Schwartz, Idan},
  booktitle={Proceedings of the 38th International Conference on Machine Learning (ICML)},
  year={2025},
}
```

For any questions or suggestions, please contact:
- Shahaf Levinson - [email protected]
- Ram Elgov - [email protected]
- Yonatan Benizri - [email protected]
- Idan Schwartz - [email protected]