SkimCap: A Transformer-Based Video Captioning Method with Adaptive Attention and Hierarchical Skimming Features
PyTorch code for our Sibgrapi 2025 paper "SkimCap: A Transformer-Based Video Captioning Method with Adaptive Attention and Hierarchical Skimming Features" by Leonardo Vilela Cardoso, Bernardo Palmer, Silvio Jamil F. Guimarães, and Zenilton K. G. Patrocínio Jr.
We present SkimCap, a transformer-based video captioning framework that integrates a memory-augmented architecture with adaptive attention and a novel feature selection strategy grounded in hierarchical video skimming. Unlike traditional approaches that rely on uniformly sampled frames or pre-defined temporal segments, SkimCap performs unsupervised hierarchical clustering to identify and extract semantically salient video shots. These condensed representations provide a compact yet information-rich input to the captioning model, enabling more accurate and contextually grounded sentence generation. The memory module enhances long-range dependency modeling, while adaptive attention improves temporal alignment between visual cues and generated tokens. We evaluate SkimCap on ActivityNet, achieving a CIDEr-D of 25.44, a BLEU-4 (B@4) of 10.77, and a lower Repetition-4 (R@4) score of 5.84, reflecting consistent improvements in caption quality and relevance. An ablation study confirms the effectiveness of hierarchical skimming as a feature selection mechanism, highlighting its contribution to overall performance. SkimCap sets a new direction for incorporating structured visual summarization into end-to-end captioning systems.
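To give a rough feel for the skimming idea described above, the sketch below agglomeratively merges temporally adjacent clusters of frame features and keeps one representative frame per resulting shot. This is a toy sketch only, not the paper's exact algorithm: the function name, the adjacent-pair merging rule, and the centroid-based representative selection are illustrative assumptions.

```python
import numpy as np

def skim_frames(features: np.ndarray, num_shots: int) -> list:
    """Toy hierarchical skimming: merge temporally adjacent frame clusters
    until `num_shots` remain, then return the index of the frame nearest
    each cluster centroid. (Illustrative only.)"""
    # Start with one cluster per frame; clusters stay temporally contiguous.
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > num_shots:
        # Find the adjacent pair of clusters whose centroids are closest.
        dists = [
            float(np.linalg.norm(features[clusters[i]].mean(0)
                                 - features[clusters[i + 1]].mean(0)))
            for i in range(len(clusters) - 1)
        ]
        j = int(np.argmin(dists))
        clusters[j] = clusters[j] + clusters.pop(j + 1)  # merge neighbors
    keyframes = []
    for c in clusters:
        centroid = features[c].mean(0)
        # Representative frame = cluster member closest to the centroid.
        keyframes.append(c[int(np.argmin(np.linalg.norm(features[c] - centroid, axis=1)))])
    return keyframes
```

Merging only temporally adjacent clusters keeps each resulting "shot" a contiguous span of the video, which is what makes the selected frames usable as an ordered input sequence for the captioning transformer.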
Developed and tested on an Ubuntu 22.04 PC with an RTX Quadro A6000 GPU. Main packages required:
| Python | PyTorch | CUDA Version | cuDNN Version | TensorBoard | TensorFlow | NumPy | H5py |
|---|---|---|---|---|---|---|---|
| 3.9 | 2.4.1 | 11.0 | 8005 | 2.4.1 | 2.3.0 | 1.20.2 | 2.10.0 |
- Clone this repository
```shell
# No need to add --recursive, as all dependencies are copied into this repo.
git clone https://github.com/IMScience-PPGINF-PucMinas/Adaptive-Transformer.git
cd Adaptive-Transformer
```
- Prepare feature files
Download features from Google Drive: rt_anet_feat.tar.gz (39GB) and rt_yc2_feat.tar.gz (12GB). These features are repacked from features provided by densecap.
```shell
mkdir video_feature && cd video_feature
tar -xf path/to/rt_anet_feat.tar.gz
tar -xf path/to/rt_yc2_feat.tar.gz
```
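After extraction, it can be useful to sanity-check a feature file before training. The helper below is hypothetical: the exact file layout and naming inside the archives is an assumption, so inspect the extracted directories for the real filenames first.

```python
import numpy as np

def describe_feature(path: str) -> tuple:
    """Load one per-video feature array and return its shape.
    (Hypothetical helper -- the actual filenames/layout inside the
    extracted archives should be checked after unpacking.)"""
    feat = np.load(path)
    print(f"{path}: shape={feat.shape}, dtype={feat.dtype}")
    return feat.shape
```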
- Install dependencies
- Python 3.9
- PyTorch 2.4.1
- nltk
- easydict
- tqdm
- tensorboardX
- Add the project root to PYTHONPATH
```shell
source setup.sh
```
Note that you need to do this each time you start a new session.
Below we give examples of how to perform training and inference with Adaptive-Transformer.
- Build vocabulary
```shell
bash scripts/build_vocab.sh DATASET_NAME
```
where DATASET_NAME is anet for ActivityNet Captions or yc2 for YouCookII.
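For readers unfamiliar with this step, a captioning vocabulary is typically built by counting tokens over the training captions and keeping those above a minimum frequency, with special tokens reserved at the start. The sketch below illustrates that pattern; the special-token names and the `min_freq` threshold are assumptions, not what scripts/build_vocab.sh necessarily uses.

```python
from collections import Counter

def build_vocab(captions, min_freq=2):
    """Minimal vocabulary-building sketch: tokens seen fewer than
    `min_freq` times are dropped (they would map to [UNK] at runtime)."""
    specials = ["[PAD]", "[BOS]", "[EOS]", "[UNK]"]  # assumed token names
    counts = Counter(tok for cap in captions for tok in cap.lower().split())
    words = [w for w, c in counts.most_common() if c >= min_freq]
    return {w: i for i, w in enumerate(specials + words)}

vocab = build_vocab(["a man is cooking", "a man is running"], min_freq=2)
# "a", "man", "is" appear twice and survive; "cooking"/"running" are pruned
```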
- Adaptive-Transformer training
The general training command is:
```shell
bash scripts/train.sh DATASET_NAME
```
To train our Adaptive-Transformer model on ActivityNet Captions:
```shell
bash scripts/train.sh anet
```
The training log and model checkpoints will be saved at results/anet_re_*.
Once you have a trained model, you can follow the instructions below to generate captions.
- Generate captions
```shell
bash scripts/translate_greedy.sh anet_re_* val
```
Replace anet_re_* with your own model directory name. The generated captions are saved at results/anet_re_*/greedy_pred_val.json.
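The "greedy" in scripts/translate_greedy.sh refers to greedy decoding: at each step the single highest-scoring next token is taken, with no beam search. A minimal, framework-free sketch of that loop is below; the function and parameter names are illustrative, not the repo's actual API.

```python
def greedy_decode(step_fn, bos_id, eos_id, max_len=20):
    """Greedy decoding loop in miniature (illustrative sketch).
    `step_fn(prefix)` returns a list of next-token logits given the
    tokens generated so far; decoding stops at EOS or `max_len` steps."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = step_fn(tokens)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```

Greedy decoding is fast and deterministic, which makes the generated predictions easy to reproduce and compare across model checkpoints.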
- Evaluate generated captions
```shell
bash scripts/eval.sh anet val results/anet_re_*/greedy_pred_val.json
```
The results should be comparable with those we report in Table 2 of the paper, e.g., B@4 10.77, CIDEr-D 25.44, R@4 5.84.
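Of the reported metrics, Repetition-4 (R@4) is the least standard: it penalizes a paragraph for repeating its own 4-grams, so lower is better. The sketch below shows one common formulation (fraction of 4-grams that are repeats of an earlier occurrence); the official eval script may tokenize or normalize differently.

```python
from collections import Counter

def repetition_at_n(text: str, n: int = 4) -> float:
    """Sketch of a Repetition-n style score: the fraction of n-grams in a
    paragraph that repeat an earlier occurrence (lower is better)."""
    toks = text.lower().split()
    grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeats = sum(c - 1 for c in counts.values())  # extra copies only
    return repeats / len(grams)
```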
If you find this code useful for your research, consider citing one of our papers:
@article{cardoso2023hierarchical,
title={Hierarchical time-aware summarization with an adaptive transformer for video captioning},
author={Cardoso, Leonardo Vilela and Guimar{\~a}es, Silvio Jamil Ferzoli and do Patroc{\'\i}nio J{\'u}nior, Zenilton Kleber Gon{\c{c}}alves},
journal={International Journal of Semantic Computing},
volume={17},
number={04},
pages={569--592},
year={2023},
publisher={World Scientific}
}
@inproceedings{cardoso2022exploring,
title={Exploring adaptive attention in memory transformer applied to coherent video paragraph captioning},
author={Cardoso, Leonardo Vilela and Guimaraes, Silvio Jamil F and Patrocinio, Zenilton KG},
booktitle={2022 IEEE Eighth International Conference on Multimedia Big Data (BigMM)},
pages={37--44},
year={2022},
organization={IEEE}
}
@inproceedings{cardoso2021enhanced,
title={Enhanced-Memory Transformer for Coherent Paragraph Video Captioning},
author={Cardoso, Leonardo Vilela and Guimaraes, Silvio Jamil F and Patroc{\'\i}nio, Zenilton KG},
booktitle={2021 IEEE 33rd International Conference on Tools with Artificial Intelligence (ICTAI)},
pages={836--840},
year={2021},
organization={IEEE}
}
This code used resources from the following projects: emt, mart, transformers, transformer-xl, densecap, OpenNMT-py.
Contact: Leonardo Vilela Cardoso (lvcardoso@sga.pucminas.br)