PyTorch implementation of RERD
Deep Multimodal Sequence Fusion by Regularized Expressive Representation Distillation
Xiaobao Guo, Adams Wai-Kin Kong*, and Alex Kot
IEEE Transactions on Multimedia, 2022.
Please cite our paper if you find our work useful for your research:
```bibtex
@article{guo2022deep,
  title={Deep Multimodal Sequence Fusion by Regularized Expressive Representation Distillation},
  author={Guo, Xiaobao and Kong, Wai-Kin Adams and Kot, Alex C},
  journal={IEEE Transactions on Multimedia},
  year={2022},
  publisher={IEEE}
}
```
RERD comprises two major components built on an intermediate-fusion pipeline: (1) a multi-head distillation encoder that enhances unimodal representations from unaligned multimodal sequences, where distillation attention layers dynamically capture and extract the most expressive unimodal features, and (2) a novel multimodal Sinkhorn distance regularizer that aids joint optimization during training.
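For intuition, here is a minimal sketch of an entropic-regularized Sinkhorn distance in PyTorch. It is a generic log-domain implementation of the Sinkhorn iteration, not the exact regularizer from this repository; the function name `sinkhorn_distance`, the uniform marginals, and the squared-Euclidean cost are illustrative assumptions.

```python
import math
import torch

def sinkhorn_distance(x, y, eps=0.1, n_iters=50):
    """Generic entropic-regularized OT distance between feature sets
    x: (n, d) and y: (m, d), assuming uniform marginals.
    Illustrative sketch only, not RERD's exact regularizer."""
    cost = torch.cdist(x, y, p=2) ** 2  # (n, m) squared-Euclidean costs
    n, m = cost.shape
    log_mu = torch.full((n,), -math.log(n), device=x.device, dtype=x.dtype)
    log_nu = torch.full((m,), -math.log(m), device=x.device, dtype=x.dtype)
    f = torch.zeros_like(log_mu)  # dual potential for x
    g = torch.zeros_like(log_nu)  # dual potential for y
    for _ in range(n_iters):
        # Log-domain alternating updates enforcing the row/column marginals.
        f = -eps * torch.logsumexp((g[None, :] - cost) / eps + log_nu[None, :], dim=1)
        g = -eps * torch.logsumexp((f[:, None] - cost) / eps + log_mu[:, None], dim=0)
    # Recover the transport plan and return the transport cost <pi, C>.
    log_pi = (f[:, None] + g[None, :] - cost) / eps + log_mu[:, None] + log_nu[None, :]
    return torch.sum(torch.exp(log_pi) * cost)
```

A term of this kind would presumably be weighted by `--reg_lambda` in the training command below.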
- Python 3.7 or above
- PyTorch (>=1.0.0) and torchvision
- CUDA 10.0 or above
The processed MOSI and MOSEI datasets can be downloaded from here.
The SIMS dataset can be downloaded from here.
The pretrained BERT model can be found here.
- Create folders for data and models:
```bash
mkdir data all_models
mkdir data/pretrained_bert
```
and put or link the data under `data/`.
- Training:
```bash
python main.py [--params]
```
e.g.,
```bash
CUDA_VISIBLE_DEVICES=4,5 python main.py \
  --model=RERD --lonly --aonly --vonly \
  --name='RERD-01' \
  --dataset='mosei' --data_path='./data/MOSEI' \
  --batch_size=16 --use_bert=True \
  --bert_path='./data/pretrained_bert/' \
  --dis_d_mode=64 --dis_n_heads=4 --dis_e_layers=2 \
  --optim='Adam' --reg_lambda=0.1 \
  --schedule='c' --lr=0.001 --nlevels=2
```
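The sketch below shows one plausible way `--reg_lambda` could combine the Sinkhorn term above with the task loss in a training step. The model interface, variable names, and pairing of modalities are hypothetical placeholders; the actual loss composition is defined in the repository code.

```python
def training_step(model, batch, criterion, optimizer, reg_lambda):
    """Hypothetical training step: supervised task loss plus a Sinkhorn
    regularization term weighted by reg_lambda (--reg_lambda on the CLI).
    Reuses the sinkhorn_distance sketch above; the model's return
    signature is an assumption, not the repo's actual API."""
    preds, text_repr, audio_repr, video_repr = model(batch)  # assumed interface
    task_loss = criterion(preds, batch["labels"])            # supervised task loss
    # Regularize pairs of unimodal representations (pairing is illustrative).
    reg = (sinkhorn_distance(text_repr, audio_repr)
           + sinkhorn_distance(text_repr, video_repr))
    loss = task_loss + reg_lambda * reg                      # weighted total objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```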
Some portions of the code were adapted from the fairseq, MMSA, and Informer repositories.