This repository contains the source code of our AAAI 2024 paper "Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval" [ Paper | Appendix ].
We propose a novel method called Cross-modal and Uni-modal Soft-label Alignment (CUSA) to address the inter-modal matching missing problem and the intra-modal semantic loss problem in existing image-text retrieval methods. The figure below illustrates our method.
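As a minimal sketch of the idea (not the exact losses from the paper; the teacher model, temperatures, and loss weights below are illustrative placeholders), soft-label alignment can be written as a KL divergence that pushes the student's similarity distribution toward soft labels produced by a frozen uni-modal teacher:

```python
import torch
import torch.nn.functional as F

def soft_label_alignment(student_sim, teacher_sim, tau_s=0.05, tau_t=0.05):
    # KL divergence between the student's similarity distribution and the
    # soft-label distribution given by a frozen uni-modal teacher.
    log_p = F.log_softmax(student_sim / tau_s, dim=-1)
    q = F.softmax(teacher_sim / tau_t, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean")

# Toy batch; all embeddings are L2-normalized.
B = 8
img = F.normalize(torch.randn(B, 512), dim=-1)    # student image embeddings
txt = F.normalize(torch.randn(B, 512), dim=-1)    # student text embeddings
t_txt = F.normalize(torch.randn(B, 384), dim=-1)  # frozen text teacher, e.g. Sentence-BERT

soft_labels = t_txt @ t_txt.T                     # teacher similarities used as soft labels
# Cross-modal alignment: image-to-text similarities follow the soft labels.
loss_csa = soft_label_alignment(img @ txt.T, soft_labels)
# Uni-modal alignment: text-to-text similarities follow the soft labels.
loss_usa = soft_label_alignment(txt @ txt.T, soft_labels)
loss = loss_csa + loss_usa                        # added to the usual contrastive loss
```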
See requirements.txt
For training and limited evaluation:

```bash
# python >= 3.9
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
pip install transformers sentence-transformers tqdm scikit-learn ftfy
```
For evaluation:

```bash
# -- ECCV Caption --
# For more detailed information, please refer to https://github.com/naver-ai/eccv-caption
pip install eccv_caption pycocotools ujson

# -- Img Retrieval --
# Our repository contains the relevant code.
# For more detailed information, please refer to https://github.com/deepglint/unicom
pip install pandas

# -- STS --
# Get the code from https://github.com/princeton-nlp/SimCSE and install SentEval.
git clone https://github.com/princeton-nlp/SimCSE.git
# In the file SimCSE/SentEval/senteval/sts.py, modify lines 42 and 43 to read:
#   <42> sent1 = np.array([s.split() for s in sent1], dtype=object)[not_empty_idx]
#   <43> sent2 = np.array([s.split() for s in sent2], dtype=object)[not_empty_idx]
cd SimCSE/SentEval
pip install .
pip install prettytable
```
Image-Text Retrieval training/evaluation
Please refer to ALBEF (https://github.com/salesforce/ALBEF) for how to build the datasets.
For more data examples, see the folder dataset_example.
Here is the data format:
train.json:

```json
[
    {
        "image_path": "<absPath>/COCO_val2014_000000391895.jpg",
        "caption": "A man with a red helmet on a small moped on a dirt road. ",
        "image_id": "COCO_val2014_000000391895.jpg"
    },
]
```
train_unicom.npy (a dict mapping each image_id to its pre-extracted unicom image feature):

```python
{ "<image_id>": "<feature>", }
```
Image retrieval task evaluation
See the code file evaluation_img.py.
For more detailed information, please refer to https://github.com/deepglint/unicom
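For orientation, unicom-style image retrieval is usually scored with leave-one-out Recall@1 over class-labeled features; here is a self-contained sketch of that metric on random data (not the exact protocol of evaluation_img.py):

```python
import torch
import torch.nn.functional as F

# Random stand-ins for image features and their class labels.
feats = F.normalize(torch.randn(100, 64), dim=-1)
labels = torch.randint(0, 10, (100,))

sim = feats @ feats.T
sim.fill_diagonal_(float("-inf"))  # exclude each query from its own results
nearest = sim.argmax(dim=-1)       # index of the top-1 neighbour per query
recall_at_1 = (labels[nearest] == labels).float().mean().item()
print(f"Recall@1: {recall_at_1:.3f}")
```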
STS task evaluation
See the code file evaluation_sts.py.
For more detailed information, please refer to https://github.com/princeton-nlp/SimCSE
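STS benchmarks score the Spearman correlation between the model's cosine similarities and human judgments; the real protocol is handled by SentEval, but the metric itself can be sketched on toy data:

```python
import numpy as np
from scipy.stats import spearmanr

# Random stand-ins for sentence-pair embeddings and gold similarity scores.
emb1, emb2 = np.random.randn(50, 256), np.random.randn(50, 256)
gold = np.random.uniform(0, 5, size=50)

# Cosine similarity per sentence pair, compared against the gold scores.
cos = (emb1 * emb2).sum(-1) / (np.linalg.norm(emb1, axis=-1) * np.linalg.norm(emb2, axis=-1))
rho, _ = spearmanr(cos, gold)
print("Spearman:", rho)
```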
Training Scripts:
```bash
torchrun --nproc_per_node=4 --master-port 25110 retrieval.py --config "<configPath>"

# Test that the environment is set up correctly:
torchrun --nproc_per_node=4 --master-port 25110 retrieval.py --config "./configs/test.yaml"

# e.g.
torchrun --nproc_per_node=4 --master-port 25110 retrieval.py --config "./configs/vitb32/coco/only_contrastive.yaml"
torchrun --nproc_per_node=4 --master-port 25110 retrieval.py --config "./configs/vitb32/coco/cusa.yaml"
```
Evaluation Scripts:
```bash
# -- ECCV Caption --
# see evaluation_eccv.py
python evaluation_eccv.py

# -- Img Retrieval --
# see evaluation_img.py
python evaluation_img.py

# -- STS --
# see evaluation_sts.py
python evaluation_sts.py
```
NOTE: The released code has been refactored, so in some cases it may contain bugs that we did not catch; these do not affect the results reported in our paper.
If you have any questions, please submit an issue or contact lerogohl<AT>gmail.com or huanghl<AT>buaa.edu.cn.
- Datasets, checkpoints (re-run), and logs (re-run) can be found at this link: google drive
If you find this method or code useful, please cite:

```bibtex
@inproceedings{huang2024cusa,
  title={Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval},
  author={Huang, Hailang and Nie, Zhijie and Wang, Ziqiao and Shang, Ziyu},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={16},
  pages={18298--18306},
  year={2024}
}
```