This repository contains the source code of our AAAI 2024 paper "Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval" [ Paper | Appendix ].
We propose a novel method called Cross-modal and Uni-modal Soft-label Alignment (CUSA) to address the inter-modal matching missing problem and the intra-modal semantic loss problem in existing image-text retrieval methods. The figure below illustrates our method.
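As a minimal sketch of the idea (not the exact losses from the paper; the teacher model, temperatures, and loss weights below are illustrative placeholders), soft-label alignment can be written as a KL divergence that pushes the student's similarity distribution toward soft labels produced by a frozen uni-modal teacher:

```python
import torch
import torch.nn.functional as F

def soft_label_alignment(student_sim, teacher_sim, tau_s=0.05, tau_t=0.05):
    # KL divergence between the student's similarity distribution and the
    # soft-label distribution given by a frozen uni-modal teacher.
    log_p = F.log_softmax(student_sim / tau_s, dim=-1)
    q = F.softmax(teacher_sim / tau_t, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean")

# Toy batch; all embeddings are L2-normalized.
B = 8
img = F.normalize(torch.randn(B, 512), dim=-1)    # student image embeddings
txt = F.normalize(torch.randn(B, 512), dim=-1)    # student text embeddings
t_txt = F.normalize(torch.randn(B, 384), dim=-1)  # frozen text teacher, e.g. Sentence-BERT

soft_labels = t_txt @ t_txt.T                     # teacher similarities used as soft labels
# Cross-modal alignment: image-to-text similarities follow the soft labels.
loss_csa = soft_label_alignment(img @ txt.T, soft_labels)
# Uni-modal alignment: text-to-text similarities follow the soft labels.
loss_usa = soft_label_alignment(txt @ txt.T, soft_labels)
loss = loss_csa + loss_usa                        # added to the usual contrastive loss
```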
See requirements.txt
For training and limited evaluation:

```bash
# python >= 3.9
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
pip install transformers sentence-transformers tqdm scikit-learn ftfy
```
For evaluation:

```bash
# -- ECCV Caption --
# For more detailed information, please refer to https://github.com/naver-ai/eccv-caption
pip install eccv_caption pycocotools ujson

# -- Img Retrieval --
# Our repository contains the relevant code.
# For more detailed information, please refer to https://github.com/deepglint/unicom
pip install pandas

# -- STS --
# Get the code from https://github.com/princeton-nlp/SimCSE and install SentEval.
git clone https://github.com/princeton-nlp/SimCSE.git
# In the file SimCSE/SentEval/senteval/sts.py, modify lines 42 and 43 to read:
#   <42> sent1 = np.array([s.split() for s in sent1], dtype=object)[not_empty_idx]
#   <43> sent2 = np.array([s.split() for s in sent2], dtype=object)[not_empty_idx]
cd SimCSE/SentEval
pip install .
pip install prettytable
```
Image-Text Retrieval training/evaluation
Please refer to ALBEF (https://github.com/salesforce/ALBEF) for how to build the datasets.
For more data examples, see the folder dataset_example.
Here is the data format:
train.json:

```json
[
    {
        "image_path": "<absPath>/COCO_val2014_000000391895.jpg",
        "caption": "A man with a red helmet on a small moped on a dirt road. ",
        "image_id": "COCO_val2014_000000391895.jpg"
    },
]
```
train_unicom.npy (a dict mapping each image_id to its pre-extracted unicom image feature):

```python
{ "<image_id>": "<feature>", }
```
Image retrieval task evaluation
See the code file evaluation_img.py.
For more detailed information, please refer to https://github.com/deepglint/unicom
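For orientation, unicom-style image retrieval is usually scored with leave-one-out Recall@1 over class-labeled features; here is a self-contained sketch of that metric on random data (not the exact protocol of evaluation_img.py):

```python
import torch
import torch.nn.functional as F

# Random stand-ins for image features and their class labels.
feats = F.normalize(torch.randn(100, 64), dim=-1)
labels = torch.randint(0, 10, (100,))

sim = feats @ feats.T
sim.fill_diagonal_(float("-inf"))  # exclude each query from its own results
nearest = sim.argmax(dim=-1)       # index of the top-1 neighbour per query
recall_at_1 = (labels[nearest] == labels).float().mean().item()
print(f"Recall@1: {recall_at_1:.3f}")
```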
STS task evaluation
See the code file evaluation_sts.py.
For more detailed information, please refer to https://github.com/princeton-nlp/SimCSE
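STS benchmarks score the Spearman correlation between the model's cosine similarities and human judgments; the real protocol is handled by SentEval, but the metric itself can be sketched on toy data:

```python
import numpy as np
from scipy.stats import spearmanr

# Random stand-ins for sentence-pair embeddings and gold similarity scores.
emb1, emb2 = np.random.randn(50, 256), np.random.randn(50, 256)
gold = np.random.uniform(0, 5, size=50)

# Cosine similarity per sentence pair, compared against the gold scores.
cos = (emb1 * emb2).sum(-1) / (np.linalg.norm(emb1, axis=-1) * np.linalg.norm(emb2, axis=-1))
rho, _ = spearmanr(cos, gold)
print("Spearman:", rho)
```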
Training Scripts:
```bash
torchrun --nproc_per_node=4 --master-port 25110 retrieval.py --config "<configPath>"

# Test that the environment is set up correctly:
torchrun --nproc_per_node=4 --master-port 25110 retrieval.py --config "./configs/test.yaml"

# e.g.
torchrun --nproc_per_node=4 --master-port 25110 retrieval.py --config "./configs/vitb32/coco/only_contrastive.yaml"
torchrun --nproc_per_node=4 --master-port 25110 retrieval.py --config "./configs/vitb32/coco/cusa.yaml"
```
Evaluation Scripts:
```bash
# -- ECCV Caption --
# see evaluation_eccv.py
python evaluation_eccv.py

# -- Img Retrieval --
# see evaluation_img.py
python evaluation_img.py

# -- STS --
# see evaluation_sts.py
python evaluation_sts.py
```
NOTE: The released code has been refactored, so in some cases it may contain bugs that we did not catch; these do not affect the results reported in our paper.
If you have any questions, please submit an issue or contact lerogohl<AT>gmail.com or huanghl<AT>buaa.edu.cn.
- Datasets, checkpoints (re-run), and logs (re-run) can be found at this link: google drive
If you find this method or code useful, please cite:

```bibtex
@inproceedings{huang2024cusa,
  title={Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval},
  author={Huang, Hailang and Nie, Zhijie and Wang, Ziqiao and Shang, Ziyu},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={16},
  pages={18298--18306},
  year={2024}
}
```