Link-Context Learning for Multimodal LLMs [CVPR 2024]

Yan Tai^*,2,3,4 Weichen Fan^*,†,3 Zhao Zhang³ Ziwei Liu^✉,1

¹S-Lab, Nanyang Technological University ²Shanghai Jiao Tong University ³SenseTime Research
⁴Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China

^* Equal Contribution ^† Project Lead ^✉ Corresponding Author

Official PyTorch implementation of "Link-Context Learning for Multimodal LLMs" [CVPR 2024].

Updates

28 Feb, 2024: 💥💥 Our paper has been accepted by CVPR 2024! 🎉
05 Sep, 2023: We release the code, data, and LCL-2WAY-WEIGHT checkpoint.
24 Aug, 2023: We release the online demo at 🔗LCL-Demo🔗.
17 Aug, 2023: We release the two subsets of ISEKAI (ISEKAI-10 and ISEKAI-pair) on Hugging Face 🤗.

This repository contains the official implementation and dataset of the following paper:

Link-Context Learning for Multimodal LLMs
https://arxiv.org/abs/2308.07891

Abstract: The ability to learn from context with novel concepts, and deliver appropriate responses are essential in human conversations. Despite current Multimodal Large Language Models (MLLMs) and Large Language Models (LLMs) being trained on mega-scale datasets, recognizing unseen images or understanding novel concepts in a training-free manner remains a challenge. In-Context Learning (ICL) explores training-free few-shot learning, where models are encouraged to "learn to learn" from limited tasks and generalize to unseen tasks. In this work, we propose link-context learning (LCL), which emphasizes "reasoning from cause and effect" to augment the learning capabilities of MLLMs. LCL goes beyond traditional ICL by explicitly strengthening the causal relationship between the support set and the query set. By providing demonstrations with causal links, LCL guides the model to discern not only the analogy but also the underlying causal associations between data points, which empowers MLLMs to recognize unseen images and understand novel concepts more effectively. To facilitate the evaluation of this novel approach, we introduce the ISEKAI dataset, comprising exclusively of unseen generated image-label pairs designed for link-context learning. Extensive experiments show that our LCL-MLLM exhibits strong link-context learning capabilities to novel concepts over vanilla MLLMs.

Todo

Release the ISEKAI-10 and ISEKAI-pair.
Release the dataset usage.
Release the demo.
Release the codes and checkpoints.
Release the full ISEKAI dataset.
Release checkpoints supporting few-shot detection and vqa tasks.

Install

conda create -n lcl python=3.10
conda activate lcl
pip install -r requirements.txt

Configure accelerate

accelerate config

Dataset

ImageNet

We train the model in the LCL setting on our rebuilt ImageNet-900 set and evaluate it on the ImageNet-100 set. You can get the dataset JSON here.

ISEKAI

We evaluate the model on ISEKAI-10 and ISEKAI-pair. You can download the ISEKAI dataset from ISEKAI-10 and ISEKAI-pair.

Checkpoint

Download our LCL-2WAY-WEIGHT and LCL-MIX checkpoints from Hugging Face.

Demo

To launch a Gradio web demo, use the following command. Note that the model runs in torch.float16, which requires a GPU with at least 16GB of memory.

python ./mllm/demo/demo.py --model_path /path/to/lcl/ckpt

The demo can also run with 8-bit quantization, at the cost of some performance.

python ./mllm/demo/demo.py --model_path /path/to/lcl/ckpt --load_in_8bit
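
The memory trade-off between the two demo modes can be estimated with simple arithmetic. The sketch below counts weight storage only (2 bytes per parameter in float16, 1 byte in 8-bit) and ignores activations and framework overhead. The README does not state the model size, so `PARAM_COUNT` is a hypothetical value chosen for illustration.

```python
# Rough weight-memory estimate for the two demo modes.
# PARAM_COUNT is an assumption for illustration; the README does not state it.
BYTES_PER_PARAM = {"float16": 2, "int8": 1}


def weight_memory_gib(param_count: int, dtype: str) -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return param_count * BYTES_PER_PARAM[dtype] / 2**30


PARAM_COUNT = 7_000_000_000  # hypothetical 7B-parameter model

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(PARAM_COUNT, dtype):.2f} GiB")
```

Under that assumption the float16 weights alone come to about 13 GiB, which would be consistent with the 16GB requirement above, and 8-bit loading halves the weight footprint.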

Train

After preparing data, you can train the model using the command:

LCL-2Way-Weight

accelerate launch --num_processes 4 \
        --main_process_port 23786 \
        mllm/pipeline/finetune.py \
        config/lcl_train_2way_weight.py \
        --cfg-options data_args.use_icl=True \
        --cfg-options model_args.model_name_or_path=/path/to/init/checkpoint

LCL-2Way-Mix

accelerate launch --num_processes 4 \
        --main_process_port 23786 \
        mllm/pipeline/finetune.py \
        config/lcl_train_mix1.py \
        --cfg-options data_args.use_icl=True \
        --cfg-options model_args.model_name_or_path=/path/to/init/checkpoint
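
The two training commands above differ only in the config file. If you script several runs, a small helper can assemble the argument list. This is a minimal sketch: `build_launch_command` is a hypothetical helper, not part of this repository, and it only reuses the flags shown in the commands above.

```python
import shlex


def build_launch_command(config_path, checkpoint_path,
                         num_processes=4, port=23786, extra_args=()):
    """Assemble the accelerate launch argument list used in this README."""
    return [
        "accelerate", "launch",
        "--num_processes", str(num_processes),
        "--main_process_port", str(port),
        "mllm/pipeline/finetune.py",
        config_path,
        "--cfg-options", "data_args.use_icl=True",
        "--cfg-options", f"model_args.model_name_or_path={checkpoint_path}",
        *extra_args,  # e.g. Trainer arguments appended after the config options
    ]


command = build_launch_command(
    "config/lcl_train_2way_weight.py", "/path/to/init/checkpoint"
)
# Print a shell-safe, copy-pasteable version of the command.
print(shlex.join(command))
```

The list form can be passed directly to `subprocess.run(command, check=True)` without going through a shell.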

Inference

After preparing the data, you can run inference using the command:

ImageNet-100

accelerate launch --num_processes 4 \
        --main_process_port 23786 \
        mllm/pipeline/finetune.py \
        config/lcl_eval_ISEKAI_10.py \
        --cfg-options data_args.use_icl=True \
        --cfg-options model_args.model_name_or_path=/path/to/checkpoint

Both mmengine-style arguments and Hugging Face Trainer arguments are supported. For example, you can change the evaluation batch size as in the ISEKAI commands below:

ISEKAI

# ISEKAI10
accelerate launch --num_processes 4 \
        --main_process_port 23786 \
        mllm/pipeline/finetune.py \
        config/shikra_eval_multi_pope.py \
        --cfg-options data_args.use_icl=True \
        --cfg-options model_args.model_name_or_path=/path/to/checkpoint \
        --per_device_eval_batch_size 1

# ISEKAI-PAIR
accelerate launch --num_processes 4 \
        --main_process_port 23786 \
        mllm/pipeline/finetune.py \
        config/shikra_eval_multi_pope.py \
        --cfg-options data_args.use_icl=True \
        --cfg-options model_args.model_name_or_path=/path/to/checkpoint \
        --per_device_eval_batch_size 1

Here, --cfg-options a=balabala b=balabala is an mmengine-style argument; its values overwrite the corresponding arguments predefined in the config file. --per_device_eval_batch_size is a Hugging Face Trainer argument.
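
The override behaviour described above can be pictured as dotted `key=value` pairs merged into a nested config. The sketch below is illustrative only and is not mmengine's actual implementation; `apply_overrides`, `parse_value`, and the `base` dictionary are hypothetical names and values.

```python
import ast
from copy import deepcopy


def parse_value(text):
    """Interpret 'True', '1', '0.5' and similar as Python literals; otherwise keep the string."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def apply_overrides(config, options):
    """Return a copy of `config` with each dotted 'key=value' option applied."""
    merged = deepcopy(config)
    for option in options:
        dotted_key, _, raw_value = option.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = merged
        for name in parents:
            # Walk down the nested dicts, creating levels that do not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return merged


# Hypothetical stand-in for values predefined in a config file.
base = {"data_args": {"use_icl": False}, "model_args": {"model_name_or_path": None}}
overridden = apply_overrides(
    base,
    ["data_args.use_icl=True", "model_args.model_name_or_path=/path/to/checkpoint"],
)
print(overridden)
```

Values given on the command line win over the file's values, and the original config is left untouched.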

The prediction results are saved to output_dir/multitest_xxxx_extra_prediction.jsonl, in the same order as the input dataset.
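
Because the predictions keep the input order, they can be matched back to the inputs by position. The following is a minimal sketch under assumptions: the record fields (`id`, `pred`) and the demo file name are hypothetical, since the README does not document the JSONL schema.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def pair_by_order(inputs, predictions):
    """Pair the i-th input with the i-th prediction, refusing mismatched lengths."""
    if len(inputs) != len(predictions):
        raise ValueError(
            f"{len(inputs)} inputs but {len(predictions)} predictions"
        )
    return list(zip(inputs, predictions))


# Demo with made-up records; the field names are hypothetical.
inputs = [{"id": 0}, {"id": 1}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "predictions_demo.jsonl")
    with open(path, "w", encoding="utf-8") as handle:
        for answer in ("cat", "dog"):
            handle.write(json.dumps({"pred": answer}) + "\n")
    pairs = pair_by_order(inputs, load_jsonl(path))

for sample, prediction in pairs:
    print(sample, prediction)
```

The length check guards against silently misaligned pairs if a run was interrupted before all predictions were written.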

Cite

@inproceedings{tai2023link,
  title={Link-Context Learning for Multimodal LLMs},
  author={Tai, Yan and Fan, Weichen and Zhang, Zhao and Liu, Ziwei},
  booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR)},
  year={2024}
}

