
# MMJ-Bench

MMJ-Bench is a comprehensive benchmark designed to systematically evaluate existing multi-modal jailbreak attacks and defenses in a unified manner. The attack and defense techniques evaluated in our paper are summarized in the tables below.

## Multi-modal jailbreak attacks

| Method | Source | Key Properties |
|---|---|---|
| FigStep | FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts | Generation-based |
| MM-SafetyBench | MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models | Generation-based |
| VisualAdv | Visual Adversarial Examples Jailbreak Aligned Large Language Models | Optimization-based |
| ImgJP | Jailbreaking Attack against Multimodal Large Language Model | Optimization-based |
| AttackVLM | On Evaluating Adversarial Robustness of Large Vision-Language Models | Generation-based |
| Hades | Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models | Generation-based |

Our evaluation results for MLLM jailbreak attacks across six models are as follows:

*(Figure: attack success rate (ASR) of each attack across the six target models.)*

## Multi-modal jailbreak defenses

| Method | Source | Key Properties |
|---|---|---|
| VLGuard | Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | Proactive defense |
| AdaShield | AdaShield: Safeguarding Multimodal Large Language Models from Structure-Based Attack via Adaptive Shield Prompting | Reactive defense |
| JailGuard | JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks | Reactive defense |
| CIDER | Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models | Reactive defense |

## Usage

### Installation

```bash
conda create -n MMJ-Bench python=3.10
conda activate MMJ-Bench
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

### Step 1 - Generate Test Cases

In the first step, jailbreak attack techniques are used to generate test cases with `generate_test_cases.py`:

```bash
./scripts/generate_test_cases.sh $method_name $behaviors_path $save_dir
```
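
For example, a hypothetical invocation that generates FigStep test cases; the behavior-file path and save directory below are placeholders for your local layout, not paths fixed by the repository:

```bash
# Hypothetical example: FigStep is one of the methods in the attacks table above;
# both paths are placeholders -- substitute your own behavior file and output directory.
./scripts/generate_test_cases.sh FigStep ./data/behaviors.csv ./test_cases/FigStep
```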

### Step 2 - Generate Completions

After generating test cases, we can generate completions for a target model, with or without defense techniques.

Without defense methods:

```bash
./scripts/generate_completions.sh $model_name $behaviors_path $test_cases_path $save_path $max_new_tokens $incremental_update
```
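
A hypothetical invocation is sketched below; the model name, paths, and values are illustrative placeholders, and we assume `$incremental_update` is a True/False flag as in the HarmBench pipeline this codebase builds on:

```bash
# Hypothetical example: every argument is a placeholder.
# llava-v1.5-7b : one of the target MLLMs (substitute your own model name)
# 512           : maximum number of new tokens per completion
# False         : assumed incremental_update flag (regenerate all completions)
./scripts/generate_completions.sh llava-v1.5-7b ./data/behaviors.csv \
    ./test_cases/FigStep ./completions/FigStep/llava-v1.5-7b.json 512 False
```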

With defense methods:

```bash
./scripts/generate_completions_defense.sh $attack_type $target_model $defense_type
```
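
For example, the following hypothetical call pairs one attack and one defense from the tables above; the argument spellings are placeholders, so check the script for the exact names it expects:

```bash
# Hypothetical example: attack, target model, and defense names are placeholders
# drawn from the tables above, not verified argument spellings.
./scripts/generate_completions_defense.sh FigStep llava-v1.5-7b AdaShield
```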

### Step 3 - Evaluate Completions

After generating completions from a target model in Step 2, we use the classifier provided by HarmBench to label whether each completion is an example of its corresponding behavior.

```bash
./scripts/evaluate_completions.sh $cls_path $behaviors_path $completions_path $save_path
```
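
As a sketch, assuming `$cls_path` accepts the Hugging Face ID of the public HarmBench classifier (cais/HarmBench-Llama-2-13b-cls) or a local copy of it, with the other paths as placeholders:

```bash
# Hypothetical example: the classifier ID is the public HarmBench classifier;
# all other paths are placeholders matching the Step 2 example above.
./scripts/evaluate_completions.sh cais/HarmBench-Llama-2-13b-cls ./data/behaviors.csv \
    ./completions/FigStep/llava-v1.5-7b.json ./results/FigStep/llava-v1.5-7b.json
```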

We also provide a training script for supervised fine-tuning with VLGuard.

## Acknowledgement and citation

We thank the following open-source repositories.

[1]  https://github.com/centerforaisafety/HarmBench
[2]  https://github.com/thuccslab/figstep
[3]  https://github.com/isXinLiu/MM-SafetyBench
[4]  https://github.com/Unispac/Visual-Adversarial-Examples-Jailbreak-Large-Language-Models
[5]  https://github.com/abc03570128/Jailbreaking-Attack-against-Multimodal-Large-Language-Model
[6]  https://github.com/yunqing-me/AttackVLM
[7]  https://github.com/AoiDragon/HADES
[8]  https://github.com/shiningrain/JailGuard
[9]  https://github.com/SaFoLab-WISC/AdaShield

If you find this useful in your research, please consider citing our paper:

```bibtex
@article{weng2024textit,
  title={\textit{MMJ-Bench}: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models},
  author={Weng, Fenghua and Xu, Yue and Fu, Chengyan and Wang, Wenjie},
  journal={arXiv preprint arXiv:2408.08464},
  year={2024}
}
```
