MMJ-Bench is a comprehensive benchmark designed to systematically evaluate existing multi-modal jailbreak attacks and defenses in a unified manner. The jailbreak attack and defense techniques evaluated in our paper are summarized in the following tables.
Method | Source | Key Properties |
---|---|---|
FigStep | FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts | Generation-based |
MM-SafetyBench | MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models | Generation-based |
VisualAdv | Visual Adversarial Examples Jailbreak Aligned Large Language Models | Optimization-based |
ImgJP | Jailbreaking Attack against Multimodal Large Language Model | Optimization-based |
AttackVLM | On Evaluating Adversarial Robustness of Large Vision-Language Models | Generation-based |
Hades | Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models | Generation-based |
Our evaluation results for MLLM jailbreak attacks across six models are reported in our paper.

The defense techniques evaluated in MMJ-Bench are as follows:
Method | Source | Key Properties |
---|---|---|
VLGuard | Safety Fine-tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | Proactive defense |
AdaShield | AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting | Reactive defense |
JailGuard | JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks | Reactive defense |
CIDER | Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models | Reactive defense |
conda create -n MMJ-Bench python=3.10
conda activate MMJ-Bench
pip install -r requirements.txt
python -m spacy download en_core_web_sm
In the first step, jailbreak attack techniques are used to generate test cases with generate_test_cases.py:
./scripts/generate_test_cases.sh $method_name $behaviors_path $save_dir
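For example, the following invocation generates FigStep test cases. This is a minimal sketch: the behaviors file and save directory shown here are illustrative placeholders, not fixed repo defaults; substitute the paths from your own setup.

```bash
# Illustrative values; the paths below are assumptions.
method_name=FigStep                     # any attack from the table above
behaviors_path=./data/behaviors.csv     # assumed HarmBench-style behaviors file
save_dir=./test_cases/FigStep           # output directory for generated test cases

./scripts/generate_test_cases.sh $method_name $behaviors_path $save_dir
```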
After generating test cases, we can generate completions for a target model, with or without defense techniques.
Without defense methods:
./scripts/generate_completions.sh $model_name $behaviors_path $test_cases_path $save_path $max_new_tokens $incremental_update
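As a sketch, a run without defenses might look like the following; the model name, paths, token budget, and incremental-update flag are assumptions, so use the values from your own configuration:

```bash
# Placeholder arguments; adjust to your local files and target model.
./scripts/generate_completions.sh llava-v1.5-7b ./data/behaviors.csv \
    ./test_cases/FigStep ./completions/llava-v1.5-7b.json 512 False
```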
With defense methods:
./scripts/generate_completions_defense.sh $attack_type $target_model $defense_type
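For instance, pairing one attack and one defense from the tables above (all three names here are illustrative):

```bash
# Any attack/model/defense combination from the tables above can be substituted.
./scripts/generate_completions_defense.sh FigStep llava-v1.5-7b AdaShield
```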
After generating completions from the target model in Step 2, we use the classifier provided by HarmBench to label whether each completion is an example of its corresponding behavior.
./scripts/evaluate_completions.sh $cls_path $behaviors_path $completions_path $save_path
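A sketch of a full invocation, assuming the HarmBench classifier (e.g. HarmBench-Llama-2-13b-cls) has been downloaded locally; all paths below are placeholders:

```bash
# Placeholder paths; cls_path should point to the downloaded HarmBench classifier.
./scripts/evaluate_completions.sh ./cls/HarmBench-Llama-2-13b-cls ./data/behaviors.csv \
    ./completions/llava-v1.5-7b.json ./results/llava-v1.5-7b.json
```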
We also provide a training script for supervised fine-tuning with VLGuard.
We thank the following open-source repositories.
[1] https://github.com/centerforaisafety/HarmBench
[2] https://github.com/thuccslab/figstep
[3] https://github.com/isXinLiu/MM-SafetyBench
[4] https://github.com/Unispac/Visual-Adversarial-Examples-Jailbreak-Large-Language-Models
[5] https://github.com/abc03570128/Jailbreaking-Attack-against-Multimodal-Large-Language-Model
[6] https://github.com/yunqing-me/AttackVLM
[7] https://github.com/AoiDragon/HADES
[8] https://github.com/shiningrain/JailGuard
[9] https://github.com/SaFoLab-WISC/AdaShield
If you find MMJ-Bench useful in your research, please consider citing our paper:
@article{weng2024textit,
  title={\textit{MMJ-Bench}: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models},
  author={Weng, Fenghua and Xu, Yue and Fu, Chengyan and Wang, Wenjie},
  journal={arXiv preprint arXiv:2408.08464},
  year={2024}
}