Branch jailbreakv #1261
base: main
Co-authored-by: Florian Cheviron [email protected] Co-authored-by: Mathis Franel [email protected]
DCO Assistant Lite bot: I have read the DCO Document and I hereby sign the DCO. 0 out of 2 committers have signed the DCO.
Will take a look! Can you sign the DCO?
This is a great effort. There are a number of things that may benefit from a refactor. The detector taking various characteristics into account is a nice addition. Also, the probes may fit better in the visual_jailbreak module as new probe classes than as a new module set.
Possible locations would be:
probes.visual_jailbreak.JailbreakVText
probes.visual_jailbreak.JailbreakVImage
def __init__(self, config_root=_config):
    """Initializes the probe and loads JailbreakV data from Hugging Face or fallback prompts."""
    super().__init__(config_root=config_root)
    self.cache_dir = Path(_config.transient.cache_dir) / "data" / "jailbreakv"
This project is already well integrated with Hugging Face. Instead of providing a custom cache_dir, this line can be removed so that the probe relies on the Hugging Face cache location.
Suggested change: remove this line.
self.cache_dir = Path(_config.transient.cache_dir) / "data" / "jailbreakv"
dataset = load_dataset(
    "JailbreakV-28K/JailBreakV-28k",
    "JailBreakV_28K",
    cache_dir=str(self.cache_dir / "huggingface_cache"),
The cache directory can simply rely on the Hugging Face default location.
Suggested change: remove this line.
cache_dir=str(self.cache_dir / "huggingface_cache"),
The class structure in this probe does not match expectations. Each class that extends garak.probes.Probe is considered a unique executable probe. In the current implementation this module would produce:
probes.jailbreakv.JailbreakV
probes.jailbreakv.JailbreakVText
probes.jailbreakv.JailbreakVImage
The implementation, however, creates duplication, as probes.jailbreakv.JailbreakV looks to be the superset of all prompts in probes.jailbreakv.JailbreakVText and probes.jailbreakv.JailbreakVImage. If a user were to run:
garak -m nim.Vision -n some_vision_capable_model -p jailbreakv
the run would execute all three probes, duplicating prompts as it executes.
To address this, there are two options. Either the jailbreakv.JailbreakV class implementing the shared code should be a mixin that does not extend garak.probes.Probe, exposing only probes.jailbreakv.JailbreakVText and probes.jailbreakv.JailbreakVImage for unique prompt sets, or the module should be reduced to a single probe that uses DEFAULT_PARAMS to enable filtering and supported modality requirements.
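The mixin option could look roughly like the sketch below. It is not garak code: `Probe` here is a stand-in for garak.probes.Probe, and `JailbreakVMixin`, `modality_filter`, `load_prompts`, and `discover_probes` are hypothetical names used only to show that a shared base which does not extend `Probe` is not picked up as a probe of its own.

```python
class Probe:
    """Stand-in for garak.probes.Probe (assumption: the real class is much richer)."""

    def __init__(self):
        self.prompts = []


class JailbreakVMixin:
    """Shared prompt-selection code. It does not extend Probe, so it is not a probe."""

    modality_filter = None  # hypothetical attribute set by each concrete probe

    def load_prompts(self, records):
        # Keep only the records whose kind matches this probe's modality.
        return [r["text"] for r in records if r["kind"] == self.modality_filter]


class JailbreakVText(JailbreakVMixin, Probe):
    modality_filter = "text"


class JailbreakVImage(JailbreakVMixin, Probe):
    modality_filter = "image"


def discover_probes(candidates):
    """Mimic plugin discovery: keep only concrete subclasses of Probe."""
    return sorted(
        cls.__name__
        for cls in candidates
        if isinstance(cls, type) and issubclass(cls, Probe) and cls is not Probe
    )
```

With this layout, discovery over the four classes yields only `JailbreakVImage` and `JailbreakVText`, so a run of the whole module would not repeat prompts through a superset probe.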
def get_version(self) -> str:
    """Get detector version for reporting

    Returns:
        Semantic version string
    """
    return "1.0.0"

def get_config(self) -> dict:
    """Get detector configuration for audit logging

    Returns:
        Dictionary of configuration parameters
    """
    return {
        "name": self.name,
        "description": self.description,
        "threshold": self.threshold,
        "patterns_count": len(self.signals),
        "version": self.get_version(),
    }
What are these? Currently detectors do not have versions separate from the version of garak being executed.
Suggested change: remove the get_version and get_config methods quoted above.
# Performance metrics for Garak reporting
precision = 0.85  # Precision from validation tests
recall = 0.80  # Recall from validation tests
accuracy = 0.82  # Overall accuracy
These values are not used.
Suggested change: remove the precision, recall, and accuracy attributes quoted above.
# I/O specification
modality = {"out": {"text"}}  # Processes text outputs
Not needed; this is the default modality for detectors.
See:
Lines 31 to 35 in 37d046b
# support mainstream any-to-any large models
# legal element for str list `modality['in']`: 'text', 'image', 'audio', 'video', '3d'
# refer to Table 1 in https://arxiv.org/abs/2401.13601
# we focus on LLM output for detectors
modality: dict = {"out": {"text"}}
Suggested change: remove the I/O specification comment and the modality attribute quoted above.
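As a small illustration of why the attribute is redundant, the sketch below uses a stand-in `Detector` base class that carries the default quoted from the repository. The subclass body and its `threshold` value are hypothetical; the point is only that a class attribute is inherited unless it is redeclared.

```python
class Detector:
    """Stand-in for the garak detector base class (assumption, not the real class)."""

    # Default quoted in the review comment above.
    modality: dict = {"out": {"text"}}


class JailbreakVDetector(Detector):
    """Does not redeclare modality, so it inherits the base default."""

    threshold = 0.5  # hypothetical value for illustration


# The subclass sees the base default without defining its own copy.
inherited_modality = JailbreakVDetector.modality
declares_own_modality = "modality" in vars(JailbreakVDetector)
```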
dataset = load_dataset(
    "JailbreakV-28K/JailBreakV-28k",
    "JailBreakV_28K",
    cache_dir=str(self.cache_dir / "huggingface_cache"),
)["JailBreakV_28K"]
It would be reasonable to do a full download of the dataset repo rather than just the base dataset table.
Suggested change:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="JailbreakV-28K/JailBreakV-28k", repo_type="dataset")
dataset = load_dataset(
    "JailbreakV-28K/JailBreakV-28k",
    "JailBreakV_28K",
    cache_dir=str(self.cache_dir / "huggingface_cache"),
)["JailBreakV_28K"]
#!/usr/bin/env python3
Individual probes should not have #! entries, as these files are not entry points for executing the project.
Suggested change: remove the #!/usr/bin/env python3 line.
# garak/detectors/jailbreakv.py
Not needed.
Suggested change: remove the # garak/detectors/jailbreakv.py comment line.
Please consider adding pytest coverage for detection to better document what will pass or fail this detector.
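A rough sketch of what such coverage might look like follows. The real JailbreakVDetector signals and threshold are not shown in this thread, so `SIGNALS`, `THRESHOLD`, `score_output`, and `is_hit` are hypothetical stand-ins; only the shape of the tests (one output that should be flagged, one that should not) is the point. The `test_` functions use plain asserts, so pytest would collect them as written.

```python
import re

# Hypothetical stand-ins: the real detector's patterns and threshold are not shown here.
SIGNALS = [
    re.compile(r"\bas an unrestricted ai\b", re.IGNORECASE),
    re.compile(r"\bdeveloper mode enabled\b", re.IGNORECASE),
]
THRESHOLD = 0.5


def score_output(text: str) -> float:
    """Return the fraction of signal patterns found in the text."""
    hits = sum(1 for pattern in SIGNALS if pattern.search(text))
    return hits / len(SIGNALS)


def is_hit(text: str) -> bool:
    """An output counts as a hit when its score reaches the threshold."""
    return score_output(text) >= THRESHOLD


def test_jailbroken_output_is_flagged():
    assert is_hit("Developer Mode enabled. Here is what you asked for.")


def test_refusal_is_not_flagged():
    assert not is_hit("I can't help with that request.")
```

Tests of this kind double as documentation: each case names an output and states whether the detector is expected to flag it.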
What does this change do?
This PR adds a new multimodal probe (JailbreakV) and a heuristic detector (JailbreakVDetector) to Garak. The probe uses the JailbreakV-28K dataset and supports both text-only and image+text prompts. The detector uses pattern-based heuristics to identify common jailbreak strategies in LLM outputs.
This addresses #1099 by providing robust coverage for JailbreakV-style attacks, including multimodal scenarios.
Note:
Due to the large size of the JailbreakV-28K dataset, running the full pytest suite with this probe can take a long time and may hit Hugging Face rate limits, especially when downloading many images. For practical testing, we recommend using a subset of the dataset during tests.
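One way to take such a subset is sketched below. It is an assumption about how a test might limit the data, not code from this PR: `take_subset`, `limit`, and `seed` are hypothetical names, and the helper works on any sequence of records rather than on the Hugging Face dataset object itself.

```python
import random


def take_subset(records, limit, seed=0):
    """Return up to `limit` records, sampled reproducibly with a fixed seed."""
    records = list(records)
    if limit >= len(records):
        return records
    # A dedicated Random instance keeps the sample stable across test runs.
    return random.Random(seed).sample(records, limit)
```

A fixed seed keeps the selected prompts the same between runs, which makes test failures easier to reproduce.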