Branch jailbreakv #1261

Draft: wants to merge 9 commits into base: main

N0xAh commented Jun 17, 2025

What does this change do?

This PR adds a new multimodal probe (JailbreakV) and a heuristic detector (JailbreakVDetector) to garak. The probe uses the JailbreakV-28K dataset and supports both text-only and image+text prompts. The detector uses pattern-based heuristics to identify common jailbreak strategies in LLM outputs.

This addresses #1099 by providing robust coverage for JailbreakV-style attacks, including multimodal scenarios.

Note:
Due to the large size of the JailbreakV-28K dataset, running the full pytest suite with this probe can take a long time and may hit Hugging Face rate limits, especially when downloading many images. For practical testing, we recommend using a subset of the dataset during tests.
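
One way to keep test runs small is to cap the number of prompts deterministically. The sketch below is not part of this PR: `sample_subset`, `max_items`, and `seed` are hypothetical names used only for illustration, and the prompt list is placeholder data rather than the real dataset.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_subset(items: Sequence[T], max_items: int, seed: int = 0) -> List[T]:
    """Return at most max_items entries, chosen reproducibly.

    Hypothetical helper: the name, parameters, and default seed are
    assumptions, not part of the probe in this PR.
    """
    if max_items < 0:
        raise ValueError("max_items must be non-negative")
    if len(items) <= max_items:
        return list(items)
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return rng.sample(list(items), max_items)


prompts = [f"prompt {i}" for i in range(1000)]  # placeholder data
subset = sample_subset(prompts, max_items=20)
```

A fixed seed keeps the subset stable between runs, so a failing test can be reproduced without downloading the full dataset.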

DCO Assistant Lite bot:
Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you all sign our Developer Certificate of Origin before we can accept your contribution. You can sign the DCO by posting a pull request comment in the format below.


I have read the DCO Document and I hereby sign the DCO


0 out of 2 committers have signed the DCO.
@MathisFranel
@N0xAh
You can retrigger this bot by commenting recheck in this Pull Request

leondz (Collaborator) commented Jun 26, 2025

Will take a look! Can you sign the DCO?

leondz added the probes, detectors, and new plugin labels Jun 26, 2025
jmartin-tech (Collaborator) left a comment

This is a great effort, though a number of things may benefit from a refactor. The detector taking various characteristics into account is a nice addition. The probes may also fit better in the visual_jailbreak module as new probe classes than as a new module set.

Possible location would be:

probes.visual_jailbreak.JailbreakVText
probes.visual_jailbreak.JailbreakVImage

def __init__(self, config_root=_config):
    """Initializes the probe and loads JailbreakV data from Hugging Face or fallback prompts."""
    super().__init__(config_root=config_root)
    self.cache_dir = Path(_config.transient.cache_dir) / "data" / "jailbreakv"

This project is already well integrated with Hugging Face. Instead of providing a custom cache_dir, this line can be removed so that the probe relies on the Hugging Face cache location.

Suggested change
- self.cache_dir = Path(_config.transient.cache_dir) / "data" / "jailbreakv"

dataset = load_dataset(
    "JailbreakV-28K/JailBreakV-28k",
    "JailBreakV_28K",
    cache_dir=str(self.cache_dir / "huggingface_cache"),

The cache dir can simply rely on the Hugging Face default location.

Suggested change
- cache_dir=str(self.cache_dir / "huggingface_cache"),

The class structure in this probe does not match expectations. Each class that extends garak.probes.Probe is considered a unique executable probe. The current implementation would therefore produce:

probes.jailbreakv.JailbreakV
probes.jailbreakv.JailbreakVText
probes.jailbreakv.JailbreakVImage

This creates duplication, because probes.jailbreakv.JailbreakV looks to be a superset of all prompts in probes.jailbreakv.JailbreakVText and probes.jailbreakv.JailbreakVImage. If a user were to run:

garak -m nim.Vision -n some_vision_capable_model -p jailbreakv

the run would execute all three probes, duplicating prompts as it executes.

To address this, either (a) the jailbreakv.JailbreakV class implementing the shared code should be a mixin that does not extend garak.probes.Probe, exposing only probes.jailbreakv.JailbreakVText and probes.jailbreakv.JailbreakVImage for unique prompt sets, or (b) the module should be reduced to a single probe that has DEFAULT_PARAMS to enable filtering and supported modality requirements.
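
The mixin option can be sketched as follows. This is only an illustration under stated assumptions: `Probe` here is a local stand-in for garak.probes.Probe so the example runs on its own, `JailbreakVMixin`, `_records`, `_load`, and `probe_classes` are hypothetical names, and the records are placeholders rather than dataset rows.

```python
class Probe:
    """Stand-in for garak.probes.Probe, only so this sketch is self-contained."""

    def __init__(self):
        self.prompts = []


class JailbreakVMixin:
    """Shared loading logic. It does NOT extend Probe, so it is not a probe."""

    # Hypothetical placeholder records standing in for dataset rows.
    _records = [
        {"text": "text-only prompt", "image": None},
        {"text": "prompt with image", "image": "example.png"},
    ]

    def _load(self, want_image: bool):
        # Keep only records whose image presence matches the request.
        return [r for r in self._records if (r["image"] is not None) == want_image]


class JailbreakVText(JailbreakVMixin, Probe):
    def __init__(self):
        super().__init__()
        self.prompts = self._load(want_image=False)


class JailbreakVImage(JailbreakVMixin, Probe):
    def __init__(self):
        super().__init__()
        self.prompts = self._load(want_image=True)


def probe_classes(candidates):
    """Return names of candidate classes that are strict subclasses of Probe."""
    return sorted(
        cls.__name__
        for cls in candidates
        if issubclass(cls, Probe) and cls is not Probe
    )


exposed = probe_classes([JailbreakVMixin, JailbreakVText, JailbreakVImage])
```

Because the mixin is not a Probe subclass, only the two concrete classes count as probes in this sketch, and their prompt sets do not overlap.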

Comment on lines +160 to +180
    def get_version(self) -> str:
        """Get detector version for reporting

        Returns:
            Semantic version string
        """
        return "1.0.0"

    def get_config(self) -> dict:
        """Get detector configuration for audit logging

        Returns:
            Dictionary of configuration parameters
        """
        return {
            "name": self.name,
            "description": self.description,
            "threshold": self.threshold,
            "patterns_count": len(self.signals),
            "version": self.get_version(),
        }

What are these? Currently, detectors do not have versions separate from the version of garak being executed.

Suggested change
-    def get_version(self) -> str:
-        """Get detector version for reporting
-        Returns:
-            Semantic version string
-        """
-        return "1.0.0"
-    def get_config(self) -> dict:
-        """Get detector configuration for audit logging
-        Returns:
-            Dictionary of configuration parameters
-        """
-        return {
-            "name": self.name,
-            "description": self.description,
-            "threshold": self.threshold,
-            "patterns_count": len(self.signals),
-            "version": self.get_version(),
-        }

Comment on lines +27 to +31
# Performance metrics for Garak reporting
precision = 0.85 # Precision from validation tests
recall = 0.80 # Recall from validation tests
accuracy = 0.82 # Overall accuracy

These values are not used.

Suggested change
- # Performance metrics for Garak reporting
- precision = 0.85 # Precision from validation tests
- recall = 0.80 # Recall from validation tests
- accuracy = 0.82 # Overall accuracy

Comment on lines +32 to +33
# I/O specification
modality = {"out": {"text"}} # Processes text outputs

Not needed, this is the default modality for detectors.
See:

# support mainstream any-to-any large models
# legal element for str list `modality['in']`: 'text', 'image', 'audio', 'video', '3d'
# refer to Table 1 in https://arxiv.org/abs/2401.13601
# we focus on LLM output for detectors
modality: dict = {"out": {"text"}}

Suggested change
- # I/O specification
- modality = {"out": {"text"}} # Processes text outputs
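
To illustrate why the override is redundant, here is a minimal sketch. `Detector` below is a local stand-in modeled only on the attribute quoted above, not the real garak base class, and the subclass body is an assumption.

```python
class Detector:
    """Stand-in base class carrying only the default quoted in the comment."""

    modality: dict = {"out": {"text"}}


class JailbreakVDetector(Detector):
    """No modality override: the class attribute is inherited from the base."""


inherited = JailbreakVDetector.modality
```

The subclass reports the same modality without declaring it, so removing the two lines does not change behavior in this sketch.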

Comment on lines +78 to +82
dataset = load_dataset(
    "JailbreakV-28K/JailBreakV-28k",
    "JailBreakV_28K",
    cache_dir=str(self.cache_dir / "huggingface_cache"),
)["JailBreakV_28K"]

It would be reasonable to do a full download of the dataset repo rather than just the base dataset table.

Suggested change
- dataset = load_dataset(
-     "JailbreakV-28K/JailBreakV-28k",
-     "JailBreakV_28K",
-     cache_dir=str(self.cache_dir / "huggingface_cache"),
- )["JailBreakV_28K"]
+ from huggingface_hub import snapshot_download
+ snapshot_download(repo_id="JailbreakV-28K/JailBreakV-28k", repo_type="dataset")
+ dataset = load_dataset(
+     "JailbreakV-28K/JailBreakV-28k",
+     "JailBreakV_28K",
+     cache_dir=str(self.cache_dir / "huggingface_cache"),
+ )["JailBreakV_28K"]

Comment on lines +1 to +2
#!/usr/bin/env python3

Individual probes should not have #! entries as these are not entry points to execute the project.

Suggested change
- #!/usr/bin/env python3

Comment on lines +1 to +2
# garak/detectors/jailbreakv.py

Not needed.

Suggested change
- # garak/detectors/jailbreakv.py

Please consider adding pytest coverage for detection to better document what will pass or fail this detector.
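
As a rough idea of what such coverage might look like, the sketch below pairs a tiny pattern-based scorer with pytest-style test functions. Everything in it is an assumption for illustration: the detector's real patterns, threshold, and interface are not shown in this thread, so `SIGNALS`, `THRESHOLD`, `score_output`, and `is_hit` are hypothetical stand-ins rather than the PR's code.

```python
import re

# Hypothetical signals; the real detector's patterns are not shown here.
SIGNALS = [
    re.compile(r"\bignore (all )?previous instructions\b", re.IGNORECASE),
    re.compile(r"\bas an unrestricted ai\b", re.IGNORECASE),
]
THRESHOLD = 0.5  # assumed cut-off, not taken from the PR


def score_output(text: str) -> float:
    """Return the fraction of signals that match the text (0.0 to 1.0)."""
    hits = sum(1 for pattern in SIGNALS if pattern.search(text))
    return hits / len(SIGNALS)


def is_hit(text: str) -> bool:
    """Treat the output as a jailbreak hit when the score reaches the threshold."""
    return score_output(text) >= THRESHOLD


def test_flags_known_jailbreak_phrase():
    assert is_hit("Sure. As an unrestricted AI, I will ignore previous instructions.")


def test_passes_plain_refusal():
    assert not is_hit("I can't help with that request.")


def test_single_signal_meets_threshold():
    assert score_output("Please ignore all previous instructions") == 0.5
```

Tests named after the behavior they pin down double as documentation of which outputs the detector is expected to flag and which it should let through.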

leondz marked this pull request as draft August 7, 2025 09:49
Labels: detectors, new plugin, probes
Projects: None yet
4 participants