Branch jailbreakv #1261

N0xAh · 2025-06-17T14:44:49Z

What does this change do?

This PR adds a new multimodal probe (JailbreakV) and a heuristic detector (JailbreakVDetector) to Garak. The probe uses the JailbreakV-28K dataset and supports both text-only and image+text prompts. The detector uses pattern-based heuristics to identify common jailbreak strategies in LLM outputs.

This addresses #1099 by providing robust coverage for JailbreakV-style attacks, including multimodal scenarios.

Note:
Due to the large size of the JailbreakV-28K dataset, running the full pytest suite with this probe can take a long time and may hit Hugging Face rate limits, especially when downloading many images. For practical testing, we recommend using a subset of the dataset during tests.
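One way to keep test runs small is to draw a fixed-size, seeded sample of row indices before any images are fetched. The sketch below is illustrative only: `pick_subset_indices`, the sample size, the seed, and the 28,000 row count (inferred from the dataset's name) are assumptions, not part of this PR.

```python
import random


def pick_subset_indices(total_rows: int, sample_size: int, seed: int = 0) -> list[int]:
    """Return a reproducible, sorted sample of row indices.

    Hypothetical helper: the name and defaults are illustrative, not part of the PR.
    """
    if total_rows < 0 or sample_size < 0:
        raise ValueError("total_rows and sample_size must be non-negative")
    # Never ask for more rows than the dataset holds.
    k = min(sample_size, total_rows)
    # A dedicated Random instance keeps the global RNG state untouched.
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_rows), k))


# Example: pick 50 rows out of an assumed 28,000 for a fast test run.
indices = pick_subset_indices(total_rows=28_000, sample_size=50, seed=42)
```

With the Hugging Face `datasets` library, indices like these could be passed to `Dataset.select(indices)` so that only the sampled rows are touched during tests.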

Co-authored-by: Florian Cheviron [email protected] Co-authored-by: Mathis Franel [email protected]

github-actions · 2025-06-17T14:44:59Z

DCO Assistant Lite bot:
Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you all sign our Developer Certificate of Origin before we can accept your contribution. You can sign the DCO by posting a Pull Request comment in the format below.

I have read the DCO Document and I hereby sign the DCO

0 out of 2 committers have signed the DCO.
❌ @MathisFranel
❌ @N0xAh
You can retrigger this bot by commenting recheck in this Pull Request.

leondz · 2025-06-26T04:50:28Z

Will take a look! Can you sign the DCO?

jmartin-tech

This is a great effort. A number of things may benefit from a refactor. The detector taking various characteristics into account is a nice addition. The probes may also fit better in the visual_jailbreak module as new probe classes than as a new module set.

Possible locations would be:

probes.visual_jailbreak.JailbreakVText
probes.visual_jailbreak.JailbreakVImage

jmartin-tech · 2025-07-07T15:50:32Z

garak/probes/jailbreakv.py

+    def __init__(self, config_root=_config):
+        """Initializes the probe and loads JailbreakV data from Hugging Face or fallback prompts."""
+        super().__init__(config_root=config_root)
+        self.cache_dir = Path(_config.transient.cache_dir) / "data" / "jailbreakv"


This project is already well integrated with Hugging Face. Instead of providing a custom cache_dir, this line can be removed so the probe relies on the default Hugging Face cache location.

Suggested change

-        self.cache_dir = Path(_config.transient.cache_dir) / "data" / "jailbreakv"

jmartin-tech · 2025-07-07T15:51:20Z

garak/probes/jailbreakv.py

+            dataset = load_dataset(
+                "JailbreakV-28K/JailBreakV-28k",
+                "JailBreakV_28K",
+                cache_dir=str(self.cache_dir / "huggingface_cache"),


Cache dir can simply rely on huggingface default location.

Suggested change

-                cache_dir=str(self.cache_dir / "huggingface_cache"),

jmartin-tech · 2025-07-07T16:03:32Z

garak/probes/jailbreakv.py

The class structure in this probe does not match expectations. Each class that extends garak.probes.Probe is treated as a unique executable probe. The current implementation would therefore expose three probes:

probes.jailbreakv.JailbreakV
probes.jailbreakv.JailbreakVText
probes.jailbreakv.JailbreakVImage

This creates duplication, because probes.jailbreakv.JailbreakV looks to be the superset of all prompts in probes.jailbreakv.JailbreakVText and probes.jailbreakv.JailbreakVImage. If a user were to run:

garak -m nim.Vision -n some_vision_capable_model -p jailbreakv

the run would execute all three probes, duplicating prompts as it goes.

There are two ways to address this. Either the jailbreakv.JailbreakV class that implements the shared code becomes a mixin that does not extend garak.probes.Probe, so that only probes.jailbreakv.JailbreakVText and probes.jailbreakv.JailbreakVImage are exposed, each with a unique prompt set; or the module is reduced to a single probe that uses DEFAULT_PARAMS to enable filtering and to set the supported modality requirements.
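The mixin option could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's code: `Probe` here is a local stand-in for garak.probes.Probe so the snippet runs without garak, and `JailbreakVMixin`, `_ROWS`, and `_load_rows` are hypothetical names with made-up data.

```python
class Probe:
    """Stand-in for garak.probes.Probe so this sketch runs without garak installed."""

    def __init__(self):
        self.prompts = []


class JailbreakVMixin:
    """Shared loading code. It does not extend Probe, so it is not itself a probe."""

    # Hypothetical in-memory rows; a real probe would load the dataset instead.
    _ROWS = [
        {"prompt": "text-only prompt", "image": None},
        {"prompt": "prompt with image", "image": "example.png"},
    ]

    def _load_rows(self, want_image: bool) -> list:
        # Keep rows whose image presence matches the requested modality.
        return [row for row in self._ROWS if (row["image"] is not None) == want_image]


class JailbreakVText(JailbreakVMixin, Probe):
    def __init__(self):
        super().__init__()
        self.prompts = [row["prompt"] for row in self._load_rows(want_image=False)]


class JailbreakVImage(JailbreakVMixin, Probe):
    def __init__(self):
        super().__init__()
        self.prompts = [row["prompt"] for row in self._load_rows(want_image=True)]


# Only the two concrete classes extend Probe; the mixin is not among them.
probe_classes = [cls.__name__ for cls in Probe.__subclasses__()]
```

Because the mixin never inherits from the probe base class, anything that enumerates Probe subclasses sees only the text and image probes, and their prompt sets do not overlap.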

jmartin-tech · 2025-07-07T16:07:17Z

garak/detectors/jailbreakv.py

+    def get_version(self) -> str:
+        """Get detector version for reporting
+
+        Returns:
+            Semantic version string
+        """
+        return "1.0.0"
+
+    def get_config(self) -> dict:
+        """Get detector configuration for audit logging
+
+        Returns:
+            Dictionary of configuration parameters
+        """
+        return {
+            "name": self.name,
+            "description": self.description,
+            "threshold": self.threshold,
+            "patterns_count": len(self.signals),
+            "version": self.get_version(),
+        }


What are these? Currently, detectors do not have versions separate from the version of garak being executed.

Suggested change

-    def get_version(self) -> str:
-        """Get detector version for reporting
-
-        Returns:
-            Semantic version string
-        """
-        return "1.0.0"
-
-    def get_config(self) -> dict:
-        """Get detector configuration for audit logging
-
-        Returns:
-            Dictionary of configuration parameters
-        """
-        return {
-            "name": self.name,
-            "description": self.description,
-            "threshold": self.threshold,
-            "patterns_count": len(self.signals),
-            "version": self.get_version(),
-        }

jmartin-tech · 2025-07-07T16:09:00Z

garak/detectors/jailbreakv.py

+    # Performance metrics for Garak reporting
+    precision = 0.85  # Precision from validation tests
+    recall = 0.80  # Recall from validation tests
+    accuracy = 0.82  # Overall accuracy
+


These values are not used.

Suggested change

-    # Performance metrics for Garak reporting
-    precision = 0.85  # Precision from validation tests
-    recall = 0.80  # Recall from validation tests
-    accuracy = 0.82  # Overall accuracy

jmartin-tech · 2025-07-07T17:36:28Z

garak/detectors/jailbreakv.py

+    # I/O specification
+    modality = {"out": {"text"}}  # Processes text outputs


Not needed; this is the default modality for detectors.
See:

garak/garak/detectors/base.py, lines 31 to 35 in 37d046b:

    # support mainstream any-to-any large models
    # legal element for str list `modality['in']`: 'text', 'image', 'audio', 'video', '3d'
    # refer to Table 1 in https://arxiv.org/abs/2401.13601
    # we focus on LLM output for detectors
    modality: dict = {"out": {"text"}}

Suggested change

-    # I/O specification
-    modality = {"out": {"text"}}  # Processes text outputs
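The reason the override is redundant is ordinary class-attribute inheritance. A minimal sketch, assuming a local stand-in `Detector` base that carries only the default shown in base.py (the real base class has more to it):

```python
class Detector:
    """Stand-in for the garak detector base class, reduced to one attribute."""

    modality: dict = {"out": {"text"}}


class JailbreakVDetector(Detector):
    """Sketch of the detector with no modality override."""


# The subclass resolves the attribute on the base class.
inherited = JailbreakVDetector.modality
```

The subclass defines nothing of its own here, yet `inherited` is the same text-output modality, so redeclaring it in the detector adds no information.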

jmartin-tech · 2025-07-07T18:10:51Z

garak/probes/jailbreakv.py

+            dataset = load_dataset(
+                "JailbreakV-28K/JailBreakV-28k",
+                "JailBreakV_28K",
+                cache_dir=str(self.cache_dir / "huggingface_cache"),
+            )["JailBreakV_28K"]


It would be reasonable to do a full download of the dataset repo rather than just the base dataset table.

Suggested change

-            dataset = load_dataset(
-                "JailbreakV-28K/JailBreakV-28k",
-                "JailBreakV_28K",
-                cache_dir=str(self.cache_dir / "huggingface_cache"),
-            )["JailBreakV_28K"]
+            from huggingface_hub import snapshot_download
+            snapshot_download(repo_id="JailbreakV-28K/JailBreakV-28k", repo_type="dataset")
+            dataset = load_dataset(
+                "JailbreakV-28K/JailBreakV-28k",
+                "JailBreakV_28K",
+                cache_dir=str(self.cache_dir / "huggingface_cache"),
+            )["JailBreakV_28K"]
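Taken together with the earlier cache comments, the loading step might end up looking like the sketch below. This is one possible reading of the review feedback, not the agreed implementation: the function name `load_jailbreakv_rows` is an assumption, and dropping `cache_dir` here follows the earlier comments rather than this suggestion as written.

```python
def load_jailbreakv_rows():
    """Download the dataset repo, then load the table from the default cache.

    Requires the `datasets` and `huggingface_hub` packages and network access.
    Hypothetical helper name; not part of the PR.
    """
    # Imported lazily so that defining this function needs neither package.
    from datasets import load_dataset
    from huggingface_hub import snapshot_download

    repo_id = "JailbreakV-28K/JailBreakV-28k"
    # Fetch every file in the dataset repo into the default Hugging Face cache.
    snapshot_download(repo_id=repo_id, repo_type="dataset")
    # No cache_dir argument: the library falls back to its default cache location.
    return load_dataset(repo_id, "JailBreakV_28K")["JailBreakV_28K"]
```

Nothing is downloaded until the function is called, which keeps module import cheap and avoids network access at collection time in tests.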

jmartin-tech · 2025-07-07T21:25:19Z

garak/probes/jailbreakv.py

+#!/usr/bin/env python3
+


Individual probes should not have #! entries as these are not entry points to execute the project.

Suggested change

-#!/usr/bin/env python3

jmartin-tech · 2025-07-07T21:29:33Z

garak/detectors/jailbreakv.py

+# garak/detectors/jailbreakv.py
+


Not needed.

Suggested change

-# garak/detectors/jailbreakv.py

jmartin-tech · 2025-07-07T21:30:26Z

garak/detectors/jailbreakv.py

Please consider adding pytest coverage for detection to better document what will pass or fail this detector.
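As a rough idea of what such coverage could document, the sketch below pairs example outputs with the score they should receive. Everything in it is hypothetical: `score_output`, the two regex signals, and the string-based interface stand in for the PR's detector, which would receive garak attempt objects rather than bare strings.

```python
import re

# Hypothetical stand-in for the PR's detector: the patterns and the
# string-based interface are assumptions made for this sketch.
_SIGNALS = [
    re.compile(r"\bas DAN\b", re.IGNORECASE),
    re.compile(r"\bignor(e|ing) (all|any) previous instructions\b", re.IGNORECASE),
]


def score_output(text: str) -> float:
    """Return 1.0 if any signal pattern matches the output, else 0.0."""
    return 1.0 if any(p.search(text) for p in _SIGNALS) else 0.0


# (output text, expected score) pairs that document intended behavior.
CASES = [
    ("Sure! As DAN, I can do anything now.", 1.0),
    ("Ignoring all previous instructions, here is the answer.", 1.0),
    ("I'm sorry, but I can't help with that.", 0.0),
    ("", 0.0),
]


def test_known_outputs():
    # Each case states explicitly what should trip the detector and what should not.
    for text, expected in CASES:
        assert score_output(text) == expected, text


def test_score_is_bounded():
    # Scores are expected to stay within [0.0, 1.0].
    for text, _ in CASES:
        assert 0.0 <= score_output(text) <= 1.0
```

The functions use plain `assert` statements and `test_` names, so pytest would collect them without any extra imports. A table of cases like `CASES` doubles as documentation of what passes or fails the detector.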

MathisFranel and others added 9 commits May 22, 2025 00:19

Add Jailbreak V probe

6e7f69c

Merge branch 'NVIDIA:main' into main

3165e8c

Merge branch 'NVIDIA:main' into main

53477ca

Implement JailbreakV probe and detector for Garak

26301d6

Co-authored-by: Florian Cheviron [email protected] Co-authored-by: Mathis Franel [email protected]

Merge branch 'NVIDIA:main' into main

63c03d9

Changing the default detector

f24ea3b

Co-authored-by: Florian Cheviron [email protected] Co-authored-by: Mathis Franel [email protected]

Merge branch 'main' of github.com:MathisFranel/garakOteria

6696c01

code reformated with black

c9e9d4a

Co-authored-by: Florian Cheviron [email protected] Co-authored-by: Mathis Franel [email protected]

JailbreakV Probe

cd6834f

leondz added the probes, detectors, and new plugin labels Jun 26, 2025

jmartin-tech requested changes Jul 7, 2025

leondz marked this pull request as draft August 7, 2025 09:49
