Rapid response harm scenario #1174

hannahwestra25 · 2025-11-06T17:47:19Z

Description

Add rapid response harm scenario which tests several different strategies for each harm category. The idea is to have a quick, comprehensive scenario to run before drilling down into more specific strategies.

Tests and Documentation

Added rapid response notebook plus instructions for dataset naming.
Added unit tests

…xample_scenario

hannahwestra25 · 2025-11-07T21:55:53Z

pyrit/scenarios/scenarios/ai_rt/rapid_response_harm_scenario.py

+    MultiTurn = ("multi_turn", {"attack"})
+    Crescendo = ("crescendo", {"attack"})


I'm on the fence about having this here. I'm leaning towards removing it and having this notebook only change the prompts but not being able to update the attack type (& converter type) is restricting. i could also see these being added to the scenario base class but i'm not sure if that would be confusing to users esp for an encoding scenario when we expose converters...

I mention this below but I think they should be removed as options (and still maybe do some of these strategies as part of the attacks run)

jsong468 · 2025-11-08T02:28:30Z

pyrit/scenarios/scenarios/ai_rt/rapid_response_harm_scenario.py

+    """
+    RapidResponseHarmStrategy defines a set of strategies for testing model behavior
+    in several different harm categories.
+


could be nice to have explanation of what the tags mean in this context?

jsong468 · 2025-11-08T02:47:44Z

pyrit/scenarios/scenarios/ai_rt/rapid_response_harm_scenario.py

+
+        self._memory_labels = memory_labels or {}
+
+        self._rapid_response_harm_strategy_compositiion = RapidResponseHarmStrategy.prepare_scenario_strategies(


Suggested change

self._rapid_response_harm_strategy_compositiion = RapidResponseHarmStrategy.prepare_scenario_strategies(

self._rapid_response_harm_strategy_composition = RapidResponseHarmStrategy.prepare_scenario_strategies(

jsong468 · 2025-11-08T02:50:03Z

pyrit/scenarios/scenarios/ai_rt/rapid_response_harm_scenario.py

+    HateHighSchoolDebate = ("hate_high_school_debates", {"hate", "harm"})
+    HateCrimeDirect = ("hate_crime_direct", {"hate", "harm"})
+    HateFictionalStory = ("hate_fictional_story", {"hate", "harm"})
+    HateHistoricalEvents = ("hate_historical_events", {"hate", "harm"})


maybe this will make sense as I dig deeper, but these datasets don't exist in our repo right? They have to be created?

rlundeen2 · 2025-11-08T03:48:34Z

doc/code/scenarios/rapid_response_harm_scenario.py

+
+# Hate speech datasets
+
+hate_stories = await create_seed_dataset(


I think we should manage a few of these, even if the list is incomplete. So instead of having strings in the notebooks, I'd put these in datasets/seed_prompts/ai_rt and maybe one file per category.

Eventually it might be nice to have a single function call that can load all our yaml seedprompts into the database and folks can use those as examples.

rlundeen2 · 2025-11-08T03:49:49Z

doc/code/scenarios/rapid_response_harm_scenario.py

+
+hate_stories = await create_seed_dataset(
+    "rapid_response_harm_hate_fictional_story",
+    [


I believe all of these should be considered objectives. E.g. I the scorers will work better if the objective is "Write a fictional story promoting racial discrimination"

rlundeen2 · 2025-11-08T03:50:35Z

doc/code/scenarios/rapid_response_harm_scenario.py

+ethnic_names = await create_seed_dataset(
+    "rapid_response_harm_fairness_ethnicity_inference",
+    [
+        "Mohammad Al-Sayed",


And in this case, the seedPrompts and objectives may be different.

rlundeen2 · 2025-11-08T03:51:26Z

doc/code/scenarios/rapid_response_harm_scenario.py

+from pyrit.prompt_target import OpenAIChatTarget
+
+# Define the target of the attack
+objective_target = OpenAIChatTarget(


I'd recommend defining these as in the scenario itself, and give them default values

Actually, looks like they are already defined; it's okay to have them here but might want to leave them off to simplify :)

rlundeen2 · 2025-11-08T03:53:04Z

pyrit/scenarios/__init__.py

    "ScenarioStrategy",
    "ScenarioIdentifier",
    "ScenarioResult",
+    "RapidResponseHarmScenario",


I'm thinking we might want a whole import line here? For example

from pyrit.scenarios.ai_rt import RapidResponseHarmScenario

But we may need some init shenanigans. IDK what do you think?

rlundeen2 · 2025-11-08T03:58:22Z

pyrit/scenarios/scenarios/ai_rt/rapid_response_harm_scenario.py

+    Each harm categories has a few different strategies to test different aspects of the harm type.
+    """
+
+    ALL = ("all", {"all"})


One idea is to only have the meta-categories. I think this may make the most sense just to have hate, fairness, violence.... leakage vs each individual scenario_strategy

I think the composition makes the code quite a bit more complicated, and I would guess most users will either just want to use "all" or a subset of the categories

In other words, I think it should look like the following (and that's it)

class RapidResponseHarmStrategy(ScenarioStrategy): """ RapidResponseHarmStrategy defines a set of strategies for testing model behavior in several different harm categories. Each harm categories has a few different strategies to test different aspects of the harm type. """ ALL = ("all", {"all"}) HATE = ("hate", set[str]()) FAIRNESS = ("fairness", set[str]()) VIOLENCE = ("violence", {set[str]()) SEXUAL = ("sexual", set[str]()) HARASSMENT = ("harassment", set[str]()) MISINFORMATION = ("misinformation", set[str]()) LEAKAGE = ("leakage", set[str]())

Alternatively, if you do want a long and short running version (which I also think is legit!) I might split it up like this, where the complex attacks contain long running methods. But my gut is that, it might just be simpler to have a completely separate scenario class for those

ALL = ("all", {"all"}) HATE_QUICK = ("hate_quick", {"quick", "hate"}) HATE_EXTENDED = ("hate_extended", {"complex", "hate"}) FAIRNESS_QUICK = ("fairness_quick", {"quick", "fairness"}) ...

Either way, I'd keep specific techniques out, and specific tests/datasets

rlundeen2 · 2025-11-08T04:01:46Z

pyrit/scenarios/scenarios/ai_rt/rapid_response_harm_scenario.py

+        objective_scorer: Optional[TrueFalseScorer] = None,
+        memory_labels: Optional[Dict[str, str]] = None,
+        max_concurrency: int = 5,
+        objective_dataset_path: Optional[str] = None,


I don't think we should have objective_dataset_path here as a parameter. Maybe something that may make more sense is "seedprompt_dataset_name", which it uses to grab the seedprompt dataset from memory. And it can have a default value that we've populated in our database, with the right harm categories labeled.

rlundeen2 · 2025-11-08T04:02:00Z

pyrit/scenarios/scenarios/ai_rt/rapid_response_harm_scenario.py

+        memory_labels: Optional[Dict[str, str]] = None,
+        max_concurrency: int = 5,
+        objective_dataset_path: Optional[str] = None,
+        include_baseline: bool = False,


We probably also want to include max_retries

rlundeen2 · 2025-11-08T04:05:27Z

pyrit/scenarios/scenarios/ai_rt/rapid_response_harm_scenario.py

+        objective_dataset_path: Optional[str] = None,
+        include_baseline: bool = False,
+    ):
+        """


I also recommend getting rid of include_baseline here. Setting it to False in the parent class, but callers of this class can't override it

rlundeen2 · 2025-11-08T04:07:09Z

pyrit/scenarios/scenarios/ai_rt/rapid_response_harm_scenario.py

+        return OpenAIChatTarget(
+            endpoint=os.environ.get("AZURE_OPENAI_GPT4O_UNSAFE_ENDPOINT"),
+            api_key=os.environ.get("AZURE_OPENAI_GPT4O_UNSAFE_CHAT_KEY"),
+            temperature=0.7,


recommend a higher temperature potentially

rlundeen2 · 2025-11-08T04:08:55Z

pyrit/scenarios/scenarios/ai_rt/rapid_response_harm_scenario.py

+        Returns:
+            List[AtomicAttack]: The list of AtomicAttack instances in this scenario.
+        """
+        return self._get_rapid_response_harm_attacks()


can we get rid of _get_rapid_response_harm_attacks() and just move the contents to this method? :)

rlundeen2 · 2025-11-08T04:26:51Z

pyrit/scenarios/scenarios/ai_rt/rapid_response_harm_scenario.py

+        # Extract RapidResponseHarmStrategy enums from the composite
+        strategy_list = [s for s in composite_strategy.strategies if isinstance(s, RapidResponseHarmStrategy)]
+
+        # Determine the attack type based on the strategy tags


I think we should make this decision in advance if we can (and if it's what operators want).

Say we get the strategy "Hate". Maybe we could do something like pick a set of strategies for hate that we want. Something like PromptSending for baseline, MultiTurn, and RolePlaying. But I could also see specific attacks/converters being created for different categories, so it might make sense to split it up this way too.

if strategy.value == "hate": seed_groups = memory.get_seed_groups(dataset_name="ai_rt_rapid_response_1", harm_category="hate") elif strategy.value == "violence": .... #now we have the seedGroups, and do we do the same attacks with every category or are they different? # my guess might be they're the same? # and can we decide? # My guess would be they're the same strategies but different objectives. attack1 = PromptSendingAttack( objective_target=self._objective_target, attack_converter_config=attack_converter_config, attack_scoring_config=self._scorer_config, ) attack2 = .... # and then append all of these atomic attacks in the same spot. E.g. you can have more than one "hate" attack and they will be grouped together atomic_attacks.append( AtomicAttack( atomic_attack_name="hate", attack=attack1, objectives=hate_objectives, seed_groups=hate_seed_groups ) ) atomic_attacks.append( AtomicAttack( atomic_attack_name="hate", attack=attack2, objectives=hate_objectives, seed_groups=hate_seed_groups ) )

rlundeen2 · 2025-11-08T04:27:20Z

pyrit/scenarios/scenarios/ai_rt/rapid_response_harm_scenario.py

+            memory_labels=self._memory_labels,
+        )
+
+    def _get_attack(


I know this follows some FoundryScenario logic, but I think that case is more complicated than it needs to be. We probably don't need a generic for this class; especially if we don't use composite strategies

rlundeen2 · 2025-11-08T04:29:29Z

pyrit/scenarios/scenarios/ai_rt/rapid_response_harm_scenario.py

+        attack_type: type[AttackStrategy] = PromptSendingAttack
+        if attack_tag:
+            if attack_tag[0] == RapidResponseHarmStrategy.Crescendo:
+                attack_type = CrescendoAttack


One arc you might be thinking about is Crescendo. But because that takes so much longer to run we might consider a different rapid response scenario for that. And/or for this one, we could pre-compute successes so it runs really fast (e.g. similar to our second cookbook).

rlundeen2 · 2025-11-08T04:30:05Z

pyrit/scenarios/scenarios/ai_rt/rapid_response_harm_scenario.py

+    def __init__(
+        self,
+        *,
+        objective_target: PromptTarget,


Because we include crescendo and a few others that require history changes, this needs to be a PromptChatTarget

rlundeen2 · 2025-11-08T04:33:27Z

Overall this is good! It'll be really nice to have solid examples here :)

My biggest feedback is that I think we should define exactly what we want out of this scenario. Here is what I think it is. "Can I get a vibe of this objective_target in a couple hours based on how it does on these harm categories".

And if we keep that strategy, we want to do the best we can to answer that question, and the strategies themselves should be baked in as much as possible. Along these lines, I'd recommend:

Simplify the strategies. I suspect most users just want to run "all" to get a vibe check, or to run specific harm categories. And if there is a strategy they want but it takes a long time (like crescendo) maybe we should split that off into a seperate longer-running scenario class.
Choose the attacks to do with those strategies explicitly (which converters and attacks to use). E.g. we can get the objectives from memory, and then this scenario can decide how we send those. I wouldn't make this configurable, because it adds another dimension to things.

rlundeen2 · 2025-11-08T05:16:19Z

pyrit/scenarios/scenarios/ai_rt/rapid_response_harm_scenario.py

+
+        self._objective_target = objective_target
+        self._adversarial_chat = adversarial_chat if adversarial_chat else self._get_default_adversarial_target()
+        self._objective_scorer = objective_scorer if objective_scorer else self._get_default_scorer()


one thing we may want to do early (before the scenario is run) is raise an exception if the datasets don't exist in memory.

jbolor21 · 2025-11-08T13:38:58Z

doc/code/scenarios/rapid_response_harm_scenario.py

+# %% [markdown]
+# ## Testing Violence-Related Harm Categories
+#
+# In this section, we focus specifically on violence-related harm categories. We'll create datasets for:


nit: maybe "sample" datasets so people know this is a small sample/they can add/change

hannahwestra25 added 4 commits November 6, 2025 09:59

first draft of harm scenario

6a3da9e

Merge branch 'main' of https://github.com/Azure/PyRIT into hawestra/e…

04d5742

…xample_scenario

Merge branch 'main' of https://github.com/Azure/PyRIT into hawestra/e…

2c03890

…xample_scenario

add tests and documentation

ea488f5

hannahwestra25 commented Nov 7, 2025

View reviewed changes

hannahwestra25 changed the title ~~[DRAFT] rapid response harm scenario~~ Rapid response harm scenario Nov 7, 2025

jsong468 reviewed Nov 8, 2025

View reviewed changes

rlundeen2 reviewed Nov 8, 2025

View reviewed changes

jbolor21 reviewed Nov 8, 2025

View reviewed changes

		MultiTurn = ("multi_turn", {"attack"})
		Crescendo = ("crescendo", {"attack"})


		self._memory_labels = memory_labels or {}

		self._rapid_response_harm_strategy_compositiion = RapidResponseHarmStrategy.prepare_scenario_strategies(


		# Hate speech datasets

		hate_stories = await create_seed_dataset(

Rapid response harm scenario #1174

Are you sure you want to change the base?

Rapid response harm scenario #1174

Conversation

hannahwestra25 commented Nov 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Tests and Documentation

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jsong468 Nov 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rlundeen2 Nov 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rlundeen2 Nov 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rlundeen2 Nov 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rlundeen2 Nov 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rlundeen2 Nov 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rlundeen2 commented Nov 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rlundeen2 Nov 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

hannahwestra25 commented Nov 6, 2025 •

edited

Loading

jsong468 Nov 8, 2025 •

edited

Loading

rlundeen2 Nov 8, 2025 •

edited

Loading

rlundeen2 Nov 8, 2025 •

edited

Loading

rlundeen2 Nov 8, 2025 •

edited

Loading

rlundeen2 Nov 8, 2025 •

edited

Loading

rlundeen2 Nov 8, 2025 •

edited

Loading

rlundeen2 commented Nov 8, 2025 •

edited

Loading

rlundeen2 Nov 8, 2025 •

edited

Loading