Conversation

@jbolor21 jbolor21 commented Oct 17, 2025

Description

Adds a notebook for red teaming psychosocial harms using a multi-step approach that models user behaviors, contexts, and evaluations.

  • Created a new conversation scorer that scores the entire conversation
  • Added a toy dataset with sample multi-turn conversations
  • Added a sample attack strategy YAML file modeling a user's escalation toward crisis

Tests and Documentation

Ran notebook

@jbolor21 jbolor21 marked this pull request as draft October 17, 2025 17:57
@jbolor21 jbolor21 changed the title [DRAFT] Psychosocial Harms Red Teaming Automation FEAT: Psychosocial Harms Red Teaming Automation Oct 20, 2025
@jbolor21 jbolor21 marked this pull request as ready for review October 20, 2025 21:25
@rlundeen2 rlundeen2 self-assigned this Oct 24, 2025
in order to score the conversation as a whole.
"""

_default_validator: ScorerPromptValidator = ScorerPromptValidator()
@rlundeen2 rlundeen2 commented Nov 6, 2025

You probably want to validate that these are single-part text conversations.


default_validator = ScorerPromptValidator(
    supported_data_types=["text"],
    max_pieces_in_response=1,
    enforce_all_pieces_valid=True,
)


def __init__(
    self,
    scorer: FloatScaleScorer,

I'd put scorer after *
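A minimal sketch of the suggested keyword-only signature (the validator parameter and its default are taken from the super().__init__ call further down; treat the exact shape as an assumption):

def __init__(
    self,
    *,
    scorer: FloatScaleScorer,
    validator: Optional[ScorerPromptValidator] = None,
) -> None:
    super().__init__(validator=validator or self._default_validator)
    self._scorer = scorer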

sorted_conversation = sorted(conversation, key=lambda x: x.message_pieces[0].timestamp)

# Goes through each message in the conversation and appends user and assistant messages
for message in sorted_conversation:
@rlundeen2 rlundeen2 commented Nov 6, 2025

Need to handle multi-modal and other data types, as well as multi-part messages.


I recommend refactoring to pull this logic out into an _expand_conversation_to_string function, and adding unit tests for that function to make sure it works well.
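A rough sketch of what that helper could look like, assuming the Message type name and the role, converted_value, and converted_value_data_type attributes on MessagePiece (all assumptions based on how pieces are used elsewhere in this PR); non-text pieces get a placeholder so the multi-modal comment above isn't silently dropped:

def _expand_conversation_to_string(self, conversation: list[Message]) -> str:
    lines: list[str] = []
    for message in conversation:
        for piece in message.message_pieces:
            # Attribute names are assumed; adjust to the actual MessagePiece fields
            if piece.converted_value_data_type == "text":
                lines.append(f"{piece.role}: {piece.converted_value}")
            else:
                # Keep a marker for non-text pieces instead of dropping them
                lines.append(f"{piece.role}: [{piece.converted_value_data_type} content]")
    return "\n".join(lines)

A helper like this is also easy to unit test on its own: build a couple of messages with known pieces and assert on the joined string.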

conversation_text = ""

# Sort by timestamp to get chronological order
sorted_conversation = sorted(conversation, key=lambda x: x.message_pieces[0].timestamp)

get_conversation should already be sorted


return scores

async def score_text_async(self, text: str, *, objective: Optional[str] = None) -> list[Score]:

this should be removed


# Create a modified copy of the message piece with the full conversation text
# This preserves the original message piece metadata while replacing the content
modified_piece = MessagePiece(

Rather than creating a new conversation like this, I recommend making use of score_text_async. Pseudocode:

# Build conversation text using helper method
conversation_text = self._expand_conversation_to_string(conversation)

# Use the scorer's built-in score_text_async method
scores = await self._scorer.score_text_async(text=conversation_text, objective=objective)

# Update the scores to reference the original message piece
for score in scores:
    score.prompt_request_response_id = message_piece.id

super().__init__(validator=validator or self._default_validator)
self._scorer = scorer

async def _score_piece_async(self, message_piece: MessagePiece, *, objective: Optional[str] = None) -> list[Score]:

For this, I recommend overriding at the _score_async level (not _score_piece_async). We're working with the whole history and shouldn't score piece by piece.
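A rough sketch of that shape (the _score_async signature, the self._memory handle, and get_conversation are assumptions about the base Scorer API; the helper and score_text_async usage follow the earlier comments):

async def _score_async(self, message: Message, *, objective: Optional[str] = None) -> list[Score]:
    # Work with the full history for this conversation rather than piece by piece
    conversation = self._memory.get_conversation(conversation_id=message.message_pieces[0].conversation_id)
    conversation_text = self._expand_conversation_to_string(conversation)

    # Delegate to the wrapped scorer's text entry point
    scores = await self._scorer.score_text_async(text=conversation_text, objective=objective)

    # Point the resulting scores back at the message that triggered scoring
    for score in scores:
        score.prompt_request_response_id = message.message_pieces[0].id
    return scores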

super().__init__(validator=ScorerPromptValidator())

- if threshold <= 0 or threshold >= 1:
+ if threshold < 0 or threshold > 1:

Maybe change this back, because if it is 0 that also doesn't make sense (this would always return true)
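In other words, keep the original strict bounds check (assuming the check raises, as before; the message is illustrative):

if threshold <= 0 or threshold >= 1:
    raise ValueError("The threshold must be strictly between 0 and 1.")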
