Conversation

@jbolor21 jbolor21 commented Oct 17, 2025

Description

Adds a notebook for red teaming psychosocial harms using a multi-step approach that models user behaviors, contexts, and evaluations.

  • Created a new conversation scorer that scores the entire conversation
  • Added a toy dataset with sample multi-turn conversations
  • Added a sample attack strategy YAML file modeling a user's escalation toward crisis

Tests and Documentation

Ran notebook

@jbolor21 jbolor21 marked this pull request as draft October 17, 2025 17:57
@jbolor21 jbolor21 changed the title [DRAFT] Psychosocial Harms Red Teaming Automation FEAT: Psychosocial Harms Red Teaming Automation Oct 20, 2025
@jbolor21 jbolor21 marked this pull request as ready for review October 20, 2025 21:25
@rlundeen2 rlundeen2 self-assigned this Oct 24, 2025
in order to score the conversation as a whole.
"""

_default_validator: ScorerPromptValidator = ScorerPromptValidator()
@rlundeen2 rlundeen2 commented Nov 6, 2025

You probably want to validate that these are single-part text conversations.


default_validator = ScorerPromptValidator(
    supported_data_types=["text"],
    max_pieces_in_response=1,
    enforce_all_pieces_valid=True,
)


def __init__(
    self,
    scorer: FloatScaleScorer,

I'd put scorer after *
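A minimal sketch of the suggested keyword-only signature (the validator parameter and its default are taken from the super().__init__ call further down; treat the exact shape as an assumption):

def __init__(
    self,
    *,
    scorer: FloatScaleScorer,
    validator: Optional[ScorerPromptValidator] = None,
) -> None:
    super().__init__(validator=validator or self._default_validator)
    self._scorer = scorer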

sorted_conversation = sorted(conversation, key=lambda x: x.message_pieces[0].timestamp)

# Goes through each message in the conversation and appends user and assistant messages
for message in sorted_conversation:
@rlundeen2 rlundeen2 commented Nov 6, 2025

Need to handle multi-modal and other data types, as well as multi-part messages.


I recommend refactoring to pull this logic out into an _expand_conversation_to_string function, and adding unit tests for that function to make sure it works well.
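A rough sketch of what that helper could look like, assuming the Message type name and the role, converted_value, and converted_value_data_type attributes on MessagePiece (all assumptions based on how pieces are used elsewhere in this PR); non-text pieces get a placeholder so the multi-modal comment above isn't silently dropped:

def _expand_conversation_to_string(self, conversation: list[Message]) -> str:
    lines: list[str] = []
    for message in conversation:
        for piece in message.message_pieces:
            # Attribute names are assumed; adjust to the actual MessagePiece fields
            if piece.converted_value_data_type == "text":
                lines.append(f"{piece.role}: {piece.converted_value}")
            else:
                # Keep a marker for non-text pieces instead of dropping them
                lines.append(f"{piece.role}: [{piece.converted_value_data_type} content]")
    return "\n".join(lines)

A helper like this is also easy to unit test on its own: build a couple of messages with known pieces and assert on the joined string.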

conversation_text = ""

# Sort by timestamp to get chronological order
sorted_conversation = sorted(conversation, key=lambda x: x.message_pieces[0].timestamp)

get_conversation should already be sorted


return scores

async def score_text_async(self, text: str, *, objective: Optional[str] = None) -> list[Score]:

this should be removed


# Create a modified copy of the message piece with the full conversation text
# This preserves the original message piece metadata while replacing the content
modified_piece = MessagePiece(

Rather than creating a new conversation like this, I recommend making use of score_text_async. Pseudocode:

# Build conversation text using helper method
conversation_text = self._expand_conversation_to_string(conversation)

# Use the scorer's built-in score_text_async method
scores = await self._scorer.score_text_async(text=conversation_text, objective=objective)

# Update the scores to reference the original message piece
for score in scores:
    score.prompt_request_response_id = message_piece.id

super().__init__(validator=validator or self._default_validator)
self._scorer = scorer

async def _score_piece_async(self, message_piece: MessagePiece, *, objective: Optional[str] = None) -> list[Score]:

For this, I recommend overriding at the _score_async level (not _score_piece_async). We're working with the whole history and shouldn't score piece by piece.
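A rough sketch of that shape (the _score_async signature, the self._memory handle, and get_conversation are assumptions about the base Scorer API; the helper and score_text_async usage follow the earlier comments):

async def _score_async(self, message: Message, *, objective: Optional[str] = None) -> list[Score]:
    # Work with the full history for this conversation rather than piece by piece
    conversation = self._memory.get_conversation(conversation_id=message.message_pieces[0].conversation_id)
    conversation_text = self._expand_conversation_to_string(conversation)

    # Delegate to the wrapped scorer's text entry point
    scores = await self._scorer.score_text_async(text=conversation_text, objective=objective)

    # Point the resulting scores back at the message that triggered scoring
    for score in scores:
        score.prompt_request_response_id = message.message_pieces[0].id
    return scores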

super().__init__(validator=ScorerPromptValidator())

- if threshold <= 0 or threshold >= 1:
+ if threshold < 0 or threshold > 1:

Maybe change this back, because if it is 0 that also doesn't make sense (this would always return true)
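In other words, keep the original strict bounds check (assuming the check raises, as before; the message is illustrative):

if threshold <= 0 or threshold >= 1:
    raise ValueError("The threshold must be strictly between 0 and 1.")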
