FEAT: Psychosocial Harms Red Teaming Automation #1138
Conversation
```python
        in order to score the conversation as a whole.
        """

    _default_validator: ScorerPromptValidator = ScorerPromptValidator()
```
You probably want to validate that these are single-part text conversations:

```python
default_validator = ScorerPromptValidator(
    supported_data_types=["text"],
    max_pieces_in_response=1,
    enforce_all_pieces_valid=True,
)
```
```python
    def __init__(
        self,
        scorer: FloatScaleScorer,
```
I'd put `scorer` after the `*`, making it keyword-only.
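For illustration, a minimal sketch of what the keyword-only signature would look like; `FloatScaleScorer` is stubbed out here and the class name is hypothetical, standing in for the class under review:

```python
from typing import Optional


class FloatScaleScorer:
    """Stand-in stub for the real PyRIT scorer type."""


class ConversationScorer:
    """Hypothetical name for the class under review."""

    def __init__(
        self,
        *,
        scorer: FloatScaleScorer,  # after the *, callers must pass scorer=...
        validator: Optional[object] = None,
    ):
        self._scorer = scorer
        self._validator = validator
```

With this signature, `ConversationScorer(FloatScaleScorer())` raises a `TypeError`; callers must write `ConversationScorer(scorer=...)`, which keeps the call site explicit and lets new parameters be added without breaking positional callers.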
```python
        sorted_conversation = sorted(conversation, key=lambda x: x.message_pieces[0].timestamp)

        # Goes through each message in the conversation and appends user and assistant messages
        for message in sorted_conversation:
```
This needs to handle multi-modal and other data types, as well as multi-part messages.
I recommend refactoring to add an `_expand_conversation_to_string` function that takes this logic out, and adding unit tests for that function to make sure it works well.
```python
        conversation_text = ""

        # Sort by timestamp to get chronological order
        sorted_conversation = sorted(conversation, key=lambda x: x.message_pieces[0].timestamp)
```
`get_conversation` should already return the messages sorted, so this re-sort is unnecessary.
```python
        return scores

    async def score_text_async(self, text: str, *, objective: Optional[str] = None) -> list[Score]:
```
This should be removed.
```python
        # Create a modified copy of the message piece with the full conversation text
        # This preserves the original message piece metadata while replacing the content
        modified_piece = MessagePiece(
```
Rather than creating a new conversation like this, I recommend making use of `score_text_async`. Pseudocode:

```python
# Build conversation text using helper method
conversation_text = self._expand_conversation_to_string(conversation)

# Use the scorer's built-in score_text_async method
scores = await self._scorer.score_text_async(text=conversation_text, objective=objective)

# Update the scores to reference the original message piece
for score in scores:
    score.prompt_request_response_id = message_piece.id
```
```python
        super().__init__(validator=validator or self._default_validator)
        self._scorer = scorer

    async def _score_piece_async(self, message_piece: MessagePiece, *, objective: Optional[str] = None) -> list[Score]:
```
For this, I recommend overriding at the `_score_async` level (not `_score_piece_async`). We're working with the whole history and need to avoid scoring piece by piece.
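To illustrate the difference, here is a toy sketch (all class and method names are simplified stand-ins for the PyRIT ones): the base class's default `_score_async` fans out to `_score_piece_async`, while the override works on the whole history at once:

```python
import asyncio


class Scorer:
    """Toy base class standing in for PyRIT's Scorer."""

    async def _score_piece_async(self, piece, *, objective=None):
        raise NotImplementedError

    async def _score_async(self, conversation, *, objective=None):
        # Default behavior: score each piece independently.
        scores = []
        for piece in conversation:
            scores.extend(await self._score_piece_async(piece, objective=objective))
        return scores


class ConversationHistoryScorer(Scorer):
    """Hypothetical wrapper that scores the conversation as a whole."""

    async def _score_async(self, conversation, *, objective=None):
        # Override at the conversation level: build one text from the whole
        # history and produce a single score, instead of piece-by-piece.
        text = "\n".join(f"{role}: {value}" for role, value in conversation)
        return [len(text)]  # placeholder "score" for illustration only
```

Overriding `_score_piece_async` instead would force the full-history logic to run once per piece, which is exactly what the reviewer is warning against.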
```python
        super().__init__(validator=ScorerPromptValidator())
```

```diff
-        if threshold <= 0 or threshold >= 1:
+        if threshold < 0 or threshold > 1:
```
Maybe change this back, because a threshold of 0 also doesn't make sense (the scorer would always return true).
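A small sketch of the stricter check being asked for (the function name is illustrative): both endpoints are rejected, since a threshold of exactly 0 means every score passes, and exactly 1 passes only a perfect score:

```python
def validate_threshold(threshold: float) -> float:
    """Require the threshold to lie strictly between 0 and 1."""
    # 0 would make the scorer always return true; 1 would only fire on a
    # perfect score; anything outside [0, 1] is out of range entirely.
    if threshold <= 0 or threshold >= 1:
        raise ValueError(f"threshold must be strictly between 0 and 1, got {threshold}")
    return threshold
```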
Description
Adds a notebook for red teaming psychosocial harms using a multi-step approach that models user behaviors, contexts, and evaluations.
Tests and Documentation
Ran the notebook.