@0xm00n commented Nov 5, 2025

Description

This PR adds support for Anthropic's model-written-evals dataset to PyRIT. The dataset contains 154 evaluation datasets that test LLM behaviors across four main categories: persona traits, sycophancy, advanced AI risks, and gender bias. The test cases were generated automatically by language models and cover multiple behavioral dimensions.

Dataset: https://github.com/anthropics/evals

Associated Paper: https://arxiv.org/abs/2212.09251

Work Completed

  • Implemented the fetch_anthropic_evals_dataset() function in anthropic_evals_dataset.py
  • Added 12 unit tests in test_anthropic_evals_dataset.py
  • Added an integration test in test_fetch_datasets.py
  • Updated API documentation in api.rst
  • Registered function in pyrit/datasets/__init__.py

Related Issue

Contributes to issue #450

0xm00n and others added 5 commits November 2, 2025 20:10
[pull] main from Azure:main
  - Implement fetch_sorry_bench_dataset() for sorry-bench/sorry-bench-202503 on Hugging Face
  - Support filtering by 44 categories and 21 prompt style linguistic mutations
  - Default to base prompts only
  - Add unit tests (10 tests)
  - Add integration test
  - Update API documentation
…dation

  - Change prompt_styles (list) to prompt_style (string) for simpler API
  - Add input validation for categories and prompt_style
  - Add VALID_CATEGORIES and VALID_PROMPT_STYLES for input validation
  - Fix exception handling to preserve stack traces
  - Update tests to match new API (12 tests, all passing)
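
The validation and exception-handling changes in this commit can be illustrated with a minimal sketch. It is not the code from the PR: the set contents, the `validate_filters` and `load_rows` names, and the `"base"` default value are placeholders chosen for illustration, and the real `VALID_CATEGORIES` and `VALID_PROMPT_STYLES` lists are much longer.

```python
from typing import Callable, Iterable, List, Optional

# Placeholder values (assumption); the PR defines the real lists.
VALID_CATEGORIES = frozenset({"category_a", "category_b"})
VALID_PROMPT_STYLES = frozenset({"base", "style_x"})


def validate_filters(
    categories: Optional[Iterable[str]] = None,
    prompt_style: str = "base",
) -> None:
    """Raise ValueError for unknown categories or an unknown prompt style."""
    if categories is not None:
        unknown = sorted(set(categories) - VALID_CATEGORIES)
        if unknown:
            raise ValueError(f"Invalid categories: {unknown}")
    if prompt_style not in VALID_PROMPT_STYLES:
        raise ValueError(f"Invalid prompt_style: {prompt_style!r}")


def load_rows(loader: Callable[[], List[dict]]) -> List[dict]:
    """Wrap a loader call; `raise ... from exc` keeps the original traceback."""
    try:
        return loader()
    except Exception as exc:
        # Chaining preserves the underlying error as __cause__.
        raise RuntimeError("Failed to load dataset") from exc
```

Taking a single `prompt_style` string rather than a list keeps the check to one membership test, which matches the "simpler API" motivation stated in the commit message.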
  Implements support for Anthropic's model-written-evals dataset with 154 evaluation datasets across 4 categories.

  Changes:
  - Add fetch_anthropic_evals_dataset() function with category filtering
  - Support for persona, sycophancy, advanced-ai-risk, and winogenerated categories
  - Fetches directly from GitHub repository using API for file discovery
  - Add unit tests (12 tests)
  - Add integration test
  - Update API docs
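
As a rough sketch of what "using API for file discovery" could look like, the snippet below lists a category directory through the GitHub repository contents endpoint and keeps only the `.jsonl` files. The helper names are hypothetical and this is not the PR's implementation. It assumes each category maps to a top-level directory of the same name in anthropics/evals, and it skips subdirectories, which a real fetcher might need to recurse into.

```python
import json
import urllib.request
from typing import Dict, List

# Contents endpoint for the repository named in the PR description.
GITHUB_CONTENTS_API = "https://api.github.com/repos/anthropics/evals/contents"
CATEGORIES = ("persona", "sycophancy", "advanced-ai-risk", "winogenerated")


def contents_url(category: str) -> str:
    """Build the contents-API URL for one category directory."""
    if category not in CATEGORIES:
        raise ValueError(f"Invalid category: {category!r}")
    return f"{GITHUB_CONTENTS_API}/{category}"


def select_jsonl_files(listing: List[dict]) -> Dict[str, str]:
    """Map file name to download URL for .jsonl files in a directory listing."""
    return {
        entry["name"]: entry["download_url"]
        for entry in listing
        if entry.get("type") == "file"
        and entry.get("name", "").endswith(".jsonl")
        and entry.get("download_url")
    }


def discover_files(category: str) -> Dict[str, str]:
    """Query GitHub for one category (network call; unauthenticated requests are rate limited)."""
    with urllib.request.urlopen(contents_url(category), timeout=30) as resp:
        listing = json.load(resp)
    return select_jsonl_files(listing)
```

Separating `select_jsonl_files` from the network call means the filtering logic can be unit tested against a canned listing without touching GitHub.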
@hannahwestra25

Thanks for adding these! Made a few stylistic suggestions!

@romanlutz left a comment

Nice work! Looking forward to having this dataset fetcher in PyRIT 🙂
