apple/ml-annotator-policy-models

Annotator Policy Models

We introduce Annotator Policy Models (APMs), interpretable models that learn annotators' internal safety policies from labeling behavior alone, making annotator reasoning visible and comparable without additional annotation effort.

Applying APMs to LLM and human annotations, we demonstrate two core applications: (1) surfacing policy ambiguity by revealing how annotators interpret safety instructions differently, and (2) surfacing value pluralism by uncovering systematic differences in safety priorities across demographic groups. Together, these capabilities support more targeted, transparent, and inclusive safety policy design.

[Figure: Overview of APM]

This code accompanies the research paper:

Understanding Annotator Safety Policy with Interpretability
Alex Oesterling, Donghao Ren, Yannick Assogba, Dominik Moritz, Sunnie S. Y. Kim, Leon Gatys, Fred Hohman
FAccT, 2026.
Paper, GitHub

Getting Started

Requires Python 3.10+ and uv.

uv sync

See DOCUMENTATION.md for full setup instructions, data preparation, and how to run experiments.

Repo Structure

  • safe_dictionary_learning — core library with decision function models (Non-Negative Logistic Regression, DNF) and text embedding/feature generation.
  • experiments — code to replicate the experimental results from the paper.
  • figures — notebooks and data for generating paper figures.
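To make the decision-function idea concrete: a non-negative logistic regression constrains every coefficient to be ≥ 0, so each interpretable "safety concept" feature can only push the predicted unsafe probability upward, and the learned weights read directly as an annotator's policy priorities. The sketch below is illustrative only, not the library's actual API; the function and variable names are hypothetical, and it uses plain projected gradient descent rather than whatever solver the repo uses.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_nonneg_logreg(X, y, lr=0.1, steps=2000):
    """Hypothetical sketch of a non-negative logistic regression.

    Projected gradient descent: after each gradient step, weights are
    clipped to the nonnegative orthant, so every feature can only
    *increase* the predicted probability of an unsafe label.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)          # current predicted probabilities
        grad_w = X.T @ (p - y) / n      # gradient of the log-loss w.r.t. w
        grad_b = np.mean(p - y)
        w = np.maximum(w - lr * grad_w, 0.0)  # projection: keep w >= 0
        b -= lr * grad_b
    return w, b

# Toy example: two binary "safety concept" features; only the first
# actually drives this annotator's labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 2)).astype(float)
y = (X[:, 0] == 1).astype(float)
w, b = fit_nonneg_logreg(X, y)
```

On data like this, the weight on the label-driving feature dominates while the irrelevant feature's weight is clipped near zero, which is what makes the fitted weights readable as a per-annotator policy.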

Contributing

When making contributions, refer to the CONTRIBUTING guidelines and read the CODE OF CONDUCT.

BibTeX

To cite our paper, please use:

@inproceedings{oesterling2026understanding,
    title={Understanding Annotator Safety Policy with Interpretability},
    author={Oesterling, Alex and Ren, Donghao and Assogba, Yannick and Moritz, Dominik and Kim, Sunnie S. Y. and Gatys, Leon and Hohman, Fred},
    booktitle={ACM Conference on Fairness, Accountability, and Transparency},
    year={2026},
    doi={10.1145/3805689.3806472}
}

License

This code is released under the LICENSE terms.
