README.md (+10, -4)
@@ -1,4 +1,9 @@
 # [ToxiGen](http://arxiv.org/abs/2203.09509): A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
+
+## [June 17, 2024] Update: Releasing 27,450 human annotations.
+You can now download the raw human annotations via `load_dataset("toxigen/toxigen-data", "annotations")`. These data include 27,450 responses from all mechanical turk annotations. All WorkerIDs have been hashed to further anonymize the annotators.
+
+## Overview
 This repository includes all necessary components that we used to generate ToxiGen dataset which contains implicitly toxic and benign sentences mentioning 13 minority groups. It includes a tool referred to as ALICE to stress test a given off-the-shelf content moderation system and iteratively improve it across these minority groups.
 
 With release of the source codes and prompt seeds for this work we hope to encourage and engage community to contribute to it by for example adding prompt seeds and generating data for minority groups that are not covered in our dataset or even scenarios we have not covered to continuously iterate and improve it (e.g., by submitting PR to this repository).
@@ -13,14 +18,15 @@ This repository includes two methods for generating new sentences given a large
 
 ## Downloading ToxiGen
 
-You can download ToxiGen using HuggingFace 🤗 from [this webpage](https://huggingface.co/datasets/skg/toxigen-data) or through python:
+ToxiGen is available on [HuggingFace](https://huggingface.co/datasets/toxigen/toxigen-data).
 
-To run these commands you'll need to create a Hugging Face auth_token by following [these](https://huggingface.co/docs/hub/security-tokens) steps. As discussed below, you can manually use `use_auth_token={auth_token}` or register your token with your transformers installation via huggingface-cli.
+To download with python, you'll need to create a Hugging Face auth_token by following [these instructions](https://huggingface.co/docs/hub/security-tokens). As discussed below, you can manually use `use_auth_token={auth_token}` or register your token with your transformers installation via huggingface-cli.
 
 ```
 from datasets import load_dataset
-TG_data = load_dataset("skg/toxigen-data", name="train", use_auth_token=True) # 250k training examples
-TG_annotations = load_dataset("skg/toxigen-data", name="annotated", use_auth_token=True) # Human study
+train_data = load_dataset("toxigen/toxigen-data", name="train", use_auth_token=True) # 250k training examples
+annotated_data = load_dataset("toxigen/toxigen-data", name="annotated", use_auth_token=True) # Human study
+raw_annotations = load_dataset("toxigen/toxigen-data", name="annotations", use_auth_token=True) # Raw Human study
 ```
 
 **Optional, but helpful**: Please fill out [this form](https://forms.office.com/r/r6VXX8f8vh) so we can track how the community uses ToxiGen.
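
The token handling described in this change (pass a token explicitly, or rely on one registered via huggingface-cli) could be wrapped roughly as follows. This is only a sketch: the helper names `build_load_kwargs` and `load_toxigen`, and the choice to read the token from an `HF_TOKEN` environment variable, are assumptions for illustration and are not part of the repository. Recent `datasets` releases may also expect `token=` in place of `use_auth_token=`.

```python
import os

REPO_ID = "toxigen/toxigen-data"  # dataset repository named in the README
CONFIGS = ("train", "annotated", "annotations")  # configurations listed in the README


def build_load_kwargs(config, token=None):
    """Build keyword arguments for datasets.load_dataset (hypothetical helper)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration {config!r}; expected one of {CONFIGS}")
    # Assumption: an explicit token wins, then the HF_TOKEN environment variable.
    # Otherwise pass True, which asks the library to use the token stored by
    # `huggingface-cli login`.
    auth = token or os.environ.get("HF_TOKEN") or True
    return {"path": REPO_ID, "name": config, "use_auth_token": auth}


def load_toxigen(config, token=None):
    """Download one configuration; needs the `datasets` package and network access."""
    from datasets import load_dataset  # imported lazily so the helper above works offline
    return load_dataset(**build_load_kwargs(config, token))
```

For example, `load_toxigen("annotations")` would request the raw human annotations using whichever token is available.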