GitHub - franciellevargas/HateBR: HateBR is the first large-scale expert annotated dataset of Brazilian Instagram comments for hate speech and offensive language detection on the web and social media.

HateBR - Offensive Language and Hate Speech Dataset in Brazilian Portuguese

HateBR is the first large-scale, expert-annotated dataset of Brazilian Instagram comments for abusive language detection on the web and social media. The corpus was collected from comments on the Instagram accounts of Brazilian politicians and manually annotated by specialists. It comprises 7,000 documents annotated in three layers: a binary classification (offensive versus non-offensive comments), an offensiveness level (highly, moderately, or slightly offensive), and nine hate speech targets (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism, and fatphobia). Each comment was annotated by three different annotators, and high inter-annotator agreement was achieved. Baseline experiments on this corpus outperformed the existing dataset baselines reported in the literature for Portuguese. We hope that this expert-annotated dataset will foster research on hate speech detection in Natural Language Processing.
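The three annotation layers can be pictured as a small record type. The sketch below is illustrative only: the class and field names are hypothetical and are not the column names of the released files, and the rule that non-offensive comments carry no level or target is an assumption, not something this README states. The 0/1 labels follow the binary class table in this README.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class OffensivenessLevel(Enum):
    """Second layer: how offensive a comment is (names taken from the README)."""
    HIGHLY = "highly"
    MODERATELY = "moderately"
    SLIGHTLY = "slightly"


# Third layer: the nine hate speech targets listed in the README.
HATE_SPEECH_TARGETS = (
    "xenophobia", "racism", "homophobia", "sexism", "religious intolerance",
    "partyism", "apology for the dictatorship", "antisemitism", "fatphobia",
)


@dataclass
class AnnotatedComment:
    """Hypothetical in-memory view of one annotated comment."""
    text: str
    offensive: int                              # first layer: 1 = offensive, 0 = non-offensive
    level: Optional[OffensivenessLevel] = None  # second layer
    targets: Tuple[str, ...] = ()               # third layer

    def __post_init__(self):
        if self.offensive not in (0, 1):
            raise ValueError("offensive must be 0 or 1")
        unknown = set(self.targets) - set(HATE_SPEECH_TARGETS)
        if unknown:
            raise ValueError(f"unknown hate speech targets: {sorted(unknown)}")
        # Assumption: only offensive comments receive a level or targets.
        if self.offensive == 0 and (self.level is not None or self.targets):
            raise ValueError("non-offensive comments should have no level or targets")


example = AnnotatedComment(
    text="<comment text>",
    offensive=1,
    level=OffensivenessLevel.HIGHLY,
    targets=("partyism",),
)
```

Check the files in the dataset folder for the actual column names and label encodings before relying on a structure like this.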

Update: The HateBR 2.0 and HateBRXplain versions are available.

This repository contains the corpus and the best models presented in the LREC 2022 paper (see section "CITING / BIBTEX").

The following table gives the number of comments in each class of the binary layer:

Offensive Language

class	label	total
offensive	1	3,500
non-offensive	0	3,500
Total		7,000

We also provide baseline machine learning results for both tasks: offensive language detection and hate speech detection. The best models obtained are available here as .pkl files. File names follow the pattern [classification (offensive or hate)]_[representation (ngram or tfidf)]_[algorithm (nb, svm, mlp, or lr)].pkl. For example, offensive_tfidf_svm.pkl contains the offensive language detection model that uses a tf-idf representation and the support vector machine algorithm.
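As a minimal sketch of the naming convention above, the following Python snippet splits a model file name into its three parts and rejects names that do not fit the pattern. The function names `parse_model_filename` and `load_model` are hypothetical helpers, not part of this repository. `load_model` assumes the files are standard Python pickles; the README does not say which library or version produced them.

```python
import pickle
from pathlib import Path

# Allowed values for each part of the file name, as listed in the README.
CLASSIFICATIONS = {"offensive", "hate"}
REPRESENTATIONS = {"ngram", "tfidf"}
ALGORITHMS = {"nb", "svm", "mlp", "lr"}


def parse_model_filename(filename):
    """Split e.g. 'offensive_tfidf_svm.pkl' into its three named parts."""
    parts = Path(filename).stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected model file name: {filename!r}")
    classification, representation, algorithm = parts
    checks = (
        ("classification", classification, CLASSIFICATIONS),
        ("representation", representation, REPRESENTATIONS),
        ("algorithm", algorithm, ALGORITHMS),
    )
    for field_name, value, allowed in checks:
        if value not in allowed:
            raise ValueError(f"unknown {field_name} {value!r} in {filename!r}")
    return {
        "classification": classification,
        "representation": representation,
        "algorithm": algorithm,
    }


def load_model(path):
    """Load a model file. Assumption: it was saved with the standard pickle module."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


print(parse_model_filename("offensive_tfidf_svm.pkl"))
# {'classification': 'offensive', 'representation': 'tfidf', 'algorithm': 'svm'}
```

Unpickling generally requires the library that created the object to be installed in a compatible version, and pickle files should only be loaded from sources you trust.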

CITING / BIBTEX

@inproceedings{vargas-etal-2022-hatebr,
    title = "{H}ate{BR}: A Large Expert Annotated Corpus of {B}razilian {I}nstagram Comments for Offensive Language and Hate Speech Detection",
    author = "Vargas, Francielle and Carvalho, Isabelle and Rodrigues de G{\'o}es, Fabiana and Pardo, Thiago and Benevenuto, Fabr{\'\i}cio",
    booktitle = "Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022)",
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.777",
    pages = "7174--7183",
}

@article{Vargas_Carvalho_Pardo_Benevenuto_2024,
    author = {Vargas, Francielle and Carvalho, Isabelle and Pardo, Thiago A. S. and Benevenuto, Fabrício},
    title = {Context-aware and expert data resources for Brazilian Portuguese hate speech detection},
    DOI = {10.1017/nlp.2024.18},
    journal = {Natural Language Processing},
    year = {2024},
    pages = {1--22},
    url = {https://www.cambridge.org/core/journals/natural-language-processing/article/contextaware-and-expert-data-resources-for-brazilian-portuguese-hate-speech-detection/7D9019ED5471CD16E320EBED06A6E923#},
}

@inproceedings{salles-etal-2025-hatebrxplain,
    title = "HateBRXplain: A Benchmark Dataset with Human-Annotated Rationales for Explainable Hate Speech Detection in Brazilian Portuguese",
    author = "Salles, Isadora and Vargas, Francielle and Benevenuto, Fabr{\'\i}cio",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025)",
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "",
    pages = "",
}

Folders and files:

annotators
dataset
models
README.md
