Political Misogynistic Discourse Monitor
Political Misogynistic Discourse Monitor is a web application and an API that detects hate speech against women in Spanish and Portuguese.
This project is part of the 2021 JournalismAI Collab Challenges, a global initiative that brings media organisations together to explore innovative solutions for improving journalism through AI technologies. It was developed within the Americas cohort of the Collab Challenges, which focused on the question "How might we use AI technologies to innovate newsgathering and investigative reporting techniques?", as a collaboration between AzMina (Brazil), La Nación (Argentina), CLIP (Colombia) and Data Crítica (Mexico), with the support of the Knight Lab at Northwestern University.
JournalismAI is a project of Polis – the journalism think-tank at the London School of Economics and Political Science – and it’s sponsored by the Google News Initiative. If you want to know more about the Collab Challenges and other JournalismAI activities, sign up for the newsletter or get in touch with the team via [email protected]
Team members:
- Bárbara Libório
- Marina Gama Cubas
- Helena Bertho Dias
- Gabriela Bouret
- Jose Luis Peñarredonda
- Fernanda Aguirre
We would like to thank Ivan Vladimir for all his help developing the software and the web application. We also want to acknowledge IIMAS for hosting the project.
This collaboration is an attempt to accelerate the development of MonitorA, a project by AzMina with InternetLab and Institute Update, which gathered evidence and insights about systematic misogynistic attacks against women candidates in the 2020 Brazilian local elections.
According to the report Violence Against Women in Politics, this kind of violence deters women from participating in the political sphere, and women from marginalized communities are disproportionately affected. UN Women states that, in Latin America, women hold barely 30% of parliamentary seats. It also points out that "gender equality in the highest positions of power will not be reached for another 130 years". These facts led us to analyze how violence against women is perpetrated and how it affects their political participation. We want to report this kind of disinformation and these attacks throughout Latin America, in an effort to foster new narratives in which women have a safe space to participate in politics.
For this reason, although the AI model is able to identify violence against women in general, we focus on misogyny in political discourse as a case study in Latin America. In our view, automating the detection of misogynistic discourse is only a tool to help identify attacks against women within a large volume of Twitter data; the system highlights content that a human moderator can then analyze.
Since the collaborators are from Latin American countries, the model was trained with Spanish and Portuguese tweets posted between 2020 and 2021. We retrieved 4179 tweets from Twitter in `csv` format.
There are 270 tweets missing between the database we used to train the model and the database we share in this repository, because we couldn't recover the IDs of those tweets. All figures in the data analysis refer to the training database.
Training database | Repository database |
---|---|
4179 | 3909 |
We created one dictionary of misogynistic terms and phrases in Spanish and one in Portuguese. Along with that, we made a list of usernames of prominent politicians. However, we considered that those accounts wouldn't be inclusive enough, so we made a second list exclusively for diverse women (Black, Indigenous and LGBTQIA+) politicians, journalists and activists from Brazil, Argentina, Colombia and Mexico. Tweets mentioning usernames from both lists were then collected from Twitter using Meltwater and filtered with regular expressions built from the dictionaries, as in the sketch below.
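The dictionaries and the Meltwater export are not part of this repository, so the following is only a minimal sketch of the regular-expression filtering step, with hypothetical terms, filenames and column names:

```python
import re
import pandas as pd

# Hypothetical dictionary entries; the real Spanish and Portuguese dictionaries are larger.
misogynistic_terms = ["término uno", "término dos"]

# One case-insensitive pattern that matches any dictionary term as a whole word.
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in misogynistic_terms) + r")\b",
    flags=re.IGNORECASE,
)

# Hypothetical export (e.g., from Meltwater) with a 'text' column.
tweets = pd.read_csv("tweets_export.csv")

# Keep only the tweets that contain at least one dictionary term.
filtered = tweets[tweets["text"].str.contains(pattern, na=False)]
filtered.to_csv("filtered_tweets.csv", index=False)
```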
The data file includes three columns:

- ID: Since Twitter's policy prevents us from sharing the text of tweets, we only include the ID of each tweet; IDs can be downloaded and transformed back into the original text using available tools.
- Classification: Tweets are annotated with the label `1` if they are misogynistic or `0` if they are not. The misogynistic label is positive in 2637 tweets and negative in 1542 tweets.
- Language: There's a label for the language of the tweet, `es` for Spanish and `pt` for Portuguese. There are 2087 tweets in Spanish and 2092 in Portuguese.
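As a minimal loading sketch (assuming the shared file is a CSV; the filename and column names here are hypothetical and should be adjusted to the actual file in this repository):

```python
import pandas as pd

# Keep IDs as strings so large tweet IDs are not truncated.
df = pd.read_csv("tweets.csv", dtype={"ID": str})

# Class balance: 1 = misogynistic, 0 = not misogynistic.
print(df["Classification"].value_counts())

# Language distribution: 'es' (Spanish) vs. 'pt' (Portuguese).
print(df["Language"].value_counts())
```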
The annotation of this database was performed by six human annotators (five women and one man) whose first languages are Spanish or Portuguese and who are based in the countries covered by the dataset (Brazil, Argentina, Colombia and Mexico). To validate the annotation, every label was reviewed by a checker different from the first annotator. If the checker agreed with the label, the classification remained; otherwise, the tweet was removed from the database.
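A minimal sketch of that validation step, assuming a hypothetical file with one column for the first annotator's label and one for the checker's label:

```python
import pandas as pd

# Hypothetical file and column names for the two rounds of annotation.
annotations = pd.read_csv("annotations.csv")

# Keep a tweet only when the checker agrees with the first annotator;
# disagreements are dropped from the database.
validated = annotations[annotations["label_annotator"] == annotations["label_checker"]]
print(f"Removed {len(annotations) - len(validated)} tweets with conflicting labels")
```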
In order to create the classifier, we made use of five Colaboratory Python Notebooks:
- Data analysis: Basic analysis and statistics of the data.
- Train and evaluate model (2 versions): Trains a model and evaluates it, one for Transformers and another for Adapters.
- Labelling data (2 versions): Labels data entered directly in the notebook or loaded from a file, one for Transformers and another for Adapters.
There are several Natural Language Processing pre-processing steps that can be applied to the data:

- Lowercasing: All words are lowercased. (e.g., GitHub → github)
- Stop words: Remove words that are very common but don't provide useful information. (e.g., prepositions)
- Demojize: Change emojis to their textual representation. (e.g., ☺️ → :smiling_face:)
- URLs: Replace URLs with `$URL$`. (e.g., https://github.com/ → $URL$)
- Mentions: Replace mentions with `$MENTION$`. (e.g., @github → $MENTION$)
- Hashtags: Replace hashtags with `$HASHTAG$`. (e.g., #github → $HASHTAG$)
- Emojis: Replace emojis with `$EMOJI$`. (e.g., 😃 → $EMOJI$)
- Smileys: Replace smileys with `$SMILEY$`. (e.g., :) → $SMILEY$)
- Numbers: Replace numbers with `$NUMBER$`. (e.g., 4 → $NUMBER$)
- Escaped characters: Replace escaped characters with `$ESCAPE_CHAR$`. (e.g., char(2) → $ESCAPE_CHAR$)
It is worth mentioning that we obtained better results when lowercasing the text.
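A minimal sketch of some of these steps, assuming the `emoji` package for demojizing and simple regular expressions for the placeholder tokens (the project's notebooks may implement this differently):

```python
import re
import emoji

def preprocess(text: str, use_lower: bool = True) -> str:
    """Apply a subset of the pre-processing steps described above."""
    if use_lower:
        text = text.lower()
    # Demojize: ☺️ -> :smiling_face:
    text = emoji.demojize(text)
    # Replace URLs, mentions, hashtags and numbers with placeholder tokens.
    text = re.sub(r"https?://\S+", "$URL$", text)
    text = re.sub(r"@\w+", "$MENTION$", text)
    text = re.sub(r"#\w+", "$HASHTAG$", text)
    text = re.sub(r"\d+", "$NUMBER$", text)
    return text

print(preprocess("Miren esto @github https://github.com/ #github 😃"))
```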
Along with that, we followed a standard machine-learning methodology: part of the labelled data is used to train the model, which is then tested on another part of the data. During training, we validated the progress of the model on a third part of the data.
Split | Percentage | Tweets |
---|---|---|
Train | 80% | 3,343 (1673 pt, 1669 es) |
Test | 10% | 418 (210 pt, 209 es) |
Validation | 10% | 418 (209 pt, 209 es) |
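A minimal sketch of an 80/10/10 split with scikit-learn, stratified by label, assuming the hypothetical labelled DataFrame from the earlier examples:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("tweets.csv")  # hypothetical labelled file

# 80% train, stratified on the misogyny label.
train_df, rest_df = train_test_split(
    df, train_size=0.8, stratify=df["Classification"], random_state=42
)
# Split the remaining 20% in half: 10% test, 10% validation.
test_df, val_df = train_test_split(
    rest_df, test_size=0.5, stratify=rest_df["Classification"], random_state=42
)
print(len(train_df), len(test_df), len(val_df))
```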
This section shows some statistics and graphics of the labelled data.
Statistic | Frequency | Description |
---|---|---|
count | 19063 | Number of different words |
mean | 3.444841 | Average number of times a word appears |
std | 13.935922 | Standard deviation of word frequencies |
min | 1 | Minimum number of times a word appears |
25% | 1 | 25% of words appear at most this many times |
50% | 1 | 50% of words appear at most this many times |
75% | 2 | 75% of words appear at most this many times |
max | 1062 | Maximum number of times a word appears |
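Statistics like these can be reproduced with `pandas.Series.describe()` over per-word counts; a minimal sketch with a toy tokenized corpus:

```python
from collections import Counter
import pandas as pd

# Toy tokenized corpus; in practice these are the pre-processed tweets.
tweets = [["esta", "es", "una", "prueba"], ["isto", "é", "um", "teste", "teste"]]

# How many times each word appears across the corpus.
counts = Counter(word for tweet in tweets for word in tweet)
frequencies = pd.Series(counts)

# count, mean, std, min, 25%, 50%, 75%, max: the statistics in the table above.
print(frequencies.describe())
```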
This graph shows the full vocabulary of the data:
This graph shows the fifty most common words in the data:
These graphs show the number of tweets with a certain length:
This is a wordcloud with the most common words:
We tested several Transformer and Adapter models. Nevertheless, `cardiffnlp/twitter-xlm-roberta-base` was the one with the best F1 score:
Model | Type | F1 (both) | F1 (es) | F1 (pt) |
---|---|---|---|---|
cardiffnlp/twitter-xlm-roberta-base | Multilingual | 0.8728 | 0.9191 | 0.8235 |
neuralmind/bert-base-portuguese-cased | Portuguese | - | - | 0.875 |
dccuchile/bert-base-spanish-wwm-uncased | Spanish | - | 0.8985 | - |
mudes/multilingual-base | Multilingual | 0.8641 | 0.8929 | 0.8339 |
neuralmind/bert-base-portuguese-cased | Portuguese | - | - | 0.8496 |
PlanTL-GOB-ES/roberta-base-bne | Spanish | - | 0.9027 | - |
For more information about all the model performances, check out this technical report.
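As an illustration only (the actual training code lives in the notebooks listed above), the best-performing checkpoint can be loaded for binary classification with Hugging Face `transformers`:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cardiffnlp/twitter-xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Attach a binary classification head (1 = misogynistic, 0 = not misogynistic).
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("texto de ejemplo", return_tensors="pt", truncation=True)
logits = model(**inputs).logits
print(logits)  # meaningless until the head is fine-tuned on the labelled tweets
```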
This is the workflow structure we follow for the project:
To communicate with the API, we need an HTTP library to send requests and handle responses. There are a few libraries for making HTTP requests in Python; we'll use `requests` because it is well documented and simple.
Installing the package with conda:

```
conda install requests
```

Installing the package with pip:

```
pip install requests
```
The POST method is used when we want to submit data to the server to be processed. Here's an example of the syntax:

```python
requests.post(url, headers={key: value}, json={key: value}, data={key: value})
```
For more information about HTTP request methods, check out this guide.
Parameter | Description |
---|---|
url | A string with the endpoint |
headers | A dict of HTTP headers to send to the url |
json | A dict to send as the JSON body of the request |
files | A dict of files to send to the url |
data | A dict or list of tuples to send in the body of the request |
The `status_code` attribute shows the result of the request we sent. Responses can be grouped into five categories:

- Informational: `100`-`199`
- Successful: `200`-`299`
- Redirection: `300`-`399`
- Client error: `400`-`499`
- Server error: `500`-`599`

For more information about HTTP response status codes, check out this guide.
```python
import requests

# Endpoint for classifying a single tweet.
url = 'https://turing.iimas.unam.mx/pmdm/api/classify'
# Replace 'token' with your personal access token.
headers = {'access-token': 'token'}
# Text to be classified.
tweet = {'tweet': 'text to classify'}

response = requests.post(url, headers=headers, json=tweet)
print(response.status_code)
response.json()
```
Default tweet arguments:
```
{
    'tweet': 'string',
    'use_lower': 'false',
    'demojize': 'true',
    'process_urls': 'true',
    'process_mentions': 'true',
    'process_hashtags': 'true',
    'process_emojis': 'false',
    'process_smileys': 'false',
    'process_numbers': 'false',
    'process_escaped_chars': 'false'
}
```
```python
import requests

# Endpoint for classifying a file of tweets.
url = 'https://turing.iimas.unam.mx/pmdm/api/classify_file'
headers = {'access-token': 'token'}
files = {'uploaded_file': open('filename', 'rb')}

# Tweet arguments required
data = {
    'model': 'es',
    'use_lower': 'false',
    'demojize': 'true',
    'process_urls': 'true',
    'process_mentions': 'true',
    'process_hashtags': 'true',
    'process_emojis': 'false',
    'process_smileys': 'false',
    'process_numbers': 'false',
    'process_escaped_chars': 'false'}

response = requests.post(url, headers=headers, files=files, data=data)
print(response.status_code)
response.json()
```
For more examples, see this Jupyter Notebook.
In future work, we would like to create datasets from Latin American countries not yet included, so we can keep training the model. Furthermore, we will use the API to streamline detection and to analyze instances of misogynistic discourse on social media.
Since we are aware that managing an API is still not very accessible for many newsrooms in the region due to its technical requirements, we want to document and systematize use cases that will hopefully inspire and help other organizations to work with this tool.
If you want to collaborate or just to know more about the project, please reach out to us:
violentometro-online -> Documentation
- Datasheets for Datasets
- Ethical and technical challenges of AI in tackling hate speech
- Detección de Discurso de Odio en Redes Sociales mediante Transformers y Natural Language Processing
- Violência Política de Gênero: as diferenças entre os ataques recebidos por mulheres e seus oponentes
- Tackling Online Abuse and Disinformation Targeting Women in Politics
- #ShePersisted: why gendered disinformation