The Contentious Contexts Corpus dataset. This project was carried out in the context of the EuropeanaTech Challenge for Europeana Artificial Intelligence and Machine Learning datasets.
The dataset is made available under the CC-BY license.
The dataset is supported with the [Project Documentation](Dataset/Project Documentation.pdf) and the Datasheet.
The dataset is split into four subsets to reduce repetition in the data (and therefore its stored size) and to make the data easier to inspect.
- Extracts.csv: 2720 extracts from Dutch newspaper articles, obtained from OCR'd versions of the Europeana Newspaper collection, as provided by the KB, National Library of the Netherlands
- extract_id: H – expert annotators, c – control samples
- target: a target word that was used in a query
- target_compound: a target word found in an extract
- target_compound_bolded: a bolded target word found in an extract (mathematical sans-serif bold italic small Unicode characters are used)
- text: extract text of 5 sentences, centred around a bolded target word
- url: a url to Delpher to view the newspaper scan and the OCR'd text
- Annotations.csv: Anonymised multiple-choice responses from participants, who were asked whether the target word in the given textual context is contentious (to even the slightest degree) according to present-day sensibilities
- anonymised_participant_id: 'unknown_' prefix – expert annotators, 0–398 – Prolific annotators
- extract_id
- response: the multiple-choice option selected for each extract: “Omstreden naar huidige maatstaven” (“Contentious according to current standards”), “Niet omstreden” (“Not contentious”), “Weet ik niet” (“I don’t know”), or “Onleesbare OCR” (“Illegible OCR”)
- suggestion: a word that the annotator found contentious in the given extract (can be empty)
- is_control: boolean, True if the extract was used as a control
- Demographics.csv: Anonymised demographic data of the Prolific annotators; no demographic data was collected from the expert annotators
- anonymised_participant_id
- time_taken: time taken, in seconds
- age
- Country of Birth
- Current Country of Residence
- Employment Status
- First Language
- Fluent languages
- Nationality
- Sex
- Student Status
- Metadata.csv: metadata corresponding to the extracts in Extracts.csv. This metadata was extracted from the KB via the OAI-PMH protocol it provides
- url: same as in Extracts.csv
- europeana_issue_id
- datestamp
- date
- publisher
- spatial_distribution
- spatial_origin
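Because the four files share keys (extract_id links Annotations.csv to Extracts.csv, and url links Extracts.csv to Metadata.csv), they can be recombined when a single table is more convenient. The following is a minimal sketch using only the Python standard library; the helper names `read_rows`, `plain_target` and `join_corpus` are hypothetical and not part of the project code, and the demo rows are made up. It also shows one way to turn the bolded target word back into plain letters, assuming Unicode NFKC normalisation is acceptable for the rest of the string.

```python
import csv
import io
import unicodedata


def read_rows(handle):
    """Read an open CSV file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def plain_target(bolded):
    """Map mathematical sans-serif bold italic letters back to plain letters.

    NFKC normalisation replaces Unicode mathematical alphanumerics with
    their ordinary counterparts (it also normalises other compatibility
    characters, which is an accepted side effect in this sketch).
    """
    return unicodedata.normalize("NFKC", bolded)


def join_corpus(extracts, annotations, metadata):
    """Attach extract fields and newspaper metadata to every annotation row."""
    extract_by_id = {row["extract_id"]: row for row in extracts}
    metadata_by_url = {row["url"]: row for row in metadata}
    joined = []
    for annotation in annotations:
        extract = extract_by_id.get(annotation["extract_id"])
        if extract is None:
            continue  # skip annotations that have no matching extract
        issue = metadata_by_url.get(extract["url"], {})
        joined.append({**issue, **extract, **annotation})
    return joined


# Made-up rows standing in for the real CSV files (values are illustrative only).
demo_extracts = read_rows(io.StringIO(
    "extract_id,target,url\n"
    "H1,woord,http://example.org/a\n"
))
demo_annotations = read_rows(io.StringIO(
    "anonymised_participant_id,extract_id,response\n"
    "7,H1,Niet omstreden\n"
    "9,X9,Weet ik niet\n"
))
demo_metadata = read_rows(io.StringIO(
    "url,publisher\n"
    "http://example.org/a,Example Publisher\n"
))
demo_joined = join_corpus(demo_extracts, demo_annotations, demo_metadata)
```

With the real files, each table could be read with something like `read_rows(open("Extracts.csv", encoding="utf-8", newline=""))`; the file location and the UTF-8 encoding are assumptions here.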
Additional files:
- alpha_per_group.csv
- group: groups of annotators; group_1 – group_57 are Prolific groups, group_58 – group_60 are expert groups
- alpha: Krippendorff's alpha scores (annotator agreement)
- num_annotators: number of annotators in a group
- annotators_id: a list (str) of annotators' IDs in a group
- extracts_id: a list (str) of extract IDs in a batch (or per group)
- percentage_agreement.csv
- extract_id: same as in Extracts.csv
- target: same as in Extracts.csv
- omstreden: number of annotators in a group who selected the option “Omstreden naar huidige maatstaven”
- niet_omstreden: number of annotators in a group who selected the option “Niet omstreden”
- weet_ik_niet: number of annotators in a group who selected the option “Weet ik niet”
- bad_ocr: number of annotators in a group who selected the option “Onleesbare OCR”
- num_annotators: number of annotators in a group
- percentage_agreement: percentage agreement between annotators in a group, per extract
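The per-extract counts in percentage_agreement.csv can in principle be re-derived from the response column of Annotations.csv. The sketch below illustrates the idea under two assumptions that the project documentation may contradict: that the response strings in the file match the four option labels exactly, and that "percentage agreement" means the share of annotators who chose the most frequent option. The names `RESPONSE_COLUMNS`, `count_responses` and `majority_share` are hypothetical, and the sketch counts all annotators of an extract together rather than per group.

```python
from collections import Counter, defaultdict

# Option labels as listed for Annotations.csv, mapped to the count columns
# of percentage_agreement.csv (exact string match is an assumption).
RESPONSE_COLUMNS = {
    "Omstreden naar huidige maatstaven": "omstreden",
    "Niet omstreden": "niet_omstreden",
    "Weet ik niet": "weet_ik_niet",
    "Onleesbare OCR": "bad_ocr",
}


def count_responses(annotations):
    """Count, per extract_id, how many annotators chose each option."""
    counts = defaultdict(Counter)
    for row in annotations:
        column = RESPONSE_COLUMNS.get(row["response"])
        if column is not None:  # ignore unrecognised response strings
            counts[row["extract_id"]][column] += 1
    return counts


def majority_share(counter):
    """Percentage of annotators who chose the most frequent option.

    This definition is an assumption; the dataset may define
    percentage agreement differently (for example, pairwise).
    """
    total = sum(counter.values())
    return 100.0 * max(counter.values()) / total if total else 0.0


# Made-up annotation rows for illustration only.
demo_rows = [
    {"extract_id": "H1", "response": "Niet omstreden"},
    {"extract_id": "H1", "response": "Niet omstreden"},
    {"extract_id": "H1", "response": "Niet omstreden"},
    {"extract_id": "H1", "response": "Onleesbare OCR"},
]
demo_counts = count_responses(demo_rows)
demo_share = majority_share(demo_counts["H1"])
```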
See here for instructions for recreating the dataset components, i.e., sampling extracts, auto-assembly of Google Forms, and creation of the dataset files.
See here for instructions for performing the analyses and creating the figures presented in the K-CAP 2021 paper.