Data for PLOS ONE 2024 paper Breaking News: Unveiling a New Dataset for Portuguese News Classification and Comparative Analysis of Approaches
WikiNotícias (WikiNews) is a news site where articles are written collaboratively.
The code in this project prepares this data for text categorization studies.
- Download this repository
- Install the dependencies listed in requirements.txt
- Download the content from the Wikimedia dump service:
https://dumps.wikimedia.org/ptwikinews/
Select "all pages, current versions only" (ptwikinews-YYYYMMDD-pages-meta-current.xml.bz2)
In the referenced paper, we used the April 1, 2022 file (20220401):
https://dumps.wikimedia.your.org/ptwikinews/20220401/
- Uncompress the file into the folder ./content/raw
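The uncompress step can also be done from Python. The following is a minimal sketch using only the standard library; the helper name `decompress_dump` is hypothetical and is not part of this repository.

```python
import bz2
import shutil
from pathlib import Path


def decompress_dump(archive: Path, out_dir: Path) -> Path:
    """Decompress a .bz2 archive into out_dir and return the output path.

    Hypothetical helper: the output name is the archive name without
    its final suffix, so 'dump.xml.bz2' becomes 'dump.xml'.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / archive.stem
    # Stream the data so the whole dump never has to fit in memory.
    with bz2.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

For example, `decompress_dump(Path("ptwikinews-20220401-pages-meta-current.xml.bz2"), Path("content/raw"))` would write the XML file expected by the next step, assuming the archive was downloaded to the current directory.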
- Convert content to json:
python extractor.py \
--input content/raw/ptwikinews-20220401-pages-meta-current.xml \
--output content/json/wikinews_full.json
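To give an idea of what this conversion involves, here is a rough sketch of streaming `<page>` elements out of a MediaWiki XML dump. It is not the code of extractor.py: the function names and the output fields (`id`, `title`, `text`) are assumptions made for illustration, and the actual JSON layout produced by the script may differ.

```python
import io
import json
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield one dict per <page> element, parsing the dump as a stream."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        record = {"id": None, "title": None, "text": ""}
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                record["title"] = child.text
            elif name == "id" and record["id"] is None:
                # The first <id> inside <page> is the page id;
                # later ones belong to the revision or contributor.
                record["id"] = child.text
            elif name == "text":
                record["text"] = child.text or ""
        yield record
        elem.clear()  # release the finished page's subtree


# Tiny hand-written sample in the shape of a MediaWiki export.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Exemplo</title>
    <ns>0</ns>
    <id>42</id>
    <revision>
      <id>1001</id>
      <text>Corpo do artigo. [[Categoria:Desporto]]</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE.encode("utf-8"))))
as_json = json.dumps(pages, ensure_ascii=False)
```

Streaming with `iterparse` avoids loading the whole dump at once, which matters for larger wikis.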
- Select articles by category, removing articles that fall into more than one of the indicated categories.
python seletor.py \
--input content/json/wikinews_full.json \
--output content/json/wikinews_categories.json \
--categories 'Desporto' 'Crime, Direito e Justiça' 'Saúde' 'Economia e negócios' 'Política'
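The selection rule described in this step (keep an article only if it matches exactly one of the indicated categories) can be sketched as follows. This is an illustration, not the code of seletor.py: the `categories` and `label` field names and the function `select_exclusive` are assumptions.

```python
def select_exclusive(articles, wanted):
    """Keep articles that belong to exactly one of the wanted categories.

    Assumes each article is a dict with a 'categories' list (hypothetical
    schema). The single matching category is stored under 'label'.
    """
    wanted = set(wanted)
    selected = []
    for article in articles:
        hits = wanted & set(article["categories"])
        if len(hits) == 1:  # zero hits: off-topic; two or more: ambiguous
            selected.append({**article, "label": next(iter(hits))})
    return selected


sample = [
    {"id": "1", "categories": ["Saúde"]},
    {"id": "2", "categories": ["Saúde", "Política"]},  # dropped: two matches
    {"id": "3", "categories": ["Cultura"]},            # dropped: no match
    {"id": "4", "categories": ["Desporto", "Brasil"]},
]
result = select_exclusive(sample, ["Desporto", "Saúde", "Política"])
```

Dropping multi-category articles keeps the task single-label, so each remaining article has one unambiguous class.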
- Split data into train and test.
This is done in two steps. In the first, a file listing each message id and the part it is assigned to is generated. Keeping this file makes the split reproducible.
In the second step, the generated file is applied to the data set, producing the partitions.
Our generated file is available at 'content/json/split_ids.csv'. To use it, skip the partition file production step (the first command below).
python train_split.py \
--input content/json/wikinews_categories.json \
--splitfile content/json/split_ids.csv \
--operation generate
python train_split.py \
--input content/json/wikinews_categories.json \
--splitfile content/json/split_ids.csv \
--operation apply --train content/json/wikinews_train.json \
--test content/json/wikinews_test.json
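The two-step idea (first record which ID goes to which part, then apply that record to the data) can be sketched as below. This is not the code of train_split.py: the CSV column names `id` and `part`, the 20% test fraction, and the seed are all assumptions chosen for the example.

```python
import csv
import io
import random


def generate_split(ids, test_fraction=0.2, seed=0):
    """Return {id: 'train' | 'test'} from a seeded shuffle (assumed scheme)."""
    shuffled = sorted(ids)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return {aid: ("test" if i < n_test else "train") for i, aid in enumerate(shuffled)}


def write_split(split, handle):
    """Write the assignment as CSV with hypothetical columns 'id' and 'part'."""
    writer = csv.writer(handle)
    writer.writerow(["id", "part"])
    for aid, part in sorted(split.items()):
        writer.writerow([aid, part])


def read_split(handle):
    """Read the CSV back into {id: part}."""
    return {row["id"]: row["part"] for row in csv.DictReader(handle)}


def apply_split(articles, split):
    """Partition articles according to a previously saved assignment."""
    train = [a for a in articles if split.get(a["id"]) == "train"]
    test = [a for a in articles if split.get(a["id"]) == "test"]
    return train, test


articles = [{"id": str(i), "text": f"article {i}"} for i in range(1, 11)]
split = generate_split([a["id"] for a in articles])

buffer = io.StringIO()  # stands in for the split file on disk
write_split(split, buffer)
buffer.seek(0)
train, test = apply_split(articles, read_split(buffer))
```

Because the assignment is stored separately from the data, anyone with the same split file obtains the same partitions, regardless of random number generator or library version.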
If you find this dataset useful, please cite:
@article{10.1371/journal.pone.0296929,
doi = {10.1371/journal.pone.0296929},
author = {Garcia, Klaifer AND Shiguihara, Pedro AND Berton, Lilian},
journal = {PLOS ONE},
publisher = {Public Library of Science},
title = {Breaking news: Unveiling a new dataset for Portuguese news classification and comparative analysis of approaches},
year = {2024},
month = {01},
volume = {19},
url = {https://doi.org/10.1371/journal.pone.0296929},
pages = {1-15},
number = {1},
}