WikiNews dataset

Data for the PLOS ONE 2024 paper "Breaking News: Unveiling a New Dataset for Portuguese News Classification and Comparative Analysis of Approaches".

WikiNotícias (WikiNews) is a news site where articles are written collaboratively.

The code in this repository can be used to prepare this data for text categorization studies.

Dataset construction procedure

Download this repository.
Install the dependencies listed in requirements.txt.
Download the content from the MediaWiki dump service:
https://dumps.wikimedia.org/ptwikinews/
Select "all pages, current versions only" (ptwikinews-YYYYMMDD-pages-meta-current.xml.bz2)
For the referenced paper, we used the April 1, 2022 file (20220401):
https://dumps.wikimedia.your.org/ptwikinews/20220401/
Uncompress the file into the ./content/raw folder.
Convert the content to JSON:

    python extractor.py \
           --input content/raw/ptwikinews-20220401-pages-meta-current.xml \
           --output content/json/wikinews_full.json
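The uncompress step that precedes the conversion can also be done from Python instead of a command-line tool. The following is a minimal sketch using only the standard library; the helper name `decompress_bz2` is hypothetical and is not part of this repository.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src: Path, dest_dir: Path) -> Path:
    """Decompress a .bz2 file into dest_dir and return the output path.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Path.stem drops only the last suffix: "dump.xml.bz2" -> "dump.xml"
    dest = dest_dir / src.stem
    # Stream the data in chunks so a large dump is never fully loaded in memory
    with bz2.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

For example, `decompress_bz2(Path("ptwikinews-20220401-pages-meta-current.xml.bz2"), Path("content/raw"))` would write the XML file expected by the conversion command above.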

Select articles by category, removing articles that fall into more than one of the indicated categories:

    python seletor.py \
           --input content/json/wikinews_full.json \
           --output content/json/wikinews_categories.json \
           --categories 'Desporto' 'Crime, Direito e Justiça' 'Saúde' 'Economia e negócios' 'Política'
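The selection rule can be illustrated with a short sketch. It is not the repository's script: it assumes each article is a dictionary with a `categories` list, which is a hypothetical schema, and the `label` field and function name are likewise made up for illustration. An article is kept only when exactly one of the indicated categories applies to it.

```python
from typing import Dict, Iterable, List, Set

# Categories taken from the command above
TARGET_CATEGORIES = {
    "Desporto",
    "Crime, Direito e Justiça",
    "Saúde",
    "Economia e negócios",
    "Política",
}


def select_single_category(articles: Iterable[Dict], targets: Set[str]) -> List[Dict]:
    """Keep articles that match exactly one target category.

    Assumes each article has a "categories" list (hypothetical schema).
    """
    selected = []
    for article in articles:
        matches = targets.intersection(article["categories"])
        # Zero matches: out of scope. Two or more: ambiguous label, so dropped.
        if len(matches) == 1:
            selected.append({**article, "label": next(iter(matches))})
    return selected
```

Note that, in this sketch, an article tagged with one indicated category plus other unrelated categories is still kept; only overlaps among the indicated categories cause removal.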

Split the data into train and test sets.
This is done in two steps. The first step generates a file containing each message id and the part it is assigned to, which makes the split reproducible.
The second step applies the generated file to the dataset, producing the partitions.
Our generated file is available at content/json/split_ids.csv. To use it, skip the split-file generation step (the first command in the following box).

    python train_split.py \
           --input content/json/wikinews_categories.json \
           --splitfile content/json/split_ids.csv \
           --operation generate

    python train_split.py \
           --input content/json/wikinews_categories.json \
           --splitfile content/json/split_ids.csv \
           --operation apply --train content/json/wikinews_train.json \
           --test content/json/wikinews_test.json
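The two-step idea (first fix an id-to-part mapping in a file, then apply it) can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: the CSV column names `id` and `part`, the 20% test fraction, the seed, and the `id` field of each article are all hypothetical.

```python
import csv
import random
from typing import Dict, Iterable, List, Tuple


def generate_split(ids: Iterable[str], test_fraction: float = 0.2, seed: int = 42) -> Dict[str, str]:
    """Assign each id to "train" or "test" deterministically (assumed fraction and seed)."""
    shuffled = sorted(ids)  # sort first so the input order does not affect the result
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return {doc_id: ("test" if i < n_test else "train") for i, doc_id in enumerate(shuffled)}


def save_split(split: Dict[str, str], path: str) -> None:
    """Write the mapping as a CSV with assumed columns "id" and "part"."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "part"])
        writer.writerows(split.items())


def load_split(path: str) -> Dict[str, str]:
    """Read the mapping back; ids come back as strings."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row["id"]: row["part"] for row in csv.DictReader(f)}


def apply_split(articles: Iterable[Dict], split: Dict[str, str]) -> Tuple[List[Dict], List[Dict]]:
    """Partition articles according to a previously generated mapping."""
    train, test = [], []
    for article in articles:
        part = split.get(str(article["id"]))
        if part == "train":
            train.append(article)
        elif part == "test":
            test.append(article)
    return train, test
```

Storing the mapping in a file, rather than re-running a random split, means anyone applying the same file obtains the same partitions regardless of library version or random state.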

Citation

If you find this dataset useful, please cite:

@article{10.1371/journal.pone.0296929,
    doi = {10.1371/journal.pone.0296929},
    author = {Garcia, Klaifer AND Shiguihara, Pedro AND Berton, Lilian},
    journal = {PLOS ONE},
    publisher = {Public Library of Science},
    title = {Breaking news: Unveiling a new dataset for Portuguese news classification and comparative analysis of approaches},
    year = {2024},
    month = {01},
    volume = {19},
    url = {https://doi.org/10.1371/journal.pone.0296929},
    pages = {1-15},
    number = {1},
}
