Klaifer/PortugueseNewsDataset

WikiNews dataset

Data for the PLOS ONE 2024 paper "Breaking news: Unveiling a new dataset for Portuguese news classification and comparative analysis of approaches".

WikiNotícias (the Portuguese-language WikiNews) is a news site where articles are written collaboratively.

Code from this project can be used to leverage this data for text categorization studies.

Dataset construction procedure

  1. Download this repository.
  2. Install the dependencies listed in requirements.txt.
  3. Download the content from the MediaWiki dump service.
    https://dumps.wikimedia.org/ptwikinews/
    Select "all pages, current versions only" (ptwikinews-YYYYMMDD-pages-meta-current.xml.bz2).
    For the referenced paper, we used the April 1, 2022 file (20220401).
    https://dumps.wikimedia.your.org/ptwikinews/20220401/
  4. Uncompress the file into the folder ./content/raw.
  5. Convert the content to JSON:
    python extractor.py \
           --input content/raw/ptwikinews-20220401-pages-meta-current.xml \
           --output content/json/wikinews_full.json 
  6. Select articles by category, removing articles that fall into more than one of the indicated categories.
    python seletor.py \
           --input content/json/wikinews_full.json \
           --output content/json/wikinews_categories.json \
           --categories 'Desporto' 'Crime, Direito e Justiça' 'Saúde' 'Economia e negócios' 'Política'
  7. Split the data into train and test sets.
    This is done in two steps. The first generates a file listing each message id and the partition it is assigned to, which makes the split reproducible.
    The second applies that file to the dataset, producing the partitions.
    Our generated file is available at 'content/json/split_ids.csv'. To use it, skip the partition file generation step (the first command below).
    python train_split.py \
           --input content/json/wikinews_categories.json \
           --splitfile content/json/split_ids.csv \
           --operation generate

    python train_split.py \
           --input content/json/wikinews_categories.json \
           --splitfile content/json/split_ids.csv \
           --operation apply --train content/json/wikinews_train.json \
           --test content/json/wikinews_test.json
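
The selection and split steps above can be sketched roughly as follows. This is only an illustration of the idea, not the code in seletor.py or train_split.py: the record fields ("id", "categories", "label"), the split file columns ("id", "part"), the 20% test fraction, and the seed are all assumptions made for the example.

```python
import csv
import io
import random


def select_single_category(articles, categories):
    """Keep articles that match exactly one of the target categories."""
    wanted = set(categories)
    selected = []
    for article in articles:
        hits = wanted.intersection(article["categories"])
        if len(hits) == 1:  # drop articles with zero or several target categories
            selected.append({**article, "label": next(iter(hits))})
    return selected


def generate_split(ids, test_fraction=0.2, seed=42):
    """Step 1: assign every id to 'train' or 'test' (hypothetical fraction and seed)."""
    shuffled = sorted(ids)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    test_ids = set(shuffled[:n_test])
    return [(i, "test" if i in test_ids else "train") for i in sorted(ids)]


def write_split(rows, handle):
    """Store the assignment so that the same split can be reapplied later."""
    writer = csv.writer(handle)
    writer.writerow(["id", "part"])  # assumed column names
    writer.writerows(rows)


def read_split(handle):
    """Load the assignment as a dict mapping id (as text) to partition name."""
    return {row["id"]: row["part"] for row in csv.DictReader(handle)}


def apply_split(articles, split):
    """Step 2: partition the articles according to a stored assignment."""
    parts = {"train": [], "test": []}
    for article in articles:
        parts[split[str(article["id"])]].append(article)
    return parts["train"], parts["test"]


# Toy records standing in for the extracted JSON (field names are assumed).
articles = [
    {"id": 1, "categories": ["Desporto"]},
    {"id": 2, "categories": ["Saúde", "Política"]},    # two target categories: dropped
    {"id": 3, "categories": ["Política", "Brasil"]},    # one target category: kept
    {"id": 4, "categories": ["Cultura"]},               # no target category: dropped
    {"id": 5, "categories": ["Economia e negócios"]},
    {"id": 6, "categories": ["Saúde"]},
    {"id": 7, "categories": ["Crime, Direito e Justiça"]},
]
targets = ["Desporto", "Crime, Direito e Justiça", "Saúde",
           "Economia e negócios", "Política"]

selected = select_single_category(articles, targets)
print([a["id"] for a in selected])  # [1, 3, 5, 6, 7]

# An in-memory buffer stands in for the split file on disk.
buffer = io.StringIO()
write_split(generate_split([a["id"] for a in selected]), buffer)
buffer.seek(0)
train, test = apply_split(selected, read_split(buffer))
print(len(train), len(test))  # 4 1
```

Keeping the assignment in a separate file, rather than re-shuffling on every run, is what lets others reproduce exactly the same train and test partitions.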

Citation

If you find this dataset useful, please cite:

@article{10.1371/journal.pone.0296929,
    doi = {10.1371/journal.pone.0296929},
    author = {Garcia, Klaifer AND Shiguihara, Pedro AND Berton, Lilian},
    journal = {PLOS ONE},
    publisher = {Public Library of Science},
    title = {Breaking news: Unveiling a new dataset for Portuguese news classification and comparative analysis of approaches},
    year = {2024},
    month = {01},
    volume = {19},
    url = {https://doi.org/10.1371/journal.pone.0296929},
    pages = {1-15},
    number = {1},
}
