NatureMagazineScraper

Scrape open-access Nature articles and store them as text files.

Key Features

  • User can specify which year's articles to scrape/analyze
  • User can specify a maximum count per word per article, so a word repeated heavily in one article is not over-counted
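
The per-article cap can be illustrated with a minimal sketch. The function name `capped_counts`, the `max_per_word` parameter, and the tokenizing regex are assumptions made for this example, not names taken from the repository.

```python
import re
from collections import Counter


def capped_counts(text, max_per_word=5):
    """Count words in one article, capping each word at max_per_word.

    Hypothetical helper: the name and the default cap are illustrative.
    """
    # Lowercase and keep runs of letters; real tokenization may differ.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    # Clamp every count so one article cannot dominate the totals.
    return Counter({word: min(n, max_per_word) for word, n in counts.items()})


article = "cell cell cell cell gene gene protein"
capped = capped_counts(article, max_per_word=2)
print(capped)
```

With a cap of 2, an article that repeats "cell" four times contributes only 2 to that word's total, the same as an article that mentions it twice.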

scraper.py

Scrape articles using Beautiful Soup and store them as text files
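
As a rough sketch of the parse-and-save step, the example below extracts paragraph text from an HTML string with Beautiful Soup and writes it to a text file. It assumes the article body sits in `<p>` tags inside an `<article>` element, which may not match the markup of the real pages; `extract_article_text` and `save_article` are hypothetical names, and fetching the page is left out.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def extract_article_text(html):
    """Return the paragraph text of an article, one paragraph per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: body text lives in <p> tags inside an <article> element.
    container = soup.find("article") or soup
    paragraphs = [p.get_text(" ", strip=True) for p in container.find_all("p")]
    return "\n".join(p for p in paragraphs if p)


def save_article(text, out_dir, name):
    """Write the text to <out_dir>/<name>.txt and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.txt"
    path.write_text(text, encoding="utf-8")
    return path


sample_html = (
    "<html><body><nav><p>Menu</p></nav>"
    "<article><h1>Title</h1><p>First paragraph.</p>"
    "<p>Second <b>bold</b> paragraph.</p></article></body></html>"
)
sample_text = extract_article_text(sample_html)
print(sample_text)
```

Restricting the search to the `<article>` element keeps navigation text such as "Menu" out of the saved file.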

analyzer.py

Parse scraped articles and sum up word counts
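
One way the summing step might look, assuming the scraped articles are `.txt` files in a single directory. The function names and the directory layout are assumptions for this sketch.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path


def count_words(text):
    """Count lowercase alphabetic words in a single article."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def sum_word_counts(article_dir):
    """Add up word counts across every .txt file in article_dir."""
    totals = Counter()
    for path in sorted(Path(article_dir).glob("*.txt")):
        totals.update(count_words(path.read_text(encoding="utf-8")))
    return totals


# Demo on two throwaway files so the example runs without scraped data.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("Genes encode proteins", encoding="utf-8")
    Path(tmp, "b.txt").write_text("Proteins fold; genes mutate", encoding="utf-8")
    totals = sum_word_counts(tmp)

print(totals.most_common(2))
```

`Counter.update` adds counts rather than replacing them, so the totals accumulate across files.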

data_cleaner.py

Remove common words and other biased words

better_data_cleaner.py

A better approach to cleaning scraped data using TF-IDF, document frequency analysis and z-score outlier detection
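
The following is a simplified, standard-library sketch of how these three ideas can fit together. It is not the repository's implementation: the function names, the z-score threshold of 2.0, and the choice to flag words by the z-score of their document frequency are all assumptions.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def document_frequency(docs):
    """Number of documents each word appears in (docs are token lists)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def tf_idf(tokens, df, n_docs):
    """TF-IDF per word: relative term frequency times log(N / df)."""
    tf = Counter(tokens)
    return {w: (c / len(tokens)) * math.log(n_docs / df[w]) for w, c in tf.items()}


def df_outliers(df, z_threshold=2.0):
    """Words whose document frequency is an unusually high outlier."""
    values = list(df.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return set()  # every word is equally common; nothing stands out
    return {w for w, v in df.items() if (v - mu) / sigma > z_threshold}


docs = [
    "the cell divides".split(),
    "the gene mutates".split(),
    "the protein folds".split(),
    "the enzyme binds".split(),
]
df = document_frequency(docs)
too_common = df_outliers(df, z_threshold=2.0)
scores = tf_idf(docs[0], df, len(docs))
print(too_common, scores)
```

In this toy corpus "the" appears in every document, so its TF-IDF score is zero and its document frequency is the only value far above the mean, which makes it the one word flagged for removal.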