This is a Python-based web scraper for news sites.
- Clone this repo and move into the repo folder, then run:
python -m venv .env
source .env/bin/activate
pip install -r requirements.txt
python main.py
Then wait a few minutes while it runs.
The settings file is scraper_config.yaml.
It contains a list of news sites. Follow the existing file structure when adding a site.
For every site, the config is:
- sitename: Name of the folder where scraped news will be saved.
- site: URL of the page where the news site lists its articles.
- links: XPath for the article URLs.
- title: XPath for the article title text.
- summary: XPath for the article summary text.
- body_paragraphs: XPath for the article body paragraph text.
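
As a quick sanity check before running the scraper, a site entry can be tested for the six keys listed above. This is only a minimal sketch and is not part of the repo: the helper `missing_keys`, the example URL, and the XPath expressions are all hypothetical, and it assumes a site entry has already been loaded from scraper_config.yaml into a plain Python dict.

```python
# Keys every site entry is expected to define, per the list above.
REQUIRED_KEYS = ("sitename", "site", "links", "title", "summary", "body_paragraphs")


def missing_keys(entry):
    """Return the required keys that are absent or empty in one site entry."""
    return [key for key in REQUIRED_KEYS if not entry.get(key)]


# Hypothetical entry: the URL and XPath expressions are placeholders,
# not values taken from the real scraper_config.yaml.
example_entry = {
    "sitename": "example_news",
    "site": "https://news.example.com/latest",
    "links": "//h2/a/@href",
    "title": "//h1/text()",
    "summary": "//p[@class='summary']/text()",
    "body_paragraphs": "//article//p/text()",
}

print(missing_keys(example_entry))  # []

# An entry that only sets the first two keys reports the other four.
incomplete = {"sitename": "example_news", "site": "https://news.example.com/latest"}
print(missing_keys(incomplete))  # ['links', 'title', 'summary', 'body_paragraphs']
```

An empty list means the entry defines every key. It does not check that the XPath expressions actually match anything on the target site.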