Code for the "scrapers": scripts that scrape agency information and post it to the Django API.
├── agency_api_service.py - connects to GovLens API for agency info
├── agency_dataaccessor.py - read/write to/from database containing scraped info
├── lighthouse.py - connects to Google Lighthouse API
├── process_agency_info.py - connects to an agency site & runs scrapers
├── README.rst - this file!
├── scrape_handler.py - **Start here!** Starts API services and maps to agency processors.
├── urls.json - list of URLS pointing to government sites
├── data/
│ └── agencies.csv - spreadsheet of scraped information (possibly a mirror of the Google Sheet?)
└── scrapers/
├── __init__.py
├── accessibility_scraper.py - scrapes for multi-language support, performance, mobile-friendliness
├── base_api_client.py
├── base_scraper.py - base class for scrapers to inherit
├── security_scraper.py - scrapes for HTTPS & privacy policy
└── social_scraper.py - scrapes for phone number, email, address, social media
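Each scraper inherits from the common base class in base_scraper.py. The sketch below illustrates that pattern only; the class name BaseScraper, the scrape() method, and the result dict shape are assumptions for illustration, not the actual interface (see scrapers/base_scraper.py for that):

```python
# Illustrative sketch of the base-class pattern used by the scrapers.
# BaseScraper, scrape(), and the result dict are assumed names, not the
# real API; see scrapers/base_scraper.py for the actual interface.


class BaseScraper:
    """Shared plumbing for all scrapers (assumed interface)."""

    def __init__(self, url: str):
        self.url = url

    def scrape(self) -> dict:
        raise NotImplementedError("subclasses must implement scrape()")


class SecurityScraper(BaseScraper):
    """Toy HTTPS check, mirroring security_scraper.py's purpose."""

    def scrape(self) -> dict:
        return {"uses_https": self.url.startswith("https://")}


result = SecurityScraper("https://example.gov").scrape()
# result == {"uses_https": True}
```

Adding a new scraper then amounts to subclassing the base class and implementing its scrape method.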
There are a few required environment variables. The easiest way to set them in development is to create a file called .env in the root directory of this repository (don't commit this file). The .env file should contain the following::

    GOVLENS_API_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    GOVLENS_API_ENDPOINT=http://127.0.0.1:8000/api/agencies/
    GOOGLE_API_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXX
To get the GOOGLE_API_TOKEN, visit https://developers.google.com/speed/docs/insights/v5/get-started
To get the GOVLENS_API_TOKEN, run ``python3 manage.py create_scraper_user``. Copy the token from the command's output and paste it into the .env file.
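The scrapers read these variables from the process environment at runtime. A minimal sketch of how they might be loaded, assuming plain os.environ access (the project may instead use a dotenv helper to read the .env file; the load_config name is illustrative):

```python
import os

# The three variables documented above.
REQUIRED_VARS = ("GOVLENS_API_TOKEN", "GOVLENS_API_ENDPOINT", "GOOGLE_API_TOKEN")


def load_config() -> dict:
    """Read the required variables, failing fast with a clear message."""
    missing = [name for name in REQUIRED_VARS if name not in os.environ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in REQUIRED_VARS}
```

Failing fast like this surfaces a missing or misnamed variable immediately instead of partway through a scrape.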
Once you have created the .env file as described above, run the scraper from the root directory of the repository::

    python3 -m scrapers.scrape_handler
The scraper is intended to be used both locally and on AWS Lambda.
The scrapers directory in the root of this repository is the top-level Python package for this project. This means that any absolute imports should begin with scrapers.MODULE_NAME_HERE.
scrapers/scrape_handler.py is the main Python module that gets invoked. On AWS Lambda, the function scrape_handler.scrape_data() is imported and called directly.
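On Lambda, the handler is configured as a "module.function" string that the runtime resolves by importing the module and looking up the function, much like this self-contained sketch (the resolve_handler helper is illustrative; the example uses a stdlib module so it can run anywhere):

```python
import importlib


def resolve_handler(handler: str):
    """Resolve an AWS-style 'module.function' handler string, the same
    way Lambda would locate scrape_handler.scrape_data."""
    module_name, _, func_name = handler.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, func_name)


# Demonstrated with a stdlib function so the sketch is runnable:
func = resolve_handler("json.dumps")
# func({"ok": True}) == '{"ok": true}'
```

For this project the handler string would name scrape_handler.scrape_data, resolved relative to the uploaded package.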
Pushing it to AWS Lambda:

- Zip the scrapers/ folder.
- Go to AWS Lambda and upload the zipped folder: https://console.aws.amazon.com/lambda/home?region=us-east-1#/functions
- Test the Lambda by invoking it with a test JSON event (??).
- Confirm that there are no errors by checking the CloudWatch logs: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logStream:group=/aws/lambda/scrapers;streamFilter=typeLogStreamPrefix
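The zip step above can also be scripted. A minimal stdlib sketch, assuming the package directory is scrapers/ as described earlier (the zip_package name and output filename are illustrative):

```python
import zipfile
from pathlib import Path


def zip_package(src_dir: str, out_path: str) -> None:
    """Recursively zip src_dir, preserving paths relative to its parent,
    so the archive unpacks as a scrapers/ directory on Lambda."""
    src = Path(src_dir)
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in src.rglob("*"):
            if path.is_file():
                zf.write(path, path.relative_to(src.parent))


# Example: zip_package("scrapers", "scrapers.zip") before uploading.
```

Keeping the scrapers/ prefix inside the archive matters: the handler string resolves scrape_handler relative to the unpacked package layout.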