generated from hackforla/.github-hackforla-base-repo-template
99 Neighborhood Council Websites Technologies Used Analysis
Willa Mannering edited this page May 26, 2022 · 11 revisions
Project to create a scraper that collects information from builtwith.com on the technologies used by the 99 neighborhood council (NC) websites.
Automate scrape job to run periodically.
- Be able to run script on demand
- Gather the following information
- Name of Tech on each NC site (their entire site, not just the homepage)
- URL of Tech
- Category of Tech
- Total NCs using Tech
- Total Categories
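The fields listed above could be assembled from per-site scrape results roughly as follows. This is a minimal sketch, not the project's actual script: the record layout, the sample data, and the function names `build_tech_table` and `count_categories` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample records: one row per (NC site, technology) sighting.
# A technology can appear several times for one site when it is found on
# more than one page, so sites are de-duplicated with a set below.
records = [
    {"site": "nc-a.example.org", "tech": "WordPress",
     "url": "https://example.org/tech/wordpress", "category": "CMS"},
    {"site": "nc-a.example.org", "tech": "Google Analytics",
     "url": "https://example.org/tech/google-analytics", "category": "Analytics"},
    {"site": "nc-b.example.org", "tech": "WordPress",
     "url": "https://example.org/tech/wordpress", "category": "CMS"},
    {"site": "nc-b.example.org", "tech": "WordPress",
     "url": "https://example.org/tech/wordpress", "category": "CMS"},
]


def build_tech_table(records):
    """Return one row per technology with its URL, category, and NC count."""
    sites_by_tech = defaultdict(set)
    details = {}
    for row in records:
        sites_by_tech[row["tech"]].add(row["site"])
        details[row["tech"]] = (row["url"], row["category"])

    table = [
        {
            "tech": tech,
            "url": details[tech][0],
            "category": details[tech][1],
            "total_ncs": len(sites),
        }
        for tech, sites in sites_by_tech.items()
    ]
    # Most widely used technologies first; ties broken alphabetically.
    table.sort(key=lambda r: (-r["total_ncs"], r["tech"]))
    return table


def count_categories(records):
    """Return the number of distinct technology categories seen."""
    return len({row["category"] for row in records})
```

Counting sites through a set means a technology detected on several pages of the same NC site is still counted once for that NC.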
- 2021-07-01 New issue created https://github.com/hackforla/data-science/issues/44
- 2021-07-08 The API didn’t return the information required for the project, so we will scrape with Selenium instead
- 2021-08-09 Sophia ran a video tutorial session on scraping with Selenium and shared some starter code. The next step is for the person assigned to the issue to parse the output into a usable format and save it as a file
- 2021-08-30 Abe joined as the Cop PM and said he would get up to speed and then move the issue forward.
- 2021-09-13 Rajinder assigned
- Rajinder creates a web-scraping script that covers all the websites. It includes a Dockerfile and produces a JSON file as its output. He adds the code to the DS repository
- Sofia helps Rajinder sort out the API rate-limit problem by having the script hit the API only once every 30 seconds.
- Rajinder updates his personal version of the script
- Currently waiting for him to update the DS repository with the changes (in the meantime Ryan has saved the code from Rajinder's repo, just in case).
- Willa updated Rajinder's script to also include tech URL and tech category
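The rate-limit workaround in the timeline (one API call every 30 seconds, results saved as JSON) could look something like the sketch below. It is not Rajinder's code: the `Throttle` class, the `collect` function, and the `lookup` callable are hypothetical names, and `lookup` stands in for whatever actually queries Builtwith for one site.

```python
import json
import time


class Throttle:
    """Block so that successive calls to wait() are at least min_interval apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def collect(sites, lookup, throttle):
    """Call lookup(site) for every site, throttled, and return a JSON string."""
    results = {}
    for site in sites:
        throttle.wait()
        results[site] = lookup(site)
    return json.dumps(results, indent=2)
```

With `Throttle(30)`, a run over 99 sites takes a little under 50 minutes, which is worth keeping in mind when scheduling the periodic job.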
OCS: Builtwith data on 99 NCs technologies
Updated spreadsheet, OCS: Builtwith tech_table
- Builtwith
- https://builtwith.com
- Builtwith API
- API limitations: Some sites are resistant to being crawled (WordPress sites, for instance: https://atwatervillage.org/calendar/), so we need a list of all the sites that can't be put through the sitemap maker. See the notes on crawling WordPress sites: https://community.funnelback.com/knowledge-base/implementation/Gather-And-Index/integration/crawl-wordpress-sites
- Selenium
- Docker
- Target Website List Here - this is one tab on a larger analysis workbook.
- Data Science wiki, 99 NC project
- Spreadsheet of Rajinder's script results
- Spreadsheet of updated script results
- Code on the data-science repo (Rajinder's code) - this will need to be moved to another directory. It has nothing to do with 311; it's a project for the Open Community Survey
- Rajinder's personal repo - this seems to be updated more recently than the one on data-science.
- NC data scraping task for web technologies open-community-survey/#25
- Conduct analysis of 99 NC website technologies open-community-survey/#52
@akibrhast, @ava li, @Sarah Williams, @wendywilhelm10 @rajindermavi @ShikaZzz @JessicaFB @Poorvi Rao
@kalyaniraman, @akhaleghi, @ryanswan @salice