8 changes: 4 additions & 4 deletions .github/workflows/common_crawler.yaml
@@ -23,17 +23,17 @@ jobs:
 - name: Upgrade pip
 run: python -m pip install --upgrade pip
 - name: Install dependencies
-run: pip install -r common_crawler/requirements_common_crawler_action.txt
+run: pip install -r source_collectors/common_crawler/requirements_common_crawler_action.txt
 - name: Run script
-run: python common_crawler/main.py CC-MAIN-2024-10 *.gov police --config common_crawler/config.ini --pages 20
+run: python source_collectors/common_crawler/main.py CC-MAIN-2024-10 *.gov police --config source_collectors/common_crawler/config.ini --pages 20
 - name: Configure Git
 run: |
 git config --local user.email "[email protected]"
 git config --local user.name "GitHub Action"
 - name: Add common_crawler cache and common_crawler batch_info
 run: |
-git add common_crawler/data/cache.json
-git add common_crawler/data/batch_info.csv
+git add source_collectors/common_crawler/data/cache.json
+git add source_collectors/common_crawler/data/batch_info.csv
 - name: Commit changes
 run: git commit -m "Update common_crawler cache and batch_info"
 - name: Push changes
2 changes: 1 addition & 1 deletion .github/workflows/python_checks.yml
@@ -18,5 +18,5 @@ jobs:
 uses: reviewdog/action-flake8@v3
 with:
 github_token: ${{ secrets.GITHUB_TOKEN }}
-flake8_args: --ignore E501,W291,W293,D401,D400,E402,E302,D200,D202,D205
+flake8_args: --ignore E501,W291,W293,D401,D400,E402,E302,D200,D202,D205,W503,E203,D204,D403
 level: warning
218 changes: 218 additions & 0 deletions source_collectors/ckan/README.md
@@ -0,0 +1,218 @@
# CKAN Scraper

## Introduction

This scraper can be used to retrieve package information from [CKAN](https://ckan.org/), which hosts open data projects such as <https://data.gov/>. CKAN API documentation can be found at <https://docs.ckan.org/en/2.9/api/>.

Running the scraper searches the configured CKAN portals using the search terms and writes the matching packages to a CSV file.

## Definitions

* `Package` - Also called a dataset. A package is a page containing relevant information about a dataset. For example, this page is a package: <https://catalog.data.gov/dataset/electric-vehicle-population-data>.
* `Collection` - A grouping of child packages, related to a parent package. This is separate from a group.
* `Group` - Also called a topic, is a grouping of packages. Packages in a group do not have a parent package. Groups can also contain subgroups.
* `Organization` - Organizations are what the data in packages belong to, such as "City of Austin" or "Department of Energy". Organization types are groups of organizations that share something in common with each other.

## Files

* `scrape_ckan_data_portals.py` - The main scraper file. Running this will execute a search across multiple CKAN instances and output the results to a CSV file.
* `search_terms.py` - The search terms and CKAN portals to search from.
* `ckan_scraper_toolkit.py` - Toolkit of functions that use ckanapi to retrieve packages from CKAN data portals.

## Setup

1. In a terminal, navigate to the CKAN scraper folder
```cmd
cd source_collectors/ckan/
```
2. Create and activate a Python virtual environment
```cmd
python -m venv venv
source venv/bin/activate
```

3. Install the requirements
```cmd
pip install -r requirements.txt
```
4. Run the multi-portal CKAN scraper
```cmd
python scrape_ckan_data_portals.py
```
5. Review the generated `results.csv` file.

## How can I tell if a website I want to scrape is hosted using CKAN?

There's no easy way to tell. Some websites reference CKAN or link back to the CKAN documentation, while others do not. There doesn't seem to be a database of all CKAN instances either.

The best way to determine if a data catalog is using CKAN is to attempt to query its API. To do this:

1. In a web browser, navigate to the website's data catalog (e.g. for data.gov this is at <https://catalog.data.gov/dataset/>)
2. Copy the first part of the link (e.g. <https://catalog.data.gov/>)
3. Paste it in the browser's URL bar and add `api/3/action/package_search` to the end (e.g. <https://catalog.data.gov/api/3/action/package_search>)

*NOTE: Some hosts use a different base URL for API requests. For example, Canada's Open Government Portal can be found at <https://search.open.canada.ca/opendata/> while the API access link is <https://open.canada.ca/data/en/api/3/action/package_search> as described in their [Access our API](https://open.canada.ca/en/access-our-application-programming-interface-api) page*
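
The manual check above can also be scripted. The following is a minimal sketch, not part of the toolkit: the function names are hypothetical, and it assumes the portal answers with the usual CKAN action API envelope (a JSON object with `"success": true` and a `"result"` object).

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def build_package_search_url(base_url: str) -> str:
    """Append the package_search action path to a portal's base URL."""
    # A trailing slash makes urljoin keep any path prefix such as /data/en/.
    if not base_url.endswith("/"):
        base_url += "/"
    return urljoin(base_url, "api/3/action/package_search")


def looks_like_ckan(payload: object) -> bool:
    """Return True if decoded JSON has the shape of a CKAN action API reply."""
    return (
        isinstance(payload, dict)
        and payload.get("success") is True
        and isinstance(payload.get("result"), dict)
    )


def is_ckan_portal(base_url: str, timeout: float = 10.0) -> bool:
    """Query the portal and report whether the reply looks like CKAN (needs network)."""
    try:
        with urlopen(build_package_search_url(base_url), timeout=timeout) as response:
            return looks_like_ckan(json.load(response))
    except (OSError, ValueError):
        # Network errors, HTTP errors, and non-JSON bodies all count as "not CKAN".
        return False
```

Calling `is_ckan_portal("https://catalog.data.gov/")` performs the same request as step 3. A `False` result only means this particular base URL did not answer like CKAN; as the note above says, some hosts serve the API from a different base URL.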

Another way to tell is by looking at the page layout. Most CKAN instances have a similar layout to one another. You can see an example at <https://catalog.data.gov/dataset/> and <https://opendata.swiss/en/group/gove>. Both catalogues have a sidebar on the left with search refinement options, a search box on the top below the page title, and a list of datasets to the right of the sidebar among other similarities.

## Documentation for ckan_scraper_toolkit.py

### On ckanapi return data

Across CKAN instances, the layout of the ckanapi return data is largely the same. The key difference among instances is in the `extras` key, where an instance may define its own custom keys. A truncated example ckanapi return is provided below. This is the general layout returned by most of the toolkit's functions:

```json
{
"author": null,
"author_email": null,
"id": "f468fe8a-a319-464f-9374-f77128ffc9dc",
"maintainer": "NYC OpenData",
"maintainer_email": "[email protected]",
"metadata_created": "2020-11-10T17:05:36.995577",
"metadata_modified": "2024-10-25T20:28:59.948113",
"name": "nypd-arrest-data-year-to-date",
"notes": "This is a breakdown of every arrest effected in NYC by the NYPD during the current year.\n This data is manually extracted every quarter and reviewed by the Office of Management Analysis and Planning. \n Each record represents an arrest effected in NYC by the NYPD and includes information about the type of crime, the location and time of enforcement. \nIn addition, information related to suspect demographics is also included. \nThis data can be used by the public to explore the nature of police enforcement activity. \nPlease refer to the attached data footnotes for additional information about this dataset.",
"organization": {
"id": "1149ee63-2fff-494e-82e5-9aace9d3b3bf",
"name": "city-of-new-york",
"title": "City of New York",
"description": "",
...
},
"title": "NYPD Arrest Data (Year to Date)",
"extras": [
{
"key": "accessLevel",
"value": "public"
},
{
"key": "landingPage",
"value": "https://data.cityofnewyork.us/d/uip8-fykc"
},
{
"key": "publisher",
"value": "data.cityofnewyork.us"
},
...
],
"groups": [
{
"description": "Local Government Topic - for all datasets with state, local, county organizations",
"display_name": "Local Government",
"id": "7d625e66-9e91-4b47-badd-44ec6f16b62b",
"name": "local",
"title": "Local Government",
...
}
],
"resources": [
{
"created": "2020-11-10T17:05:37.001960",
"description": "",
"format": "CSV",
"id": "c48f1a1a-5efb-4266-9572-769ed1c9b472",
"metadata_modified": "2020-11-10T17:05:37.001960",
"name": "Comma Separated Values File",
"no_real_name": true,
"package_id": "f468fe8a-a319-464f-9374-f77128ffc9dc",
"url": "https://data.cityofnewyork.us/api/views/uip8-fykc/rows.csv?accessType=DOWNLOAD",
...
},
{
"created": "2020-11-10T17:05:37.001970",
"describedBy": "https://data.cityofnewyork.us/api/views/uip8-fykc/columns.rdf",
"describedByType": "application/rdf+xml",
"description": "",
"format": "RDF",
"id": "5c137f71-4e20-49c5-bd45-a562952195fe",
"metadata_modified": "2020-11-10T17:05:37.001970",
"name": "RDF File",
"package_id": "f468fe8a-a319-464f-9374-f77128ffc9dc",
"url": "https://data.cityofnewyork.us/api/views/uip8-fykc/rows.rdf?accessType=DOWNLOAD",
...
},
...
],
"tags": [
{
"display_name": "arrest",
"id": "a76dff3f-cba8-42b4-ab51-1aceb059d16f",
"name": "arrest",
"state": "active",
"vocabulary_id": null
},
{
"display_name": "crime",
"id": "df442823-c823-4890-8fca-805427bd8dd9",
"name": "crime",
"state": "active",
"vocabulary_id": null
},
...
],
"relationships_as_subject": [],
"relationships_as_object": [],
...
}
```
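
Because `extras` is a list of key/value pairs rather than a mapping, it can be convenient to flatten it before looking up instance-specific keys. The sketch below is illustrative only; `extras_to_dict` is a hypothetical helper and not part of the toolkit.

```python
from typing import Any


def extras_to_dict(package: dict[str, Any]) -> dict[str, Any]:
    """Collapse a package's `extras` list of {"key", "value"} pairs into one dict."""
    return {
        item["key"]: item.get("value")
        for item in package.get("extras", [])
        if "key" in item
    }


# A cut-down package using values from the example return above.
sample_package = {
    "title": "NYPD Arrest Data (Year to Date)",
    "extras": [
        {"key": "accessLevel", "value": "public"},
        {"key": "publisher", "value": "data.cityofnewyork.us"},
    ],
}

extras = extras_to_dict(sample_package)
print(extras.get("publisher"))
# Custom keys differ between instances, so always supply a default.
print(extras.get("landingPage", "n/a"))
```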

---

`ckan_package_search(base_url: str, query: Optional[str], rows: Optional[int], start: Optional[int], **kwargs) -> list[dict[str, Any]]`

Searches for packages (datasets) in a CKAN data portal that satisfy the given search criteria.

### Parameters

* **base_url** - The base URL to search from. e.g. "https://catalog.data.gov/"
* **query (optional)** - The keyword string to search for. e.g. "police". Leaving empty will return all packages in the package list. Multi-word searches should be done with double quotes around the search term. For example, '"calls for service"' will return packages with the term "calls for service" while 'calls for service' will return packages with either "calls", "for", or "service" as keywords.
* **rows (optional)** - The maximum number of results to return. Leaving empty will return all results.
* **start (optional)** - Which result number to start at. Leaving empty will start at the first result.
* **kwargs (optional)** - Additional keyword arguments. For more information on acceptable keyword arguments and their function see <https://docs.ckan.org/en/2.10/api/index.html#ckan.logic.action.get.package_search>

### Return

The function returns a list of dictionaries containing matching package results.
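
As a rough usage sketch, the quoting rule for multi-word searches can be applied before calling the function. `as_phrase_query` and `search_portal` are hypothetical helpers written for this example; the call assumes `ckan_scraper_toolkit.py` is importable from the working directory and that the parameters can be passed by the names shown in the signature above.

```python
def as_phrase_query(term: str) -> str:
    """Wrap a multi-word term in double quotes so it is searched as a phrase."""
    term = term.strip()
    already_quoted = len(term) >= 2 and term.startswith('"') and term.endswith('"')
    if " " in term and not already_quoted:
        return f'"{term}"'
    return term


def search_portal(base_url: str, term: str, rows: int = 10, start: int = 0):
    """Run one search through the toolkit (needs ckan_scraper_toolkit and network)."""
    # Imported lazily so the quoting helper can be used without the toolkit installed.
    from ckan_scraper_toolkit import ckan_package_search

    return ckan_package_search(
        base_url=base_url, query=as_phrase_query(term), rows=rows, start=start
    )


print(as_phrase_query("calls for service"))
print(as_phrase_query("police"))
```

For example, `search_portal("https://catalog.data.gov/", "calls for service")` would search for the exact phrase instead of the three separate keywords.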

---

`ckan_package_search_from_organization(base_url: str, organization_id: str) -> list[dict[str, Any]]`

Returns a list of CKAN packages from an organization. Due to CKAN limitations, at most 10 packages can be returned.

### Parameters

* **base_url** - The base URL to search from. e.g. "https://catalog.data.gov/"
* **organization_id** - The ID of the organization. This can be retrieved by searching for a package and finding the "id" key in the "organization" key.

### Return

The function returns a list of dictionaries containing matching package results.
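
The organization ID lookup described above can be sketched as follows. `organization_id_of` is a hypothetical helper, and the sample package reuses values from the example return earlier in this document.

```python
from typing import Any, Optional


def organization_id_of(package: dict[str, Any]) -> Optional[str]:
    """Return the "id" stored under a package's "organization" key, if present."""
    # Treat a missing or null "organization" value as "no organization".
    organization = package.get("organization") or {}
    return organization.get("id")


sample_package = {
    "name": "nypd-arrest-data-year-to-date",
    "organization": {
        "id": "1149ee63-2fff-494e-82e5-9aace9d3b3bf",
        "name": "city-of-new-york",
    },
}

org_id = organization_id_of(sample_package)
print(org_id)
```

The resulting value is what would be passed as `organization_id` to `ckan_package_search_from_organization`.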

---

`ckan_group_package_show(base_url: str, id: str, limit: Optional[int]) -> list[dict[str, Any]]`

Returns a list of CKAN packages that belong to a particular group.

### Parameters

* **base_url** - The base URL of the CKAN portal. e.g. "https://catalog.data.gov/"
* **id** - The group's ID. This can be retrieved by searching for a package and finding the "id" key in the "groups" key.
* **limit (optional)** - The maximum number of results to return. Leaving empty will return all results.

### Return

The function returns a list of dictionaries representing the packages associated with the group.
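
A package can belong to several groups, so one way to use this function is to collect every group ID from a package result and fetch each group's packages in turn. This is only a sketch: `group_ids_of` and `packages_for_groups` are hypothetical helpers, and passing `limit=None` to mean "all results" is an assumption based on the parameter description above.

```python
from typing import Any, Optional


def group_ids_of(package: dict[str, Any]) -> list[str]:
    """Collect the "id" of every entry under a package's "groups" key."""
    return [group["id"] for group in package.get("groups", []) if "id" in group]


def packages_for_groups(
    base_url: str, package: dict[str, Any], limit: Optional[int] = None
) -> dict[str, list]:
    """Map each group ID to its packages (needs ckan_scraper_toolkit and network)."""
    # Imported lazily so group_ids_of can be used without the toolkit installed.
    from ckan_scraper_toolkit import ckan_group_package_show

    return {
        group_id: ckan_group_package_show(base_url=base_url, id=group_id, limit=limit)
        for group_id in group_ids_of(package)
    }


sample_package = {
    "groups": [{"id": "7d625e66-9e91-4b47-badd-44ec6f16b62b", "name": "local"}]
}
group_ids = group_ids_of(sample_package)
print(group_ids)
```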

---

`ckan_collection_search(base_url: str, collection_id: str) -> list[Package]`

Returns a list of CKAN packages that belong to a collection. When queried through the API, CKAN data portals are supposed to return relationships along with the rest of the data. However, in practice not all data portals are set up this way. Since child packages cannot be queried directly, they do not show up in any search results. To get around this, this function manually scrapes the information of all child packages related to the given parent.

*NOTE: This function has only been tested on <https://catalog.data.gov/>. It is likely it will not work properly on other platforms.*

### Parameters

* **base_url** - The base URL of the CKAN portal before the collection ID. e.g. "https://catalog.data.gov/dataset/"
* **collection_id** - The ID of the parent package. This can be found by querying the parent package and using the "id" key, or by navigating to the list of child packages and looking in the URL. e.g. In <https://catalog.data.gov/dataset/?collection_package_id=7b1d1941-b255-4596-89a6-99e1a33cc2d8> the collection_id is "7b1d1941-b255-4596-89a6-99e1a33cc2d8"

### Return

The function returns a list of Package objects representing the child packages associated with the collection.
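
Reading the collection ID out of a child-package listing URL can be automated with the standard library. The sketch below is illustrative; `collection_id_from_url` and `scrape_collection` are hypothetical helpers, and the toolkit call assumes `ckan_scraper_toolkit.py` is importable and accepts the parameter names shown in the signature above.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def collection_id_from_url(url: str) -> Optional[str]:
    """Extract the collection_package_id query parameter from a listing URL."""
    values = parse_qs(urlparse(url).query).get("collection_package_id")
    return values[0] if values else None


def scrape_collection(listing_url: str, base_url: str = "https://catalog.data.gov/dataset/"):
    """Fetch the child packages of a collection (needs ckan_scraper_toolkit and network)."""
    # Imported lazily so the URL helper can be used without the toolkit installed.
    from ckan_scraper_toolkit import ckan_collection_search

    collection_id = collection_id_from_url(listing_url)
    if collection_id is None:
        raise ValueError("URL has no collection_package_id parameter")
    return ckan_collection_search(base_url=base_url, collection_id=collection_id)


listing_url = (
    "https://catalog.data.gov/dataset/"
    "?collection_package_id=7b1d1941-b255-4596-89a6-99e1a33cc2d8"
)
collection_id = collection_id_from_url(listing_url)
print(collection_id)
```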