diff --git a/.github/workflows/common_crawler.yaml b/.github/workflows/common_crawler.yaml
index 734bb94d..52b4007d 100644
--- a/.github/workflows/common_crawler.yaml
+++ b/.github/workflows/common_crawler.yaml
@@ -23,17 +23,17 @@ jobs:
- name: Upgrade pip
run: python -m pip install --upgrade pip
- name: Install dependencies
- run: pip install -r common_crawler/requirements_common_crawler_action.txt
+ run: pip install -r source_collectors/common_crawler/requirements_common_crawler_action.txt
- name: Run script
- run: python common_crawler/main.py CC-MAIN-2024-10 *.gov police --config common_crawler/config.ini --pages 20
+ run: python source_collectors/common_crawler/main.py CC-MAIN-2024-10 *.gov police --config source_collectors/common_crawler/config.ini --pages 20
- name: Configure Git
run: |
git config --local user.email "action@github.com"
git config --local user.name "GitHub Action"
- name: Add common_crawler cache and common_crawler batch_info
run: |
- git add common_crawler/data/cache.json
- git add common_crawler/data/batch_info.csv
+ git add source_collectors/common_crawler/data/cache.json
+ git add source_collectors/common_crawler/data/batch_info.csv
- name: Commit changes
run: git commit -m "Update common_crawler cache and batch_info"
- name: Push changes
diff --git a/.github/workflows/python_checks.yml b/.github/workflows/python_checks.yml
index 4efaf15d..7f5bef91 100644
--- a/.github/workflows/python_checks.yml
+++ b/.github/workflows/python_checks.yml
@@ -18,5 +18,5 @@ jobs:
uses: reviewdog/action-flake8@v3
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
- flake8_args: --ignore E501,W291,W293,D401,D400,E402,E302,D200,D202,D205
+ flake8_args: --ignore E501,W291,W293,D401,D400,E402,E302,D200,D202,D205,W503,E203,D204,D403
level: warning
diff --git a/source_collectors/ckan/README.md b/source_collectors/ckan/README.md
new file mode 100644
index 00000000..be6c65cf
--- /dev/null
+++ b/source_collectors/ckan/README.md
@@ -0,0 +1,218 @@
+# CKAN Scraper
+
+## Introduction
+
+This scraper can be used to retrieve package information from [CKAN](https://ckan.org/), open source software that powers open data portals such as Data.gov's catalog ([catalog.data.gov](https://catalog.data.gov/)). CKAN API documentation can be found at https://docs.ckan.org/en/2.10/api/index.html.
+
+Running the scraper searches each configured portal for the defined search terms and outputs the matching packages to a CSV file.
+
+## Definitions
+
+* `Package` - Also called a dataset; a page containing relevant information about a dataset. The NYPD Arrest Data package shown in the sample API response below is one example.
+* `Collection` - A grouping of child packages related to a parent package. This is separate from a group.
+* `Group` - Also called a topic; a grouping of packages. Packages in a group do not have a parent package. Groups can also contain subgroups.
+* `Organization` - The entity that a package's data belongs to, such as "City of Austin" or "Department of Energy". Organization types are groups of organizations that share something in common with each other.
+
+## Files
+
+* `scrape_ckan_data_portals.py` - The main scraper file. Running this will execute a search across multiple CKAN instances and output the results to a CSV file.
+* `search_terms.py` - The search terms and CKAN portals to search from.
+* `ckan_scraper_toolkit.py` - Toolkit of functions that use ckanapi to retrieve packages from CKAN data portals.
+
+## Setup
+
+1. In a terminal, navigate to the CKAN scraper folder
+ ```cmd
+   cd source_collectors/ckan/
+ ```
+2. Create and activate a Python virtual environment
+ ```cmd
+ python -m venv venv
+ source venv/bin/activate
+ ```
+
+3. Install the requirements
+ ```cmd
+ pip install -r requirements.txt
+ ```
+4. Run the multi-portal CKAN scraper
+ ```cmd
+ python scrape_ckan_data_portals.py
+ ```
+5. Review the generated `results.csv` file.
+
+## How can I tell if a website I want to scrape is hosted using CKAN?
+
+There's no easy way to tell: some websites reference CKAN or link back to the CKAN documentation, while others do not. There also doesn't appear to be a comprehensive registry of CKAN instances.
+
+The best way to determine if a data catalog is using CKAN is to attempt to query its API. To do this:
+
+1. In a web browser, navigate to the website's data catalog (e.g. for data.gov this is at https://catalog.data.gov/dataset)
+2. Copy the first part of the link (e.g. https://catalog.data.gov/)
+3. Paste it in the browser's URL bar and add `api/3/action/package_search` to the end (e.g. https://catalog.data.gov/api/3/action/package_search). A CKAN portal will respond with JSON rather than an error page.
+
+*NOTE: Some hosts use a different base URL for API requests than for the catalog itself. For example, Canada's Open Government Portal serves its API from a separate address, as described in its [Access our API](https://open.canada.ca/en/access-our-application-programming-interface-api) page.*
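+
+If you'd rather script this check, a minimal sketch along these lines (assuming the `requests` package and a hypothetical `looks_like_ckan` helper) probes the endpoint and looks for CKAN's JSON envelope:
+
+```python
+import requests
+
+
+def looks_like_ckan(base_url: str) -> bool:
+    """Return True if base_url responds like a CKAN action API."""
+    endpoint = f"{base_url.rstrip('/')}/api/3/action/package_search"
+    response = requests.get(endpoint, timeout=30)
+    if not response.ok:
+        return False
+    try:
+        # CKAN action endpoints wrap results in {"success": true, "result": {...}}
+        return response.json().get("success", False)
+    except ValueError:
+        return False
+
+
+print(looks_like_ckan("https://catalog.data.gov/"))
+```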
+
+Another way to tell is by looking at the page layout. Most CKAN instances look much like one another: a sidebar on the left with search refinement options, a search box near the top below the page title, and a list of datasets to the right of the sidebar, among other similarities.
+
+## Documentation for ckan_scraper_toolkit.py
+
+### On ckanapi return data
+
+Across CKAN instances, the ckanapi return data largely follows the same layout. The key difference among instances is in the `extras` key, where an instance may define its own custom keys. An example ckanapi response is provided below, truncated to save space. This is the general layout returned by most of the toolkit's functions:
+
+```json
+{
+ "author": null,
+ "author_email": null,
+ "id": "f468fe8a-a319-464f-9374-f77128ffc9dc",
+ "maintainer": "NYC OpenData",
+ "maintainer_email": "no-reply@data.cityofnewyork.us",
+ "metadata_created": "2020-11-10T17:05:36.995577",
+ "metadata_modified": "2024-10-25T20:28:59.948113",
+ "name": "nypd-arrest-data-year-to-date",
+ "notes": "This is a breakdown of every arrest effected in NYC by the NYPD during the current year.\n This data is manually extracted every quarter and reviewed by the Office of Management Analysis and Planning. \n Each record represents an arrest effected in NYC by the NYPD and includes information about the type of crime, the location and time of enforcement. \nIn addition, information related to suspect demographics is also included. \nThis data can be used by the public to explore the nature of police enforcement activity. \nPlease refer to the attached data footnotes for additional information about this dataset.",
+ "organization": {
+ "id": "1149ee63-2fff-494e-82e5-9aace9d3b3bf",
+ "name": "city-of-new-york",
+ "title": "City of New York",
+ "description": "",
+ ...
+ },
+ "title": "NYPD Arrest Data (Year to Date)",
+ "extras": [
+ {
+ "key": "accessLevel",
+ "value": "public"
+ },
+ {
+ "key": "landingPage",
+ "value": "https://data.cityofnewyork.us/d/uip8-fykc"
+ },
+ {
+ "key": "publisher",
+ "value": "data.cityofnewyork.us"
+ },
+ ...
+ ],
+ "groups": [
+ {
+ "description": "Local Government Topic - for all datasets with state, local, county organizations",
+ "display_name": "Local Government",
+ "id": "7d625e66-9e91-4b47-badd-44ec6f16b62b",
+ "name": "local",
+ "title": "Local Government",
+ ...
+ }
+ ],
+ "resources": [
+ {
+ "created": "2020-11-10T17:05:37.001960",
+ "description": "",
+ "format": "CSV",
+ "id": "c48f1a1a-5efb-4266-9572-769ed1c9b472",
+ "metadata_modified": "2020-11-10T17:05:37.001960",
+ "name": "Comma Separated Values File",
+ "no_real_name": true,
+ "package_id": "f468fe8a-a319-464f-9374-f77128ffc9dc",
+ "url": "https://data.cityofnewyork.us/api/views/uip8-fykc/rows.csv?accessType=DOWNLOAD",
+ ...
+ },
+ {
+ "created": "2020-11-10T17:05:37.001970",
+ "describedBy": "https://data.cityofnewyork.us/api/views/uip8-fykc/columns.rdf",
+ "describedByType": "application/rdf+xml",
+ "description": "",
+ "format": "RDF",
+ "id": "5c137f71-4e20-49c5-bd45-a562952195fe",
+ "metadata_modified": "2020-11-10T17:05:37.001970",
+ "name": "RDF File",
+ "package_id": "f468fe8a-a319-464f-9374-f77128ffc9dc",
+ "url": "https://data.cityofnewyork.us/api/views/uip8-fykc/rows.rdf?accessType=DOWNLOAD",
+ ...
+ },
+ ...
+ ],
+ "tags": [
+ {
+ "display_name": "arrest",
+ "id": "a76dff3f-cba8-42b4-ab51-1aceb059d16f",
+ "name": "arrest",
+ "state": "active",
+ "vocabulary_id": null
+ },
+ {
+ "display_name": "crime",
+ "id": "df442823-c823-4890-8fca-805427bd8dd9",
+ "name": "crime",
+ "state": "active",
+ "vocabulary_id": null
+ },
+ ...
+ ],
+ "relationships_as_subject": [],
+ "relationships_as_object": [],
+ ...
+}
+```
+
+---
+`ckan_package_search(base_url: str, query: Optional[str], rows: Optional[int], start: Optional[int], **kwargs) -> list[dict[str, Any]]`
+
+Searches for packages (datasets) in a CKAN data portal that satisfy the given search criteria.
+
+### Parameters
+
+* **base_url** - The base URL to search from. e.g. "https://catalog.data.gov/"
+* **query (optional)** - The keyword string to search for. e.g. "police". Leaving empty will return all packages in the package list. Multi-word searches should be done with double quotes around the search term. For example, '"calls for service"' will return packages with the term "calls for service" while 'calls for service' will return packages with either "calls", "for", or "service" as keywords.
+* **rows (optional)** - The maximum number of results to return. Leaving empty will return all results.
+* **start (optional)** - Which result number to start at. Leaving empty will start at the first result.
+* **kwargs (optional)** - Additional keyword arguments. For more information on acceptable keyword arguments and their function, see https://docs.ckan.org/en/2.10/api/index.html#ckan.logic.action.get.package_search
+
+### Return
+
+The function returns a list of dictionaries containing matching package results.
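+
+For example, a quoted multi-word query capped at 50 results might look like this (a minimal sketch, assuming the toolkit module is importable from the working directory):
+
+```python
+from ckan_scraper_toolkit import ckan_package_search
+
+# Double quotes keep the phrase together; unquoted words are matched as separate keywords
+packages = ckan_package_search(
+    base_url="https://catalog.data.gov/",
+    query='"calls for service"',
+    rows=50,
+)
+print(len(packages), packages[0]["title"])
+```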
+
+---
+
+`ckan_package_search_from_organization(base_url: str, organization_id: str) -> list[dict[str, Any]]`
+
+Returns a list of CKAN packages from an organization. Due to CKAN limitations, at most 10 packages are returned.
+
+### Parameters
+
+* **base_url** - The base URL to search from. e.g. "https://catalog.data.gov/"
+* **organization_id** - The ID of the organization. This can be retrieved by searching for a package and finding the "id" key in the "organization" key.
+
+### Return
+
+The function returns a list of dictionaries containing matching package results.
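+
+For example, one way to look up an organization ID and then list that organization's packages (a sketch under the same import assumption as above):
+
+```python
+from ckan_scraper_toolkit import ckan_package_search, ckan_package_search_from_organization
+
+# Grab any package from the organization to read its "organization" -> "id" value
+seed_package = ckan_package_search("https://catalog.data.gov/", query="police", rows=1)[0]
+org_packages = ckan_package_search_from_organization(
+    base_url="https://catalog.data.gov/",
+    organization_id=seed_package["organization"]["id"],
+)
+```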
+
+---
+
+`ckan_group_package_show(base_url: str, id: str, limit: Optional[int]) -> list[dict[str, Any]]`
+
+Returns a list of CKAN packages that belong to a particular group.
+
+### Parameters
+
+* **base_url** - The base URL of the CKAN portal. e.g. "https://catalog.data.gov/"
+* **id** - The group's ID. This can be retrieved by searching for a package and finding the "id" key in the "groups" key.
+* **limit (optional)** - The maximum number of results to return. Leaving empty will return all results.
+
+### Return
+
+The function returns a list of dictionaries representing the packages associated with the group.
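+
+For example, using the "Local Government" group ID from the sample response above (a sketch, not a verified query):
+
+```python
+from ckan_scraper_toolkit import ckan_group_package_show
+
+group_packages = ckan_group_package_show(
+    base_url="https://catalog.data.gov/",
+    id="7d625e66-9e91-4b47-badd-44ec6f16b62b",  # "Local Government" group
+    limit=100,
+)
+```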
+
+---
+
+`ckan_collection_search(base_url: str, collection_id: str) -> list[Package]`
+
+Returns information for CKAN packages that belong to a collection. CKAN data portals are supposed to return package relationships along with the rest of the API data, but in practice not all portals are set up this way. Because child packages cannot be queried directly, they do not show up in any search results. To get around this, this function manually scrapes the information of every child package related to the given parent.
+
+*NOTE: This function has only been tested on https://catalog.data.gov/. It is likely it will not work properly on other portals.*
+
+### Parameters
+
+* **base_url** - The base URL of the CKAN portal before the collection ID. e.g. "https://catalog.data.gov/dataset/"
+* **collection_id** - The ID of the parent package. This can be found by querying the parent package and using the "id" key, or by navigating to the list of child packages and looking at the URL. e.g. in `https://catalog.data.gov/dataset/?collection_package_id=7b1d1941-b255-4596-89a6-99e1a33cc2d8`, the collection_id is "7b1d1941-b255-4596-89a6-99e1a33cc2d8"
+
+### Return
+
+List of Package objects representing the child packages associated with the collection.
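+
+For example, scraping the child packages of the collection ID mentioned above (a sketch; note that this makes many HTTP requests and is deliberately rate-limited by the toolkit):
+
+```python
+from ckan_scraper_toolkit import ckan_collection_search
+
+children = ckan_collection_search(
+    base_url="https://catalog.data.gov/dataset/",
+    collection_id="7b1d1941-b255-4596-89a6-99e1a33cc2d8",
+)
+print([child.title for child in children])
+```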
diff --git a/source_collectors/ckan/ckan_scraper_toolkit.py b/source_collectors/ckan/ckan_scraper_toolkit.py
new file mode 100644
index 00000000..b441c039
--- /dev/null
+++ b/source_collectors/ckan/ckan_scraper_toolkit.py
@@ -0,0 +1,206 @@
+"""Toolkit of functions that use ckanapi to retrieve packages from CKAN data portals"""
+
+from concurrent.futures import as_completed, ThreadPoolExecutor
+from dataclasses import dataclass, field
+from datetime import datetime
+import math
+import sys
+
+import time
+from typing import Any, Optional
+from urllib.parse import urljoin
+
+from bs4 import BeautifulSoup
+from ckanapi import RemoteCKAN
+import requests
+
+
+@dataclass
+class Package:
+ """
+ A class representing a CKAN package (dataset).
+ """
+ base_url: str = ""
+ url: str = ""
+ title: str = ""
+ agency_name: str = ""
+ description: str = ""
+ supplying_entity: str = ""
+ record_format: list = field(default_factory=lambda: [])
+ data_portal_type: str = ""
+ source_last_updated: str = ""
+
+ def to_dict(self):
+ """
+ Returns a dictionary representation of the package.
+ """
+ return {
+ "source_url": self.url,
+ "submitted_name": self.title,
+ "agency_name": self.agency_name,
+ "description": self.description,
+ "supplying_entity": self.supplying_entity,
+ "record_format": self.record_format,
+ "data_portal_type": self.data_portal_type,
+ "source_last_updated": self.source_last_updated,
+ }
+
+
+def ckan_package_search(
+ base_url: str,
+ query: Optional[str] = None,
+ rows: Optional[int] = sys.maxsize,
+ start: Optional[int] = 0,
+ **kwargs,
+) -> list[dict[str, Any]]:
+ """Performs a CKAN package (dataset) search from a CKAN data catalog URL.
+
+ :param base_url: Base URL to search from. e.g. "https://catalog.data.gov/"
+ :param query: Search string, defaults to None. None will return all packages.
+ :param rows: Maximum number of results to return, defaults to maximum integer.
+ :param start: Offsets the results, defaults to 0.
+ :param kwargs: See https://docs.ckan.org/en/2.10/api/index.html#ckan.logic.action.get.package_search for additional arguments.
+ :return: List of dictionaries representing the CKAN package search results.
+ """
+ remote = RemoteCKAN(base_url, get_only=True)
+ results = []
+ offset = start
+ rows_max = 1000 # CKAN's package search has a hard limit of 1000 packages returned at a time by default
+
+ while start < rows:
+ num_rows = rows - start + offset
+ packages = remote.action.package_search(
+ q=query, rows=num_rows, start=start, **kwargs
+ )
+ # Add the base_url to each package
+        for package in packages["results"]:
+            package.update(base_url=base_url)
+ results += packages["results"]
+
+ total_results = packages["count"]
+ if rows > total_results:
+ rows = total_results
+
+ result_len = len(packages["results"])
+ # Check if the website has a different rows_max value than CKAN's default
+ if result_len != rows_max and start + rows_max < total_results:
+ rows_max = result_len
+
+ start += rows_max
+
+ return results
+
+
+def ckan_package_search_from_organization(
+ base_url: str, organization_id: str
+) -> list[dict[str, Any]]:
+ """Returns a list of CKAN packages from an organization. Only 10 packages are able to be returned.
+
+ :param base_url: Base URL of the CKAN portal. e.g. "https://catalog.data.gov/"
+ :param organization_id: The organization's ID.
+ :return: List of dictionaries representing the packages associated with the organization.
+ """
+ remote = RemoteCKAN(base_url, get_only=True)
+ organization = remote.action.organization_show(
+ id=organization_id, include_datasets=True
+ )
+ packages = organization["packages"]
+ results = []
+
+ for package in packages:
+ query = f"id:{package['id']}"
+ results += ckan_package_search(base_url=base_url, query=query)
+
+ return results
+
+
+def ckan_group_package_show(
+ base_url: str, id: str, limit: Optional[int] = sys.maxsize
+) -> list[dict[str, Any]]:
+ """Returns a list of CKAN packages from a group.
+
+ :param base_url: Base URL of the CKAN portal. e.g. "https://catalog.data.gov/"
+ :param id: The group's ID.
+ :param limit: Maximum number of results to return, defaults to maximum integer.
+ :return: List of dictionaries representing the packages associated with the group.
+ """
+ remote = RemoteCKAN(base_url, get_only=True)
+ results = remote.action.group_package_show(id=id, limit=limit)
+ # Add the base_url to each package
+    for package in results:
+        package.update(base_url=base_url)
+ return results
+
+
+def ckan_collection_search(base_url: str, collection_id: str) -> list[Package]:
+ """Returns a list of CKAN packages from a collection.
+
+ :param base_url: Base URL of the CKAN portal before the collection ID. e.g. "https://catalog.data.gov/dataset/"
+ :param collection_id: The ID of the parent package.
+ :return: List of Package objects representing the packages associated with the collection.
+ """
+ packages = []
+ url = f"{base_url}?collection_package_id={collection_id}"
+ soup = _get_soup(url)
+
+ # Calculate the total number of pages of packages
+ num_results = int(soup.find(class_="new-results").text.split()[0].replace(",", ""))
+ pages = math.ceil(num_results / 20)
+
+ for page in range(1, pages + 1):
+ url = f"{base_url}?collection_package_id={collection_id}&page={page}"
+ soup = _get_soup(url)
+
+ with ThreadPoolExecutor(max_workers=10) as executor:
+ futures = [
+ executor.submit(
+ _collection_search_get_package_data, dataset_content, base_url
+ )
+ for dataset_content in soup.find_all(class_="dataset-content")
+ ]
+
+            for future in as_completed(futures):
+                packages.append(future.result())
+
+ # Take a break to avoid being timed out
+ if len(futures) >= 15:
+ time.sleep(10)
+
+ return packages
+
+
+def _collection_search_get_package_data(dataset_content, base_url: str):
+ """Parses the dataset content and returns a Package object."""
+ package = Package()
+ joined_url = urljoin(base_url, dataset_content.a.get("href"))
+ dataset_soup = _get_soup(joined_url)
+ # Determine if the dataset url should be the linked page to an external site or the current site
+ resources = dataset_soup.find("section", id="dataset-resources").find_all(
+ class_="resource-item"
+ )
+ button = resources[0].find(class_="btn-group")
+ if len(resources) == 1 and button is not None and button.a.text == "Visit page":
+ package.url = button.a.get("href")
+ else:
+ package.url = joined_url
+ package.data_portal_type = "CKAN"
+ package.base_url = base_url
+ package.title = dataset_soup.find(itemprop="name").text.strip()
+ package.agency_name = dataset_soup.find("h1", class_="heading").text.strip()
+ package.supplying_entity = dataset_soup.find(property="dct:publisher").text.strip()
+ package.description = dataset_soup.find(class_="notes").p.text
+ package.record_format = [
+ record_format.text.strip() for record_format in dataset_content.find_all("li")
+ ]
+ package.record_format = list(set(package.record_format))
+
+ date = dataset_soup.find(property="dct:modified").text.strip()
+    package.source_last_updated = datetime.strptime(date, "%B %d, %Y").strftime(
+        "%Y-%m-%d"
+    )
+
+ return package
+
+
+def _get_soup(url: str) -> BeautifulSoup:
+ """Returns a BeautifulSoup object for the given URL."""
+ time.sleep(1)
+ response = requests.get(url)
+ return BeautifulSoup(response.content, "lxml")
diff --git a/source_collectors/ckan/requirements.txt b/source_collectors/ckan/requirements.txt
new file mode 100644
index 00000000..fc41154b
--- /dev/null
+++ b/source_collectors/ckan/requirements.txt
@@ -0,0 +1,6 @@
+from_root
+ckanapi
+bs4
+lxml
+tqdm
+pandas
\ No newline at end of file
diff --git a/source_collectors/ckan/scrape_ckan_data_portals.py b/source_collectors/ckan/scrape_ckan_data_portals.py
new file mode 100644
index 00000000..2bb7733b
--- /dev/null
+++ b/source_collectors/ckan/scrape_ckan_data_portals.py
@@ -0,0 +1,287 @@
+"""Retrieves packages from CKAN data portals and parses relevant information then outputs to a CSV file"""
+
+from itertools import chain
+import sys
+from typing import Any, Callable, Optional
+
+from from_root import from_root
+import pandas as pd
+from tqdm import tqdm
+
+p = from_root("CONTRIBUTING.md").parent
+sys.path.insert(1, str(p))
+
+from source_collectors.ckan.ckan_scraper_toolkit import (
+ ckan_package_search,
+ ckan_group_package_show,
+ ckan_collection_search,
+ ckan_package_search_from_organization,
+ Package,
+)
+from search_terms import package_search, group_search, organization_search
+
+
+def perform_search(
+ search_func: Callable,
+ search_terms: list[dict[str, Any]],
+ results: list[dict[str, Any]],
+):
+ """Executes a search function with the given search terms.
+
+ :param search_func: The search function to execute.
+ :param search_terms: The list of urls and search terms.
+ :param results: The list of results.
+ :return: Updated list of results.
+ """
+ key = list(search_terms[0].keys())[1]
+ for search in tqdm(search_terms):
+ results += [search_func(search["url"], item) for item in search[key]]
+
+ return results
+
+
+def get_collection_child_packages(
+ results: list[dict[str, Any]]
+) -> list[dict[str, Any]]:
+ """Retrieves the child packages of each collection.
+
+ :param results: List of results.
+ :return: List of results containing child packages.
+ """
+ new_list = []
+
+ for result in tqdm(results):
+ if "extras" in result.keys():
+ collections = [
+ ckan_collection_search(
+ base_url="https://catalog.data.gov/dataset/",
+ collection_id=result["id"],
+ )
+ for extra in result["extras"]
+ if extra["key"] == "collection_metadata"
+ and extra["value"] == "true"
+ and not result["resources"]
+ ]
+
+ if collections:
+ new_list += collections[0]
+ continue
+
+ new_list.append(result)
+
+ return new_list
+
+
+def filter_result(result: dict[str, Any] | Package):
+ """Filters the result based on the defined criteria.
+
+ :param result: The result to filter.
+ :return: True if the result should be included, False otherwise.
+ """
+ if isinstance(result, Package) or "extras" not in result.keys():
+ return True
+
+ for extra in result["extras"]:
+ # Remove parent packages with no resources
+ if (
+ extra["key"] == "collection_metadata"
+ and extra["value"] == "true"
+ and not result["resources"]
+ ):
+ return False
+ # Remove non-public packages
+ elif extra["key"] == "accessLevel" and extra["value"] == "non-public":
+ return False
+
+ # Remove packages with no data or landing page
+ if len(result["resources"]) == 0:
+ landing_page = next(
+ (extra for extra in result["extras"] if extra["key"] == "landingPage"), None
+ )
+ if landing_page is None:
+ return False
+
+ return True
+
+
+def parse_result(result: dict[str, Any] | Package) -> dict[str, Any]:
+ """Retrieves the important information from the package.
+
+ :param result: The result to parse.
+ :return: The parsed result as a dictionary.
+ """
+ package = Package()
+
+    if isinstance(result, Package):
+        result.record_format = get_record_format_list(result)
+        return result.to_dict()
+
+ package.record_format = get_record_format_list(
+ package=package, resources=result["resources"]
+ )
+
+ package = get_source_url(result, package)
+ package.title = result["title"]
+ package.description = result["notes"]
+ package.agency_name = result["organization"]["title"]
+ package.supplying_entity = get_supplying_entity(result)
+ package.source_last_updated = result["metadata_modified"][0:10]
+
+ return package.to_dict()
+
+
+def get_record_format_list(
+ package: Package,
+ resources: Optional[list[dict[str, Any]]] = None,
+) -> list[str]:
+ """Retrieves the record formats from the package's resources.
+
+ :param package: The package to retrieve record formats from.
+ :param resources: The list of resources.
+ :return: List of record formats.
+ """
+ data_types = [
+ "CSV",
+ "PDF",
+ "XLS",
+ "XML",
+ "JSON",
+ "Other",
+ "RDF",
+ "GIS / Shapefile",
+ "HTML text",
+ "DOC / TXT",
+ "Video / Image",
+ ]
+ type_conversion = {
+ "XLSX": "XLS",
+ "Microsoft Excel": "XLS",
+ "KML": "GIS / Shapefile",
+ "GeoJSON": "GIS / Shapefile",
+ "application/vnd.geo+json": "GIS / Shapefile",
+ "ArcGIS GeoServices REST API": "GIS / Shapefile",
+ "Esri REST": "GIS / Shapefile",
+ "SHP": "GIS / Shapefile",
+ "OGC WMS": "GIS / Shapefile",
+ "QGIS": "GIS / Shapefile",
+ "gml": "GIS / Shapefile",
+ "WFS": "GIS / Shapefile",
+ "WMS": "GIS / Shapefile",
+ "API": "GIS / Shapefile",
+ "HTML": "HTML text",
+ "HTML page": "HTML text",
+ "": "HTML text",
+ "TEXT": "DOC / TXT",
+ "JPEG": "Video / Image",
+ "Api": "JSON",
+ "CSV downloads": "CSV",
+ "csv file": "CSV",
+ }
+
+ if resources is None:
+ resources = package.record_format
+ package.record_format = []
+
+ for resource in resources:
+ if isinstance(resource, str):
+ format = resource
+ else:
+ format = resource["format"]
+
+ # Is the format one of our conversion types?
+ if format in type_conversion.keys():
+ format = type_conversion[format]
+
+ # Add the format to the package's record format list if it's not already there and is a valid data type
+ if format not in package.record_format and format in data_types:
+ package.record_format.append(format)
+
+        if format not in data_types and "Other" not in package.record_format:
+            package.record_format.append("Other")
+
+ return package.record_format
+
+
+def get_source_url(result: dict[str, Any], package: Package) -> Package:
+ """Retrieves the source URL from the package's resources.
+
+ :param result: The result to retrieve source URL from.
+ :param package: The package to update with the source URL.
+ :return: The updated package.
+ """
+ # If there is only one resource available and it's a link
+ if len(result["resources"]) == 1 and package.record_format == ["HTML text"]:
+ # Use the link to the external page
+ package.url = result["resources"][0]["url"]
+ # If there are no resources available
+ elif len(result["resources"]) == 0:
+ # Use the dataset's external landing page
+        package.url = [
+            extra["value"]
+            for extra in result["extras"]
+            if extra["key"] == "landingPage"
+        ][0]
+ package.record_format = ["HTML text"]
+ else:
+ # Use the package's dataset information page
+ package.url = f"{result['base_url']}dataset/{result['name']}"
+ package.data_portal_type = "CKAN"
+
+ return package
+
+
+def get_supplying_entity(result: dict[str, Any]) -> str:
+ """Retrieves the supplying entity from the package's extras.
+
+ :param result: The result to retrieve supplying entity from.
+ :return: The supplying entity.
+ """
+ if "extras" not in result.keys():
+ return result["organization"]["title"]
+
+ for extra in result["extras"]:
+ if extra["key"] == "publisher":
+ return extra["value"]
+
+ return result["organization"]["title"]
+
+
+def main():
+ """
+ Main function.
+ """
+ results = []
+
+ print("Gathering results...")
+ results = perform_search(
+ search_func=ckan_package_search,
+ search_terms=package_search,
+ results=results,
+ )
+ results = perform_search(
+ search_func=ckan_group_package_show,
+ search_terms=group_search,
+ results=results,
+ )
+ results = perform_search(
+ search_func=ckan_package_search_from_organization,
+ search_terms=organization_search,
+ results=results,
+ )
+
+ flat_list = list(chain(*results))
+ # Deduplicate entries
+ flat_list = [i for n, i in enumerate(flat_list) if i not in flat_list[n + 1 :]]
+ print("\nRetrieving collections...")
+ flat_list = get_collection_child_packages(flat_list)
+
+ filtered_results = list(filter(filter_result, flat_list))
+ parsed_results = list(map(parse_result, filtered_results))
+
+ # Write to CSV
+ df = pd.DataFrame(parsed_results)
+ df.to_csv("results.csv")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/source_collectors/ckan/search_terms.py b/source_collectors/ckan/search_terms.py
new file mode 100644
index 00000000..716d68f7
--- /dev/null
+++ b/source_collectors/ckan/search_terms.py
@@ -0,0 +1,36 @@
+"""
+CKAN search terms
+"""
+
+package_search = [
+ {
+ "url": "https://catalog.data.gov/",
+ "terms": [
+ "police",
+ "crime",
+ "tags:(court courts court-cases criminal-justice-system law-enforcement law-enforcement-agencies)",
+ ],
+ },
+ {"url": "https://data.boston.gov/", "terms": ["police"]},
+ {"url": "https://open.jacksonms.gov/", "terms": ["tags:police"]},
+ {"url": "https://data.milwaukee.gov/", "terms": ["mpd", "wibr"]},
+ {"url": "https://data.sanantonio.gov/", "terms": ["sapd"]},
+ {"url": "https://data.sanjoseca.gov/", "terms": ["police"]},
+]
+
+group_search = [
+ {
+ "url": "https://data.birminghamal.gov/",
+ "ids": [
+ "3c648d96-0a29-4deb-aa96-150117119a23",
+ "92654c61-3a7d-484f-a146-257c0f6c55aa",
+ ],
+ },
+]
+
+organization_search = [
+ {
+ "url": "https://data.houstontx.gov/",
+ "ids": ["d6f4346d-f298-498d-b8dd-a4b95ee0846b"],
+ },
+]
diff --git a/common_crawler/README.md b/source_collectors/common_crawler/README.md
similarity index 100%
rename from common_crawler/README.md
rename to source_collectors/common_crawler/README.md
diff --git a/common_crawler/__init__.py b/source_collectors/common_crawler/__init__.py
similarity index 100%
rename from common_crawler/__init__.py
rename to source_collectors/common_crawler/__init__.py
diff --git a/common_crawler/argparser.py b/source_collectors/common_crawler/argparser.py
similarity index 65%
rename from common_crawler/argparser.py
rename to source_collectors/common_crawler/argparser.py
index 8cdf5b78..67f4a290 100644
--- a/common_crawler/argparser.py
+++ b/source_collectors/common_crawler/argparser.py
@@ -7,6 +7,7 @@
for the Common Crawler script.
"""
+
def valid_common_crawl_id(common_crawl_id: str) -> bool:
"""
Validate the Common Crawl ID format.
@@ -16,7 +17,8 @@ def valid_common_crawl_id(common_crawl_id: str) -> bool:
Returns:
True if the Common Crawl ID is valid, False otherwise
"""
- return re.match(r'CC-MAIN-\d{4}-\d{2}', common_crawl_id) is not None
+ return re.match(r"CC-MAIN-\d{4}-\d{2}", common_crawl_id) is not None
+
def parse_args() -> argparse.Namespace:
"""
@@ -33,22 +35,41 @@ def parse_args() -> argparse.Namespace:
"""
parser = argparse.ArgumentParser(
- description='Query the Common Crawl dataset and optionally save the results to a file.')
+ description="Query the Common Crawl dataset and optionally save the results to a file."
+ )
# Add the required arguments
- parser.add_argument('common_crawl_id', type=str, help='The Common Crawl ID')
- parser.add_argument('url', type=str, help='The URL to query')
- parser.add_argument('keyword', type=str, help='The keyword to search in the url')
+ parser.add_argument("common_crawl_id", type=str, help="The Common Crawl ID")
+ parser.add_argument("url", type=str, help="The URL to query")
+ parser.add_argument("keyword", type=str, help="The keyword to search in the url")
# Optional arguments for the number of pages and the output file, and a flag to reset the cache
- parser.add_argument('-c', '--config', type=str, default='config.ini', help='The configuration file to use')
- parser.add_argument('-p', '--pages', type=int, default=1, help='The number of pages to search (default: 1)')
- parser.add_argument('--reset-cache', action='store_true', default=False,
- help='Reset the cache before starting the crawl')
+ parser.add_argument(
+ "-c",
+ "--config",
+ type=str,
+ default="config.ini",
+ help="The configuration file to use",
+ )
+ parser.add_argument(
+ "-p",
+ "--pages",
+ type=int,
+ default=1,
+ help="The number of pages to search (default: 1)",
+ )
+ parser.add_argument(
+ "--reset-cache",
+ action="store_true",
+ default=False,
+ help="Reset the cache before starting the crawl",
+ )
args = parser.parse_args()
# Validate the Common Crawl ID format
if not valid_common_crawl_id(args.common_crawl_id):
- parser.error("Invalid Common Crawl ID format. Expected format is CC-MAIN-YYYY-WW.")
+ parser.error(
+ "Invalid Common Crawl ID format. Expected format is CC-MAIN-YYYY-WW."
+ )
# Read the configuration file
config = configparser.ConfigParser()
@@ -56,7 +77,7 @@ def parse_args() -> argparse.Namespace:
# Combine parsed arguments with configuration file defaults
app_parser = argparse.ArgumentParser(parents=[parser], add_help=False)
- app_parser.set_defaults(**config['DEFAULT'])
+ app_parser.set_defaults(**config["DEFAULT"])
app_args = app_parser.parse_args()
diff --git a/common_crawler/cache.py b/source_collectors/common_crawler/cache.py
similarity index 92%
rename from common_crawler/cache.py
rename to source_collectors/common_crawler/cache.py
index 2a48c0b7..23d58819 100644
--- a/common_crawler/cache.py
+++ b/source_collectors/common_crawler/cache.py
@@ -8,11 +8,13 @@
- CommonCrawlerCache: a class for managing the cache logic of Common Crawl search results
"""
+
class CommonCrawlerCacheManager:
"""
A class for managing the cache of Common Crawl search results.
This class is responsible for adding, retrieving, and saving cache data.
"""
+
def __init__(self, file_name: str = "cache", directory=None):
"""
Initializes the CacheStorage object with a file name and directory.
@@ -41,7 +43,6 @@ def upsert(self, index: str, url: str, keyword: str, last_page: int) -> None:
self.cache[index][url] = {}
self.cache[index][url][keyword] = last_page
-
def get(self, index, url, keyword) -> int:
"""
Retrieves a page number from the cache.
@@ -53,12 +54,15 @@ def get(self, index, url, keyword) -> int:
Returns: int - the last page crawled
"""
- if index in self.cache and url in self.cache[index] and keyword in self.cache[index][url]:
+ if (
+ index in self.cache
+ and url in self.cache[index]
+ and keyword in self.cache[index][url]
+ ):
return self.cache[index][url][keyword]
# The cache object does not exist. Return 0 as the default value.
return 0
-
def load_or_create_cache(self) -> dict:
"""
Loads the cache from the configured file path.
@@ -66,12 +70,11 @@ def load_or_create_cache(self) -> dict:
Returns: dict - the cache data
"""
try:
- with open(self.file_path, 'r') as file:
+ with open(self.file_path, "r") as file:
return json.load(file)
except FileNotFoundError:
return {}
-
def save_cache(self) -> None:
"""
Converts the cache object into a JSON-serializable format and saves it to the configured file path.
@@ -79,10 +82,9 @@ def save_cache(self) -> None:
persistence of crawl data across sessions.
"""
# Reformat cache data for JSON serialization
- with open(self.file_path, 'w') as file:
+ with open(self.file_path, "w") as file:
json.dump(self.cache, file, indent=4)
-
def reset_cache(self) -> None:
"""
Resets the cache to an empty state.
diff --git a/common_crawler/config.ini b/source_collectors/common_crawler/config.ini
similarity index 100%
rename from common_crawler/config.ini
rename to source_collectors/common_crawler/config.ini
diff --git a/common_crawler/crawler.py b/source_collectors/common_crawler/crawler.py
similarity index 66%
rename from common_crawler/crawler.py
rename to source_collectors/common_crawler/crawler.py
index 9afba7d8..b77e670f 100644
--- a/common_crawler/crawler.py
+++ b/source_collectors/common_crawler/crawler.py
@@ -16,9 +16,14 @@
# TODO: What happens when no results are found? How does the CommonCrawlerManager handle this?
-
@dataclass
class CommonCrawlResult:
+ """
+ A class to hold the results of a Common Crawl search.
+ Args:
+ last_page_search: the last page searched
+ url_results: the list of URLs found in the search
+ """
last_page_search: int
url_results: list[str]
@@ -31,16 +36,30 @@ class CommonCrawlerManager:
It validates crawl ids, manages pagination, and aggregates results.
"""
- def __init__(self, crawl_id='CC-MAIN-2023-50'):
+ def __init__(self, crawl_id="CC-MAIN-2023-50"):
+ """
+ Initializes the CommonCrawlerManager with a crawl ID.
+ Args:
+ crawl_id: the Common Crawl index to use
+ """
self.crawl_id = crawl_id
- CC_INDEX_SERVER = 'http://index.commoncrawl.org/'
- INDEX_NAME = f'{self.crawl_id}-index'
- self.root_url = f'{CC_INDEX_SERVER}{INDEX_NAME}'
+ CC_INDEX_SERVER = "http://index.commoncrawl.org/"
+ INDEX_NAME = f"{self.crawl_id}-index"
+ self.root_url = f"{CC_INDEX_SERVER}{INDEX_NAME}"
def crawl(self, search_term, keyword, start_page, num_pages) -> CommonCrawlResult:
+ """
+ Crawls the Common Crawl index for a given search term and keyword.
+ Args:
+ search_term: the term to search for
+ keyword: the keyword to search for
+ start_page: the page to start the search from
+ num_pages: the number of pages to search
+ """
print(
f"Searching for {keyword} on {search_term} in {self.crawl_id} for {num_pages} pages,"
- f" starting at page {start_page}")
+ f" starting at page {start_page}"
+ )
url_results = []
@@ -64,7 +83,9 @@ def crawl(self, search_term, keyword, start_page, num_pages) -> CommonCrawlResul
return CommonCrawlResult(last_page, url_results)
- def search_common_crawl_index(self, url: str, page: int = 0, max_retries: int = 20) -> list[dict]:
+ def search_common_crawl_index(
+ self, url: str, page: int = 0, max_retries: int = 20
+ ) -> list[dict]:
"""
This method is used to search the Common Crawl index for a given URL and page number
Args:
@@ -76,9 +97,9 @@ def search_common_crawl_index(self, url: str, page: int = 0, max_retries: int =
"""
encoded_url = quote_plus(url)
search_url = URLWithParameters(self.root_url)
- search_url.add_parameter('url', encoded_url)
- search_url.add_parameter('output', 'json')
- search_url.add_parameter('page', page)
+ search_url.add_parameter("url", encoded_url)
+ search_url.add_parameter("output", "json")
+ search_url.add_parameter("page", page)
retries = 0
delay = 1
@@ -90,7 +111,9 @@ def search_common_crawl_index(self, url: str, page: int = 0, max_retries: int =
return self.process_response(response, url, page)
retries += 1
- print(f"Rate limit exceeded. Retrying in {delay} second(s)... (Attempt {retries}/{max_retries})")
+ print(
+ f"Rate limit exceeded. Retrying in {delay} second(s)... (Attempt {retries}/{max_retries})"
+ )
time.sleep(delay)
print(f"Max retries exceeded. Failed to get records for {url} on page {page}.")
@@ -106,19 +129,24 @@ def make_request(self, search_url: str) -> requests.Response:
response.raise_for_status()
return response
except requests.exceptions.RequestException as e:
- if response.status_code == HTTPStatus.INTERNAL_SERVER_ERROR and 'SlowDown' in response.text:
+ if (
+ response.status_code == HTTPStatus.INTERNAL_SERVER_ERROR
+ and "SlowDown" in response.text
+ ):
return None
else:
print(f"Failed to get records: {e}")
return None
- def process_response(self, response: requests.Response, url: str, page: int) -> list[dict]:
+ def process_response(
+ self, response: requests.Response, url: str, page: int
+ ) -> list[dict]:
"""Processes the HTTP response and returns the parsed records if successful."""
if response.status_code == HTTPStatus.OK:
- records = response.text.strip().split('\n')
+ records = response.text.strip().split("\n")
print(f"Found {len(records)} records for {url} on page {page}")
return [json.loads(record) for record in records]
- elif 'First Page is 0, Last Page is 0' in response.text:
+ elif "First Page is 0, Last Page is 0" in response.text:
print("No records exist in index matching the url search term")
return None
else:
@@ -127,4 +155,7 @@ def process_response(self, response: requests.Response, url: str, page: int) ->
@staticmethod
def get_urls_with_keyword(records: list[dict], keyword) -> list[str]:
- return [record['url'] for record in records if keyword in record['url']]
+ """
+ Returns a list of URLs that contain the given keyword
+ """
+ return [record["url"] for record in records if keyword in record["url"]]
diff --git a/common_crawler/csv_manager.py b/source_collectors/common_crawler/csv_manager.py
similarity index 74%
rename from common_crawler/csv_manager.py
rename to source_collectors/common_crawler/csv_manager.py
index 69868629..5a80aeaa 100644
--- a/common_crawler/csv_manager.py
+++ b/source_collectors/common_crawler/csv_manager.py
@@ -10,12 +10,13 @@ class CSVManager:
Creates the file if it doesn't exist, and provides a method for adding new rows.
"""
- def __init__(
- self,
- file_name: str,
- headers: list[str],
- directory=None
- ):
+ def __init__(self, file_name: str, headers: list[str], directory=None):
+ """
+ Args:
+ file_name: the name of the CSV file
+ headers: the headers for the CSV file
+ directory: the directory to store the CSV file
+ """
self.file_path = get_file_path(f"{file_name}.csv", directory)
self.headers = headers
if not os.path.exists(self.file_path):
@@ -29,9 +30,9 @@ def add_row(self, row_values: list[str] | tuple[str]):
"""
if isinstance(row_values, str):
# Single values must be converted to a list format
- row_values = [row_values]
+ row_values = [row_values]
try:
- with open(self.file_path, mode='a', newline='', encoding='utf-8') as file:
+ with open(self.file_path, mode="a", newline="", encoding="utf-8") as file:
writer = csv.writer(file)
writer.writerow(row_values)
except Exception as e:
@@ -45,9 +46,7 @@ def add_rows(self, results: list[list[str]]) -> None:
Returns: None
"""
for result in results:
- self.add_row(
- result
- )
+ self.add_row(result)
print(f"{len(results)} URLs written to {self.file_path}")
def initialize_file(self):
@@ -59,15 +58,17 @@ def initialize_file(self):
file_exists = os.path.isfile(self.file_path)
if not file_exists:
- with open(self.file_path, mode='a', newline='', encoding='utf-8') as file:
+ with open(self.file_path, mode="a", newline="", encoding="utf-8") as file:
writer = csv.writer(file)
writer.writerow(self.headers)
else:
# Open and check that headers match
- with open(self.file_path, mode='r', encoding='utf-8') as file:
+ with open(self.file_path, mode="r", encoding="utf-8") as file:
header_row = next(csv.reader(file))
if header_row != self.headers:
- raise ValueError(f"Header row in {self.file_path} does not match expected headers")
+ raise ValueError(
+ f"Header row in {self.file_path} does not match expected headers"
+ )
print(f"CSV file initialized at {self.file_path}")
def delete_file(self):
diff --git a/common_crawler/data/cache.json b/source_collectors/common_crawler/data/cache.json
similarity index 100%
rename from common_crawler/data/cache.json
rename to source_collectors/common_crawler/data/cache.json
diff --git a/common_crawler/data/urls.csv b/source_collectors/common_crawler/data/urls.csv
similarity index 100%
rename from common_crawler/data/urls.csv
rename to source_collectors/common_crawler/data/urls.csv
diff --git a/common_crawler/main.py b/source_collectors/common_crawler/main.py
similarity index 81%
rename from common_crawler/main.py
rename to source_collectors/common_crawler/main.py
index ae27f556..a83b0aee 100644
--- a/common_crawler/main.py
+++ b/source_collectors/common_crawler/main.py
@@ -10,7 +10,7 @@
# The below code sets the working directory to be the root of the entire repository
# This is done to solve otherwise quite annoying import issues.
-sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
+sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))
from util.huggingface_api_manager import HuggingFaceAPIManager
from util.miscellaneous_functions import get_filename_friendly_timestamp
@@ -28,6 +28,9 @@
@dataclasses.dataclass
class BatchInfo:
+ """
+ Dataclass for batch info
+ """
datetime: str
source: str
count: str
@@ -35,30 +38,40 @@ class BatchInfo:
notes: str
filename: str
+
class LabelStudioError(Exception):
"""Custom exception for Label Studio Errors"""
+
pass
-BATCH_HEADERS = ['Datetime', 'Source', 'Count', 'Keywords', 'Notes', 'Filename']
+
+BATCH_HEADERS = ["Datetime", "Source", "Count", "Keywords", "Notes", "Filename"]
+
def get_current_time():
+ """
+ Returns the current time
+ """
return str(datetime.now())
-def add_batch_info_to_csv(common_crawl_result: CommonCrawlResult, args: argparse.Namespace, last_page: int) -> BatchInfo:
+def add_batch_info_to_csv(
+ common_crawl_result: CommonCrawlResult, args: argparse.Namespace, last_page: int
+) -> BatchInfo:
+ """
+ Adds batch info to CSV
+ """
batch_info = BatchInfo(
datetime=get_current_time(),
source="Common Crawl",
count=str(len(common_crawl_result.url_results)),
keywords=f"{args.url} - {args.keyword}",
notes=f"{args.common_crawl_id}, {args.pages} pages, starting at {last_page + 1}",
- filename=f"{args.output_filename}_{get_filename_friendly_timestamp()}"
+ filename=f"{args.output_filename}_{get_filename_friendly_timestamp()}",
)
batch_info_csv_manager = CSVManager(
- file_name='batch_info',
- directory=args.data_dir,
- headers=BATCH_HEADERS
+ file_name="batch_info", directory=args.data_dir, headers=BATCH_HEADERS
)
batch_info_csv_manager.add_row(dataclasses.astuple(batch_info))
@@ -66,17 +79,19 @@ def add_batch_info_to_csv(common_crawl_result: CommonCrawlResult, args: argparse
def main():
+ """
+ Main function
+ """
# Parse the arguments
args = parse_args()
# Initialize the Cache
cache_manager = CommonCrawlerCacheManager(
- file_name=args.cache_filename,
- directory=args.data_dir
+ file_name=args.cache_filename, directory=args.data_dir
)
load_dotenv()
-
+
# Initialize the HuggingFace API Manager
hf_access_token = os.getenv("HUGGINGFACE_ACCESS_TOKEN")
if not hf_access_token:
@@ -84,10 +99,10 @@ def main():
"HUGGINGFACE_ACCESS_TOKEN not accessible in .env file in root directory. "
"Please obtain access token from your personal account at "
"https://huggingface.co/settings/tokens and ensure you have write access to "
- "https://huggingface.co/PDAP. Then include in .env file in root directory.")
+ "https://huggingface.co/PDAP. Then include in .env file in root directory."
+ )
huggingface_api_manager = HuggingFaceAPIManager(
- access_token=hf_access_token,
- repo_id=args.huggingface_repo_id
+ access_token=hf_access_token, repo_id=args.huggingface_repo_id
)
ls_access_token = os.getenv("LABEL_STUDIO_ACCESS_TOKEN")
if not ls_access_token:
@@ -95,13 +110,15 @@ def main():
"LABEL_STUDIO_ACCESS_TOKEN not accessible in .env file in root directory. "
"Please obtain access token from your personal account at "
"https://app.heartex.com/user/account and ensure you have read access to "
- "https://app.heartex.com/projects/61550. Then include in .env file in root directory.")
+ "https://app.heartex.com/projects/61550. Then include in .env file in root directory."
+ )
ls_project_id = os.getenv("LABEL_STUDIO_PROJECT_ID")
if not ls_project_id:
raise ValueError(
"LABEL_STUDIO_PROJECT_ID not accessible in .env file in root directory. "
"Please obtain a project ID by navigating to the Label Studio project "
- "where it will be visibile in the url. Then include in .env file in root directory.")
+ "where it will be visibile in the url. Then include in .env file in root directory."
+ )
try:
print("Retrieving Label Studio data for deduplication")
@@ -119,7 +136,9 @@ def main():
try:
# Retrieve the last page from the cache, or 0 if it does not exist
last_page = cache_manager.get(args.common_crawl_id, args.url, args.keyword)
- common_crawl_result = process_crawl_and_upload(args, last_page, huggingface_api_manager, label_studio_results)
+ common_crawl_result = process_crawl_and_upload(
+ args, last_page, huggingface_api_manager, label_studio_results
+ )
except ValueError as e:
print(f"Error during crawling: {e}")
return
@@ -129,12 +148,14 @@ def main():
index=args.common_crawl_id,
url=args.url,
keyword=args.keyword,
- last_page=common_crawl_result.last_page_search)
+ last_page=common_crawl_result.last_page_search,
+ )
cache_manager.save_cache()
except ValueError as e:
print(f"Error while saving cache manager: {e}")
+
def handle_remote_results_error(remote_results):
"""
Handles errors in the remote results
@@ -151,6 +172,7 @@ def handle_remote_results_error(remote_results):
else:
raise LabelStudioError(f"Unexpected error: {remote_results}")
+
def validate_remote_results(remote_results):
"""
Validates the remote results retrieved from the Label Studio project
@@ -166,7 +188,9 @@ def validate_remote_results(remote_results):
print("No data in Label Studio project.")
return []
elif "url" not in remote_results[0]["data"]:
- raise LabelStudioError("Column 'url' not present in Label Studio project. Exiting...")
+ raise LabelStudioError(
+ "Column 'url' not present in Label Studio project. Exiting..."
+ )
else:
return remote_results
elif isinstance(remote_results, dict):
@@ -174,6 +198,7 @@ def validate_remote_results(remote_results):
else:
raise LabelStudioError("Unexpected response type.")
+
def get_ls_data() -> list[dict] | None:
"""Retrieves data from a Label Studio project to be used in deduplication of common crawl results.
@@ -190,14 +215,14 @@ def get_ls_data() -> list[dict] | None:
def strip_url(url: str) -> str:
- """Strips http(s)://www. from the beginning of a url if applicable.
+ """Strips http(s)://www. from the beginning of a url if applicable.
Args:
url (str): The URL to strip.
Returns:
str: The stripped URL.
- """
+ """
result = re.search(r"^(?:https?://)?(?:www\.)?(.*)$", url).group(1)
return result
@@ -210,7 +235,7 @@ def remove_local_duplicates(url_results: list[str]) -> list[str]:
Returns:
list[str]: List of unique URLs.
- """
+ """
stripped_url_results = [strip_url(url) for url in url_results]
unique_urls = collections.deque()
adjust = 0
@@ -225,7 +250,9 @@ def remove_local_duplicates(url_results: list[str]) -> list[str]:
return url_results
-def remove_remote_duplicates(url_results: list[str], label_studio_data: list[dict]) -> list[str]:
+def remove_remote_duplicates(
+ url_results: list[str], label_studio_data: list[dict]
+) -> list[str]:
"""Removes URLs from a list that are already present in the Label Studio project, ignoring http(s)://www.
Args:
@@ -238,7 +265,9 @@ def remove_remote_duplicates(url_results: list[str], label_studio_data: list[dic
try:
remote_urls = [strip_url(task["data"]["url"]) for task in label_studio_data]
except TypeError:
- print("Invalid Label Studio credentials. Database could not be checked for duplicates.")
+ print(
+ "Invalid Label Studio credentials. Database could not be checked for duplicates."
+ )
return url_results
remote_urls = set(remote_urls)
@@ -254,10 +283,11 @@ def remove_remote_duplicates(url_results: list[str], label_studio_data: list[dic
def handle_csv_and_upload(
- common_crawl_result: CommonCrawlResult,
- huggingface_api_manager: HuggingFaceAPIManager,
- args: argparse.Namespace,
- last_page: int):
+ common_crawl_result: CommonCrawlResult,
+ huggingface_api_manager: HuggingFaceAPIManager,
+ args: argparse.Namespace,
+ last_page: int,
+):
"""
Handles the CSV file and uploads it to Hugging Face repository.
Args:
@@ -270,29 +300,30 @@ def handle_csv_and_upload(
batch_info = add_batch_info_to_csv(common_crawl_result, args, last_page)
csv_manager = CSVManager(
- file_name=batch_info.filename,
- headers=['url'],
- directory=args.data_dir
+ file_name=batch_info.filename, headers=["url"], directory=args.data_dir
)
csv_manager.add_rows(common_crawl_result.url_results)
huggingface_api_manager.upload_file(
local_file_path=csv_manager.file_path,
- repo_file_path=f"{args.output_filename}/{csv_manager.file_path.name}"
+ repo_file_path=f"{args.output_filename}/{csv_manager.file_path.name}",
)
print(
- f"Uploaded file to Hugging Face repo {huggingface_api_manager.repo_id} at {args.output_filename}/{csv_manager.file_path.name}")
+ f"Uploaded file to Hugging Face repo {huggingface_api_manager.repo_id} at {args.output_filename}/{csv_manager.file_path.name}"
+ )
csv_manager.delete_file()
def process_crawl_and_upload(
- args: argparse.Namespace,
- last_page: int,
- huggingface_api_manager: HuggingFaceAPIManager,
- label_studio_data: list[dict]) -> CommonCrawlResult:
+ args: argparse.Namespace,
+ last_page: int,
+ huggingface_api_manager: HuggingFaceAPIManager,
+ label_studio_data: list[dict],
+) -> CommonCrawlResult:
+ """
+ Processes a crawl and uploads the results to Hugging Face.
+ """
# Initialize the CommonCrawlerManager
- crawler_manager = CommonCrawlerManager(
- args.common_crawl_id
- )
+ crawler_manager = CommonCrawlerManager(args.common_crawl_id)
# Determine the pages to search, based on the last page searched
start_page = last_page + 1
# Use the parsed arguments
@@ -300,7 +331,7 @@ def process_crawl_and_upload(
search_term=args.url,
keyword=args.keyword,
num_pages=args.pages,
- start_page=start_page
+ start_page=start_page,
)
# Logic should conclude here if no results are found
if not common_crawl_result.url_results:
@@ -309,10 +340,16 @@ def process_crawl_and_upload(
return common_crawl_result
print("Removing urls already in the database")
- common_crawl_result.url_results = remove_local_duplicates(common_crawl_result.url_results)
- common_crawl_result.url_results = remove_remote_duplicates(common_crawl_result.url_results, label_studio_data)
+ common_crawl_result.url_results = remove_local_duplicates(
+ common_crawl_result.url_results
+ )
+ common_crawl_result.url_results = remove_remote_duplicates(
+ common_crawl_result.url_results, label_studio_data
+ )
if not common_crawl_result.url_results:
- print("No urls not already present in the database found. Ceasing main execution.")
+ print(
+ "No urls not already present in the database found. Ceasing main execution."
+ )
add_batch_info_to_csv(common_crawl_result, args, last_page)
return common_crawl_result
diff --git a/common_crawler/requirements_common_crawler_action.txt b/source_collectors/common_crawler/requirements_common_crawler_action.txt
similarity index 100%
rename from common_crawler/requirements_common_crawler_action.txt
rename to source_collectors/common_crawler/requirements_common_crawler_action.txt
diff --git a/common_crawler/utils.py b/source_collectors/common_crawler/utils.py
similarity index 68%
rename from common_crawler/utils.py
rename to source_collectors/common_crawler/utils.py
index 0848b023..8023d50d 100644
--- a/common_crawler/utils.py
+++ b/source_collectors/common_crawler/utils.py
@@ -9,14 +9,23 @@ class URLWithParameters:
"""
def __init__(self, url):
+ """
+ Initialize the URLWithParameters object with the given URL
+ """
self.url = url
def add_parameter(self, parameter, value):
- if '?' in self.url:
+ """
+ Add a parameter to the URL
+ """
+ if "?" in self.url:
self.url += f"&{parameter}={value}"
else:
self.url += f"?{parameter}={value}"
return self.url
def __str__(self):
+ """
+ Return the URL
+ """
return self.url
diff --git a/source_collectors/muckrock/.gitignore b/source_collectors/muckrock/.gitignore
new file mode 100644
index 00000000..3ad8c498
--- /dev/null
+++ b/source_collectors/muckrock/.gitignore
@@ -0,0 +1,229 @@
+# Project specific
+/Counties/Florida/Bay County/Scraper/attachments/*
+/Counties/Florida/Bay County/Scraper/captcha/correct/*
+/Counties/Florida/Bay County/Scraper/captcha/incorrect/*
+/scrapers_library/CA/san_bernardino_county/data
+
+# Ignore dolt repos (cloned from ETL)
+**/datasets
+**/data-intake
+
+# Python gitignore from: https://github.com/github/gitignore/blob/master/Python.gitignore
+
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+# For a library or package, you might want to ignore these files since the code is
+# intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+# However, in case of collaboration, if having platform-specific dependencies or dependencies
+# having no cross-platform support, pipenv may install dependencies that don't work, or not
+# install all needed dependencies.
+#Pipfile.lock
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# Vim temp files
+## swap
+[._]*.s[a-v][a-z]
+[._]*.sw[a-p]
+[._]s[a-v][a-z]
+[._]sw[a-p]
+## session
+Session.vim
+## temporary
+.netrwhist
+*~
+
+# OS generated files
+.DS_Store
+.DS_Store?
+._*
+.Spotlight-V100
+.Trashes
+ehthumbs.db
+Thumbs.db
+
+# IDE generated files
+.idea
+
+# Emacs temp files
+\#*\#
+/.emacs.desktop
+/.emacs.desktop.lock
+*.elc
+auto-save-list
+tramp
+.\#*
+
+## Org-mode
+.org-id-locations
+*_archive
+!incident_blotter_archive/
+
+## flymake-mode
+*_flymake.*
+
+## eshell files
+/eshell/history
+/eshell/lastdir
+
+## elpa packages
+/elpa/
+
+## reftex files
+*.rel
+
+## AUCTeX auto folder
+/auto/
+
+## cask packages
+.cask/
+dist/
+
+## Flycheck
+flycheck_*.el
+
+## server auth directory
+/server/
+
+## projectiles files
+.projectile
+
+## directory configuration
+.dir-locals.el
+
+.vscode
+/.vscode
+
+*.db
+*.json
+*.csv
+/csv
+last_page_fetched.txt
diff --git a/source_collectors/muckrock/README.md b/source_collectors/muckrock/README.md
new file mode 100644
index 00000000..d74b77f0
--- /dev/null
+++ b/source_collectors/muckrock/README.md
@@ -0,0 +1,90 @@
+# Muckrock Toolkit
+
+## Description
+
+This directory provides tools for searching MuckRock FOIA requests. It includes scripts for downloading data from MuckRock, generating CSV files per PDAP database requirements, and automatically labeling record types.
+
+## Installation
+
+### 1. Clone the `scrapers` repository and navigate to the `muckrock_tools` directory.
+
+```
+git clone git@github.com:Police-Data-Accessibility-Project/scrapers.git
+cd scrapers/scrapers_library/data_portals/muckrock/muckrock_tools
+```
+
+### 2. Create a virtual environment.
+
+If you don't already have virtualenv, install the package:
+
+```
+pip install virtualenv
+```
+
+Then run the following command to create a virtual environment (ensure the Python version matches the one shown below):
+
+```
+virtualenv -p python3.12 venv
+```
+
+### 3. Activate the virtual environment.
+
+```
+source venv/bin/activate
+```
+
+### 4. Install dependencies.
+
+```
+pip install -r requirements.txt
+```
+
+## Uses
+
+### 1. Simple Search Term
+
+- `muck_get.py`
+- A script that searches MuckRock's database by matching a search string against request titles. The search is slow due to rate limiting (multithreading cannot work around it). A usage sketch follows below.
+
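+A minimal sketch of a run (the search string is hard-coded as `use of force`; edit `search_string` in the script to change it):
+
+```
+python3 muck_get.py
+```
+
+Matching requests are dumped to `use_of_force.json`.
+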
+### 2. Clone Muckrock database & search locally
+
+- ~~`download_muckrock_foia.py`, `search_local_foia_json.py`~~ (deprecated)
+
+- Scripts to clone the MuckRock FOIA request collection for fast local querying (total size <2 GB at present).
+
+- `create_foia_data_db.py` creates and populates a SQLite database (`foia_data.db`) with all MuckRock FOIA requests. Various errors outside the scope of this script may occur; a counter (`last_page_fetched.txt`) is created to keep track of the most recent page fetched and inserted into the database. If the program exits prematurely, simply run `create_foia_data_db.py` again to continue where you left off. A log file is created to capture errors for later reference.
+
+- After `foia_data.db` is created, run `search_foia_data_db.py`, which receives a search string as input and outputs a JSON file with all related FOIA requests for later processing by `generate_detailed_muckrock_csv.py`. For example,
+
+```
+python3 create_foia_data_db.py
+
+python3 search_foia_data_db.py --search_for "use of force"
+```
+
+produces `use_of_force.json`.
+
+### 3. County Level Search
+
+- `get_allegheny_foias.py`, `allegheny-county-towns.txt`
+- To search for any and all requests in a given county (Allegheny, in this case), you must provide a list of all municipalities contained within the county. MuckRock stores geographic information in tiers: federal, state, and local. At the local level, Pittsburgh and Allegheny County sit in the same tier, so there is no way to determine which municipalities reside within a county unless you provide the list yourself.
+
+The `get_allegheny_foias.py` script will find the jurisdiction ID for each municipality in `allegheny-county-towns.txt`, then find all completed FOIA requests for those jurisdictions, as sketched below.
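+
+A minimal sketch of a run (assuming `allegheny-county-towns.txt` is in the working directory):
+
+```
+python3 get_allegheny_foias.py
+```
+
+The combined results are saved to `foia_data_combined.json`.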
+
+### 4. Generate detailed FOIA data in PDAP database format
+
+- `generate_detailed_muckrock_csv.py`
+- Once you have a JSON file of relevant FOIA requests, run it through this script to generate a CSV that fulfills PDAP database requirements, as shown below.
+
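+A minimal sketch of a run, using the JSON file produced by `search_foia_data_db.py`:
+
+```
+python3 generate_detailed_muckrock_csv.py --json_file use_of_force.json
+```
+
+This writes the rows to `use-of-force.csv` (the output filename is derived from the JSON filename).
+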
+### 5. ML Labeling
+
+- `muckrock_ml_labeler.py`
+- A tool for automatically labeling MuckRock sources. This script uses [fine-url-classifier](https://huggingface.co/PDAP/fine-url-classifier) to assign one of 36 record type labels. At present, the script expects each source to have associated header tags, provided via `html-tag-collector/collector.py`. (TODO: For MuckRock sources, `collector.py` is insufficient; it does not grab the main text of the request.) A usage sketch follows below.
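+
+A minimal sketch of a run, where `tagged_sources.csv` is a hypothetical CSV produced by `collector.py` containing the `url_path`, `html_title`, and `h1` columns the script expects:
+
+```
+python3 muckrock_ml_labeler.py --csv_file tagged_sources.csv
+```
+
+The labeled results are written to `labeled_muckrock_dataset.csv`.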
diff --git a/source_collectors/muckrock/allegheny-county-towns.txt b/source_collectors/muckrock/allegheny-county-towns.txt
new file mode 100644
index 00000000..4588e164
--- /dev/null
+++ b/source_collectors/muckrock/allegheny-county-towns.txt
@@ -0,0 +1,61 @@
+Allegheny County
+Allison Park
+Bairdford
+Bakerstown
+Bethel Park
+Brackenridge
+Braddock
+Bradfordwoods
+Bridgeville
+Buena Vista
+Bunola
+Carnegie
+Cheswick
+Clairton
+Coraopolis
+Coulters
+Creighton
+Crescent
+Cuddy
+Curtisville
+Dravosburg
+Duquesne
+East McKeesport
+East Pittsburgh
+Elizabeth
+Gibsonia
+Glassport
+Glenshaw
+Greenock
+Harwick
+Homestead
+Imperial
+Indianola
+Ingomar
+Leetsdale
+McKees Rocks
+Mckeesport
+Monroeville
+Morgan
+Natrona Heights
+North Versailles
+Oakdale
+Oakmont
+Pitcairn
+Pittsburgh
+Presto
+Rural Ridge
+Russellton
+Sewickley
+South Park
+Springdale
+Sturgeon
+Tarentum
+Turtle Creek
+Verona
+Warrendale
+West Elizabeth
+West Mifflin
+Wexford
+Wildwood
+Wilmerding
diff --git a/source_collectors/muckrock/create_foia_data_db.py b/source_collectors/muckrock/create_foia_data_db.py
new file mode 100644
index 00000000..4adc5556
--- /dev/null
+++ b/source_collectors/muckrock/create_foia_data_db.py
@@ -0,0 +1,279 @@
+"""
+create_foia_data_db.py
+
+This script fetches data from the MuckRock FOIA API and stores it in a SQLite database.
+Run this prior to companion script `search_foia_data_db.py`.
+
+A successful run will output a SQLite database `foia_data.db` with one table `results`.
+The database will contain all FOIA requests available through MuckRock.
+
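+Usage (as described in the README):
+    python3 create_foia_data_db.py
+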
+Functions:
+ - create_db()
+ - fetch_page()
+ - transform_page_data()
+ - populate_db()
+ - main()
+
+Error Handling:
+Errors encountered during API requests or database operations are logged to an `errors.log` file
+and/or printed to the console.
+"""
+
+import requests
+import sqlite3
+import logging
+import os
+import json
+import time
+from typing import List, Tuple, Dict, Any, Union, Literal
+
+logging.basicConfig(
+ filename="errors.log", level=logging.ERROR, format="%(levelname)s: %(message)s"
+)
+
+
+base_url = "https://www.muckrock.com/api_v1/foia/"
+last_page_fetched = "last_page_fetched.txt"
+
+NO_MORE_DATA = -1 # flag for program exit
+JSON = Dict[str, Any] # type alias
+
+
+create_table_query = """
+ CREATE TABLE IF NOT EXISTS results (
+ id INTEGER PRIMARY KEY,
+ title TEXT,
+ slug TEXT,
+ status TEXT,
+ embargo_status TEXT,
+ user INTEGER,
+ username TEXT,
+ agency INTEGER,
+ datetime_submitted TEXT,
+ date_due TEXT,
+ days_until_due INTEGER,
+ date_followup TEXT,
+ datetime_done TEXT,
+ datetime_updated TEXT,
+ date_embargo TEXT,
+ tracking_id TEXT,
+ price TEXT,
+ disable_autofollowups BOOLEAN,
+ tags TEXT,
+ communications TEXT,
+ absolute_url TEXT
+ )
+ """
+
+
+foia_insert_query = """
+ INSERT INTO results (id, title, slug, status, embargo_status, user, username, agency,
+ datetime_submitted, date_due, days_until_due, date_followup,
+ datetime_done, datetime_updated, date_embargo, tracking_id,
+ price, disable_autofollowups, tags, communications, absolute_url)
+ VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+ """
+
+
+def create_db() -> bool:
+ """
+ Creates foia_data.db SQLite database with one table named `results`.
+
+ Returns:
+ bool: True, if database is successfully created; False otherwise.
+
+ Raises:
+ sqlite3.Error: If the table creation operation fails, prints error and returns False.
+ """
+
+ try:
+ with sqlite3.connect("foia_data.db") as conn:
+ conn.execute(create_table_query)
+ conn.commit()
+ print("Successfully created foia_data.db!")
+ return True
+ except sqlite3.Error as e:
+ print(f"SQLite error: {e}.")
+ logging.error(f"Failed to create foia_data.db due to SQLite error: {e}")
+ return False
+
+
+def fetch_page(page: int) -> Union[JSON, Literal[NO_MORE_DATA], None]:
+ """
+ Fetches a page of 100 results from the MuckRock FOIA API.
+
+ Args:
+ page (int): The page number to fetch from the API.
+
+ Returns:
+ Union[JSON, None, Literal[NO_MORE_DATA]]:
+ - JSON Dict[str, Any]: The response's JSON data, if the request is successful.
+ - NO_MORE_DATA (int = -1): A constant, if there are no more pages to fetch (indicated by a 404 response).
+ - None: If there is an error other than 404.
+ """
+
+ per_page = 100
+ response = requests.get(
+ base_url, params={"page": page, "page_size": per_page, "format": "json"}
+ )
+
+ if response.status_code == 200:
+ return response.json()
+ elif response.status_code == 404:
+ print("No more pages to fetch")
+ return NO_MORE_DATA # Typically 404 response will mean there are no more pages to fetch
+    elif 500 <= response.status_code < 600:
+        # On a server error, log it, skip this page, and try the next one
+        logging.error(f"Server error {response.status_code} on page {page}")
+        page = page + 1
+        return fetch_page(page)
+ else:
+ print(f"Error fetching page {page}: {response.status_code}")
+        logging.error(
+            f"Fetching page {page} failed with response code: {response.status_code}"
+        )
+ return None
+
+
+def transform_page_data(data_to_transform: JSON) -> List[Tuple[Any, ...]]:
+ """
+    Transforms the data received from the MuckRock FOIA API into a structured format for insertion into a database with `populate_db()`.
+
+ Transforms JSON input into a list of tuples, as well as serializes the nested `tags` and `communications` fields into JSON strings.
+
+ Args:
+ data_to_transform (JSON: Dict[str, Any]): The JSON data from the API response.
+
+ Returns:
+        transformed_data (List[Tuple[Any, ...]]): A list of tuples, where each tuple contains the fields of a single FOIA request.
+ """
+
+ transformed_data = []
+
+ for result in data_to_transform.get("results", []):
+ result["tags"] = json.dumps(result.get("tags", []))
+ result["communications"] = json.dumps(result.get("communications", []))
+
+ transformed_data.append(
+ (
+ result["id"],
+ result["title"],
+ result["slug"],
+ result["status"],
+ result["embargo_status"],
+ result["user"],
+ result["username"],
+ result["agency"],
+ result["datetime_submitted"],
+ result["date_due"],
+ result["days_until_due"],
+ result["date_followup"],
+ result["datetime_done"],
+ result["datetime_updated"],
+ result["date_embargo"],
+ result["tracking_id"],
+ result["price"],
+ result["disable_autofollowups"],
+ result["tags"],
+ result["communications"],
+ result["absolute_url"],
+ )
+ )
+ return transformed_data
+
+
+def populate_db(transformed_data: List[Tuple[Any, ...]], page: int) -> None:
+ """
+    Populates the foia_data.db SQLite database with the transformed FOIA request data.
+
+ Args:
+ transformed_data (List[Tuple[Any, ...]]): A list of tuples, where each tuple contains the fields of a single FOIA request.
+ page (int): The current page number for printing and logging errors.
+
+ Returns:
+ None
+
+ Raises:
+ sqlite3.Error: If the insertion operation fails, attempts to retry operation (max_retries = 2). If retries are
+ exhausted, logs error and exits.
+ """
+
+ with sqlite3.connect("foia_data.db") as conn:
+
+ retries = 0
+ max_retries = 2
+ while retries < max_retries:
+ try:
+ conn.executemany(foia_insert_query, transformed_data)
+ conn.commit()
+ print("Successfully inserted data!")
+ return
+ except sqlite3.Error as e:
+ print(f"SQLite error: {e}. Retrying...")
+ conn.rollback()
+ retries += 1
+ time.sleep(1)
+
+ if retries == max_retries:
+            print(
+                f"Failed to insert data from page {page} after {max_retries} attempts. Skipping to next page."
+            )
+            logging.error(
+                f"Failed to insert data from page {page} after {max_retries} attempts."
+            )
+
+
+def main() -> None:
+ """
+ Main entry point for create_foia_data_db.py.
+
+ This function orchestrates the process of fetching FOIA requests data from the MuckRock FOIA API, transforming it,
+ and storing it in a SQLite database.
+ """
+
+ if not os.path.exists("foia_data.db"):
+ print("Creating foia_data.db...")
+ success = create_db()
+        if not success:
+ print("Failed to create foia_data.db")
+ return
+
+ if os.path.exists(last_page_fetched):
+ with open(last_page_fetched, mode="r") as file:
+ page = int(file.read()) + 1
+ else:
+ page = 1
+
+ while True:
+
+ print(f"Fetching page {page}...")
+ page_data = fetch_page(page)
+
+ if page_data == NO_MORE_DATA:
+            break  # Exit the loop because no more data exists
+ if page_data is None:
+ print(f"Skipping page {page}...")
+ page += 1
+ continue
+
+ transformed_data = transform_page_data(page_data)
+
+ populate_db(transformed_data, page)
+
+ with open(last_page_fetched, mode="w") as file:
+ file.write(str(page))
+ page += 1
+
+ print("create_foia_data_db.py run finished")
+
+
+if __name__ == "__main__":
+ try:
+ main()
+ except Exception as e:
+ logging.error(f"An unexpected error occurred: {e}")
+ print(
+ "Check errors.log to review errors. Run create_foia_data_db.py again to continue"
+ )
diff --git a/source_collectors/muckrock/download_muckrock_foia.py b/source_collectors/muckrock/download_muckrock_foia.py
new file mode 100644
index 00000000..0abd527d
--- /dev/null
+++ b/source_collectors/muckrock/download_muckrock_foia.py
@@ -0,0 +1,58 @@
+"""
+***DEPRECATED***
+
+download_muckrock_foia.py
+
+This script fetches data from the MuckRock FOIA API and stores the results in a JSON file.
+
+"""
+
+import requests
+import csv
+import time
+import json
+
+# Define the base API endpoint
+base_url = "https://www.muckrock.com/api_v1/foia/"
+
+# Set initial parameters
+page = 1
+per_page = 100
+all_data = []
+output_file = "foia_data.json"
+
+
+def fetch_page(page):
+ """
+ Fetches data from a specific page of the MuckRock FOIA API.
+ """
+ response = requests.get(
+ base_url, params={"page": page, "page_size": per_page, "format": "json"}
+ )
+ if response.status_code == 200:
+ return response.json()
+ else:
+ print(f"Error fetching page {page}: {response.status_code}")
+ return None
+
+
+# Fetch and store data from all pages
+while True:
+ print(f"Fetching page {page}...")
+ data = fetch_page(page)
+ if data is None:
+ print(f"Skipping page {page}...")
+ page += 1
+ continue
+
+ all_data.extend(data["results"])
+ if not data["next"]:
+ break
+
+ page += 1
+
+# Write data to JSON
+with open(output_file, mode="w", encoding="utf-8") as json_file:
+ json.dump(all_data, json_file, indent=4)
+
+print(f"Data written to {output_file}")
diff --git a/source_collectors/muckrock/generate_detailed_muckrock_csv.py b/source_collectors/muckrock/generate_detailed_muckrock_csv.py
new file mode 100644
index 00000000..a077dbc7
--- /dev/null
+++ b/source_collectors/muckrock/generate_detailed_muckrock_csv.py
@@ -0,0 +1,182 @@
+import json
+import argparse
+import csv
+import requests
+import time
+from utils import format_filename_json_to_csv
+
+# Load the JSON data
+parser = argparse.ArgumentParser(description="Parse JSON from a file.")
+parser.add_argument(
+ "--json_file", type=str, required=True, help="Path to the JSON file"
+)
+
+args = parser.parse_args()
+
+with open(args.json_file, "r") as f:
+ json_data = json.load(f)
+
+# Define the CSV headers
+headers = [
+ "name",
+ "agency_described",
+ "record_type",
+ "description",
+ "source_url",
+ "readme_url",
+ "scraper_url",
+ "state",
+ "county",
+ "municipality",
+ "agency_type",
+ "jurisdiction_type",
+ "View Archive",
+ "agency_aggregation",
+ "agency_supplied",
+ "supplying_entity",
+ "agency_originated",
+ "originating_agency",
+ "coverage_start",
+ "source_last_updated",
+ "coverage_end",
+ "number_of_records_available",
+ "size",
+ "access_type",
+ "data_portal_type",
+ "access_notes",
+ "record_format",
+ "update_frequency",
+ "update_method",
+ "retention_schedule",
+ "detail_level",
+]
+
+
+def get_agency(agency_id):
+ """
+    Fetch agency data from the MuckRock API for the given agency ID (used for agency_described)
+ """
+ if agency_id:
+ agency_url = f"https://www.muckrock.com/api_v1/agency/{agency_id}/"
+ response = requests.get(agency_url)
+
+ if response.status_code == 200:
+ agency_data = response.json()
+ return agency_data
+ else:
+ return ""
+ else:
+ print("Agency ID not found in item")
+
+
+def get_jurisdiction(jurisdiction_id):
+ """
+    Fetch jurisdiction data from the MuckRock API for the given jurisdiction ID
+ """
+ if jurisdiction_id:
+ jurisdiction_url = (
+ f"https://www.muckrock.com/api_v1/jurisdiction/{jurisdiction_id}/"
+ )
+ response = requests.get(jurisdiction_url)
+
+ if response.status_code == 200:
+ jurisdiction_data = response.json()
+ return jurisdiction_data
+ else:
+ return ""
+ else:
+ print("Jurisdiction ID not found in item")
+
+
+output_csv = format_filename_json_to_csv(args.json_file)
+# Open a CSV file for writing
+with open(output_csv, "w", newline="") as csvfile:
+ writer = csv.DictWriter(csvfile, fieldnames=headers)
+
+ # Write the header row
+ writer.writeheader()
+
+ # Iterate through the JSON data
+ for item in json_data:
+ print(f"Writing data for {item.get('title')}")
+ agency_data = get_agency(item.get("agency"))
+ time.sleep(1)
+ jurisdiction_data = get_jurisdiction(agency_data.get("jurisdiction"))
+
+ jurisdiction_level = jurisdiction_data.get("level")
+        # federal jurisdiction level
+ if jurisdiction_level == "f":
+ state = ""
+ county = ""
+ municipality = ""
+ juris_type = "federal"
+ # state jurisdiction level
+ if jurisdiction_level == "s":
+ state = jurisdiction_data.get("name")
+ county = ""
+ municipality = ""
+ juris_type = "state"
+ # local jurisdiction level
+ if jurisdiction_level == "l":
+ parent_juris_data = get_jurisdiction(jurisdiction_data.get("parent"))
+ state = parent_juris_data.get("abbrev")
+ if "County" in jurisdiction_data.get("name"):
+ county = jurisdiction_data.get("name")
+ municipality = ""
+ juris_type = "county"
+ else:
+ county = ""
+ municipality = jurisdiction_data.get("name")
+ juris_type = "local"
+
+ if "Police" in agency_data.get("types"):
+ agency_type = "law enforcement/police"
+ else:
+ agency_type = ""
+
+ source_url = ""
+ absolute_url = item.get("absolute_url")
+ access_type = ""
+ for comm in item["communications"]:
+ if comm["files"]:
+ source_url = absolute_url + "#files"
+ access_type = "Web page,Download,API"
+ break
+
+ # Extract the relevant fields from the JSON object
+ csv_row = {
+ "name": item.get("title", ""),
+ "agency_described": agency_data.get("name", "") + " - " + state,
+ "record_type": "",
+ "description": "",
+ "source_url": source_url,
+ "readme_url": absolute_url,
+ "scraper_url": "",
+ "state": state,
+ "county": county,
+ "municipality": municipality,
+ "agency_type": agency_type,
+ "jurisdiction_type": juris_type,
+ "View Archive": "",
+ "agency_aggregation": "",
+ "agency_supplied": "no",
+ "supplying_entity": "MuckRock",
+ "agency_originated": "yes",
+ "originating_agency": agency_data.get("name", ""),
+ "coverage_start": "",
+ "source_last_updated": "",
+ "coverage_end": "",
+ "number_of_records_available": "",
+ "size": "",
+ "access_type": access_type,
+ "data_portal_type": "MuckRock",
+ "access_notes": "",
+ "record_format": "",
+ "update_frequency": "",
+ "update_method": "",
+ "retention_schedule": "",
+ "detail_level": "",
+ }
+
+ # Write the extracted row to the CSV file
+ writer.writerow(csv_row)
diff --git a/source_collectors/muckrock/get_allegheny_foias.py b/source_collectors/muckrock/get_allegheny_foias.py
new file mode 100644
index 00000000..a559f67f
--- /dev/null
+++ b/source_collectors/muckrock/get_allegheny_foias.py
@@ -0,0 +1,94 @@
+"""
+get_allegheny_foias.py
+
+"""
+import requests
+import json
+import time
+
+
+def fetch_jurisdiction_ids(town_file, base_url):
+ """
+ fetch jurisdiction IDs based on town names from a text file
+ """
+ with open(town_file, "r") as file:
+ town_names = [line.strip() for line in file]
+
+ jurisdiction_ids = {}
+ url = base_url
+
+ while url:
+ response = requests.get(url)
+ if response.status_code == 200:
+ data = response.json()
+ for item in data.get("results", []):
+ if item["name"] in town_names:
+ jurisdiction_ids[item["name"]] = item["id"]
+
+ url = data.get("next")
+ print(
+ f"Processed page, found {len(jurisdiction_ids)}/{len(town_names)} jurisdictions so far..."
+ )
+ time.sleep(1) # To respect the rate limit
+
+ elif response.status_code == 503:
+ print("Error 503: Skipping page")
+ break
+ else:
+ print(f"Error fetching data: {response.status_code}")
+ break
+
+ return jurisdiction_ids
+
+
+def fetch_foia_data(jurisdiction_ids):
+ """
+ fetch FOIA data for each jurisdiction ID and save it to a JSON file
+ """
+ all_data = []
+ for name, id_ in jurisdiction_ids.items():
+ url = f"https://www.muckrock.com/api_v1/foia/?status=done&jurisdiction={id_}"
+ while url:
+ response = requests.get(url)
+ if response.status_code == 200:
+ data = response.json()
+ all_data.extend(data.get("results", []))
+ url = data.get("next")
+ print(
+ f"Fetching records for {name}, {len(all_data)} total records so far..."
+ )
+ time.sleep(1) # To respect the rate limit
+ elif response.status_code == 503:
+ print(f"Error 503: Skipping page for {name}")
+ break
+ else:
+ print(f"Error fetching data: {response.status_code} for {name}")
+ break
+
+ # Save the combined data to a JSON file
+ with open("foia_data_combined.json", "w") as json_file:
+ json.dump(all_data, json_file, indent=4)
+
+ print(f"Saved {len(all_data)} records to foia_data_combined.json")
+
+
+def main():
+ """
+ Execute the script
+ """
+ town_file = "allegheny-county-towns.txt"
+ jurisdiction_url = (
+ "https://www.muckrock.com/api_v1/jurisdiction/?level=l&parent=126"
+ )
+
+ # Fetch jurisdiction IDs based on town names
+ jurisdiction_ids = fetch_jurisdiction_ids(town_file, jurisdiction_url)
+ print(f"Jurisdiction IDs fetched: {jurisdiction_ids}")
+
+ # Fetch FOIA data for each jurisdiction ID
+ fetch_foia_data(jurisdiction_ids)
+
+
+# Run the main function
+if __name__ == "__main__":
+ main()
diff --git a/source_collectors/muckrock/muck_get.py b/source_collectors/muckrock/muck_get.py
new file mode 100644
index 00000000..20c29338
--- /dev/null
+++ b/source_collectors/muckrock/muck_get.py
@@ -0,0 +1,61 @@
+"""
+muck_get.py
+
+"""
+
+import requests
+import json
+
+# Define the base API endpoint
+base_url = "https://www.muckrock.com/api_v1/foia/"
+
+# Define the search string
+search_string = "use of force"
+per_page = 100
+page = 1
+all_results = []
+max_count = 20
+
+while True:
+
+ # Make the GET request with the search string as a query parameter
+ response = requests.get(
+ base_url, params={"page": page, "page_size": per_page, "format": "json"}
+ )
+
+ # Check if the request was successful
+ if response.status_code == 200:
+ # Parse the JSON response
+ data = response.json()
+
+ if not data["results"]:
+ break
+
+ filtered_results = [
+ item
+ for item in data["results"]
+ if search_string.lower() in item["title"].lower()
+ ]
+
+ all_results.extend(filtered_results)
+
+ if len(filtered_results) > 0:
+ num_results = len(filtered_results)
+ print(f"found {num_results} more matching result(s)...")
+
+ if len(all_results) >= max_count:
+ print("max count reached... exiting")
+ break
+
+ page += 1
+
+ else:
+ print(f"Error: {response.status_code}")
+ break
+
+# Dump list into a JSON file
+json_out_file = search_string.replace(" ", "_") + ".json"
+with open(json_out_file, "w") as json_file:
+ json.dump(all_results, json_file)
+
+print(f"List dumped into {json_out_file}")
diff --git a/source_collectors/muckrock/muckrock_ml_labeler.py b/source_collectors/muckrock/muckrock_ml_labeler.py
new file mode 100644
index 00000000..b313c045
--- /dev/null
+++ b/source_collectors/muckrock/muckrock_ml_labeler.py
@@ -0,0 +1,52 @@
+"""
+muckrock_ml_labeler.py
+
+"""
+
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+import pandas as pd
+import argparse
+
+# Load the tokenizer and model
+model_name = "PDAP/fine-url-classifier"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+model.eval()
+
+# Load the dataset from command line argument
+parser = argparse.ArgumentParser(description="Load CSV file into a pandas DataFrame.")
+parser.add_argument("--csv_file", type=str, required=True, help="Path to the CSV file")
+args = parser.parse_args()
+df = pd.read_csv(args.csv_file)
+
+# Combine multiple columns (e.g., 'url', 'html_title', 'h1') into a single text field for each row
+columns_to_combine = [
+ "url_path",
+ "html_title",
+ "h1",
+] # Add other columns here as needed
+df["combined_text"] = df[columns_to_combine].apply(
+ lambda row: " ".join(row.values.astype(str)), axis=1
+)
+
+# Convert the combined text into a list
+texts = df["combined_text"].tolist()
+
+# Tokenize the inputs
+inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
+
+# Perform inference
+with torch.no_grad():
+ outputs = model(**inputs)
+
+# Get the predicted labels
+predictions = torch.argmax(outputs.logits, dim=-1)
+
+# Map predictions to labels
+labels = model.config.id2label
+predicted_labels = [labels[int(pred)] for pred in predictions]
+
+# Add the predicted labels to the dataframe and save
+df["predicted_label"] = predicted_labels
+df.to_csv("labeled_muckrock_dataset.csv", index=False)
diff --git a/source_collectors/muckrock/requirements.txt b/source_collectors/muckrock/requirements.txt
new file mode 100644
index 00000000..babb4f3e
--- /dev/null
+++ b/source_collectors/muckrock/requirements.txt
@@ -0,0 +1,30 @@
+certifi==2024.8.30
+charset-normalizer==3.4.0
+filelock==3.16.1
+fsspec==2024.10.0
+huggingface-hub==0.26.1
+idna==3.10
+Jinja2==3.1.4
+logging==0.4.9.6
+MarkupSafe==3.0.2
+mpmath==1.3.0
+networkx==3.4.2
+numpy==2.1.2
+packaging==24.1
+pandas==2.2.3
+python-dateutil==2.9.0.post0
+pytz==2024.2
+PyYAML==6.0.2
+regex==2024.9.11
+requests==2.32.3
+safetensors==0.4.5
+setuptools==75.2.0
+six==1.16.0
+sympy==1.13.1
+tokenizers==0.20.1
+torch==2.5.0
+tqdm==4.66.5
+transformers==4.46.0
+typing_extensions==4.12.2
+tzdata==2024.2
+urllib3==2.2.3
diff --git a/source_collectors/muckrock/search_foia_data_db.py b/source_collectors/muckrock/search_foia_data_db.py
new file mode 100644
index 00000000..e7550608
--- /dev/null
+++ b/source_collectors/muckrock/search_foia_data_db.py
@@ -0,0 +1,188 @@
+"""
+search_foia_data_db.py
+
+This script provides search functionality for the `foia_data.db` SQLite database. The search matches a
+user-provided string against the `title` and `tags` fields of FOIA requests.
+Run this after companion script `create_foia_data_db.py`.
+
+A successful run will output a JSON file containing entries matching the search string.
+
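+Usage (as described in the README):
+    python3 search_foia_data_db.py --search_for "use of force"
+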
+Functions:
+ - parser_init()
+ - search_foia_db()
+ - parse_communications_column()
+ - generate_json()
+ - main()
+
+Error Handling:
+Errors encountered during database operations, JSON parsing, or file writing are printed to the console.
+"""
+
+import sqlite3
+import pandas as pd
+import json
+import argparse
+import os
+from typing import Union, List, Dict
+
+check_results_table_query = """
+ SELECT name FROM sqlite_master
+ WHERE (type = 'table')
+ AND (name = 'results')
+ """
+
+search_foia_query = """
+ SELECT * FROM results
+ WHERE (title LIKE ? OR tags LIKE ?)
+ AND (status = 'done')
+ """
+
+
+def parser_init() -> argparse.ArgumentParser:
+ """
+ Initializes the argument parser for search_foia_data_db.py.
+
+ Returns:
+ argparse.ArgumentParser: The configured argument parser.
+ """
+
+ parser = argparse.ArgumentParser(
+ description="Search foia_data.db and generate a JSON file of resulting matches"
+ )
+ parser.add_argument(
+ "--search_for",
+ type=str,
+ required=True,
+ metavar="",
+ help="Provide a string to search foia_data.db",
+ )
+
+ return parser
+
+
+def search_foia_db(search_string: str) -> Union[pd.DataFrame, None]:
+ """
+ Searches the foia_data.db database for FOIA request entries matching the provided search string.
+
+ Args:
+ search_string (str): The string to search for in the `title` and `tags` of the `results` table.
+
+ Returns:
+ Union[pandas.DataFrame, None]:
+ - pandas.DataFrame: A DataFrame containing the matching entries from the database.
+ - None: If an error occurs during the database operation.
+
+ Raises:
+ sqlite3.Error: If any database operation fails, prints error and returns None.
+ Exception: If any unexpected error occurs, prints error and returns None.
+ """
+
+ print(f'Searching foia_data.db for "{search_string}"...')
+
+ try:
+ with sqlite3.connect("foia_data.db") as conn:
+
+ results_table = pd.read_sql_query(check_results_table_query, conn)
+
+ if results_table.empty:
+ print("The `results` table does not exist in the database.")
+ return None
+
+ params = [f"%{search_string}%", f"%{search_string}%"]
+
+ df = pd.read_sql_query(search_foia_query, conn, params=params)
+
+ except sqlite3.Error as e:
+ print(f"Sqlite error: {e}")
+ return None
+ except Exception as e:
+ print(f"An unexpected error occurred: {e}")
+ return None
+
+ return df
+
+
+def parse_communications_column(communications) -> List[Dict]:
+ """
+ Parses a communications column value, decoding it from JSON format.
+
+ Args:
+ communications : The input value to be parsed, which can be a JSON string or NaN.
+
+ Returns:
+ list (List[Dict]): A list containing the parsed JSON data. If the input is NaN (missing values) or
+ there is a JSON decoding error, an empty list is returned.
+
+ Raises:
+ json.JSONDecodeError: If deserialization fails, prints error and returns empty list.
+ """
+
+ if pd.isna(communications):
+ return []
+ try:
+ return json.loads(communications)
+ except json.JSONDecodeError as e:
+ print(f"Error decoding JSON: {e}")
+ return []
+
+
+def generate_json(df: pd.DataFrame, search_string: str) -> None:
+ """
+ Generates a JSON file from a pandas DataFrame.
+
+ Args:
+ df (pandas.DataFrame): The DataFrame containing the data to be written to the JSON file.
+
+ search_string (str): The string used to name the output JSON file. Spaces in the string
+ are replaced with underscores.
+
+ Returns:
+ None
+
+ Raises:
+ Exception: If writing to JSON file operation fails, prints error and returns.
+ """
+
+ output_json = f"{search_string.replace(' ', '_')}.json"
+
+ try:
+ df.to_json(output_json, orient="records", indent=4)
+ print(f'Matching entries written to "{output_json}"')
+ except Exception as e:
+ print(f"An error occurred while writing JSON: {e}")
+
+
+def main() -> None:
+ """
+ Function to search the foia_data.db database for entries matching a specified search string.
+
+ Command Line Args:
+ --search_for (str): A string to search for in the `title` and `tags` fields of FOIA requests.
+ """
+
+ parser = parser_init()
+ args = parser.parse_args()
+ search_string = args.search_for
+
+ if not os.path.exists("foia_data.db"):
+ print(
+ "foia_data.db does not exist.\nRun create_foia_data_db.py first to create and populate it."
+ )
+ return
+
+ df = search_foia_db(search_string)
+ if df is None:
+ return
+
+ if not df["communications"].empty:
+ df["communications"] = df["communications"].apply(parse_communications_column)
+
+ print(
+ f'Found {df.shape[0]} matching entries containing "{search_string}" in the title or tags'
+ )
+
+ generate_json(df, search_string)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/source_collectors/muckrock/search_local_foia_json.py b/source_collectors/muckrock/search_local_foia_json.py
new file mode 100644
index 00000000..562c4bae
--- /dev/null
+++ b/source_collectors/muckrock/search_local_foia_json.py
@@ -0,0 +1,53 @@
+"""
+***DEPRECATED***
+
+search_local_foia_json.py
+
+"""
+
+import json
+
+# Specify the JSON file path
+json_file = "foia_data.json"
+search_string = "use of force"
+
+# Load the JSON data
+with open(json_file, "r", encoding="utf-8") as file:
+ data = json.load(file)
+
+# List to store matching entries
+matching_entries = []
+
+
+def search_entry(entry):
+ """
+ search within an entry
+ """
+ # Check if 'status' is 'done'
+ if entry.get("status") != "done":
+ return False
+
+ # Check if 'title' or 'tags' field contains the search string
+ title_match = "title" in entry and search_string.lower() in entry["title"].lower()
+ tags_match = "tags" in entry and any(
+ search_string.lower() in tag.lower() for tag in entry["tags"]
+ )
+
+ return title_match or tags_match
+
+
+# Iterate through the data and collect matching entries
+for entry in data:
+ if search_entry(entry):
+ matching_entries.append(entry)
+
+# Output the results
+print(
+ f"Found {len(matching_entries)} entries containing '{search_string}' in the title or tags."
+)
+
+# Optionally, write matching entries to a new JSON file
+with open("matching_entries.json", "w", encoding="utf-8") as file:
+ json.dump(matching_entries, file, indent=4)
+
+print("Matching entries written to 'matching_entries.json'")
diff --git a/source_collectors/muckrock/utils.py b/source_collectors/muckrock/utils.py
new file mode 100644
index 00000000..3d8b63db
--- /dev/null
+++ b/source_collectors/muckrock/utils.py
@@ -0,0 +1,26 @@
+"""
+utils.py
+
+Provides useful functions for muckrock_tools.
+
+Functions:
+ - format_filename_json_to_csv()
+"""
+
+import re
+
+
+def format_filename_json_to_csv(json_filename: str) -> str:
+ """
+ Converts JSON filename format to CSV filename format.
+
+ Args:
+        json_filename (str): A JSON filename string.
+
+ Returns:
+ csv_filename (str): A CSV filename string.
+
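+    Example (illustrative):
+        format_filename_json_to_csv("use_of_force.json") -> "use-of-force.csv"
+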
+ """
+ csv_filename = re.sub(r"_(?=[^.]*$)", "-", json_filename[:-5]) + ".csv"
+
+ return csv_filename