Closed
28 commits
93bdbe7
Basic refactoring:
maxachis Dec 15, 2024
0fe043f
Basic refactoring:
maxachis Dec 15, 2024
4f5cc51
Basic refactoring:
maxachis Dec 15, 2024
9294fb0
Basic refactoring:
maxachis Dec 15, 2024
dd3ee80
Refactor: Add FOIAFetcher
maxachis Dec 15, 2024
435b090
Refactor: Add utility functions
maxachis Dec 15, 2024
7dd7d0c
Refactor: Create FOIASearcher
maxachis Dec 15, 2024
dd3f0a2
Remove `search_local_foia_json.py`
maxachis Dec 15, 2024
01d5f6b
Refactor: Create MuckrockFetchers
maxachis Dec 16, 2024
cc5b20d
Refactor: Modularize Logic
maxachis Dec 16, 2024
56062d2
Refactor: Modularize Logic
maxachis Dec 16, 2024
62f5a50
Refactor get_allegheny_foias.py
maxachis Dec 16, 2024
b6b30a4
Refactor create_foia_data_db.py
maxachis Dec 16, 2024
ee4a854
Refactor search_foia_data_db.py
maxachis Dec 16, 2024
ee76173
Refactor Directory
maxachis Dec 16, 2024
147a786
Begin draft of PDAP client
maxachis Dec 18, 2024
82d8c5b
Continue draft
maxachis Dec 18, 2024
55695fb
Continue draft of PDAP client
maxachis Dec 18, 2024
e8d599e
Refactor: Move AccessManager to separate file
maxachis Dec 18, 2024
8f10a8f
Add "Creating a Collector" section to CollectorManager readme.
maxachis Dec 22, 2024
67e54e8
Develop AutoGooglerCollector and connect to CollectorManager
maxachis Dec 22, 2024
da67086
Create Source Collector Core
maxachis Dec 24, 2024
0b50599
Incorporate CommonCrawler into Source Collector
maxachis Dec 25, 2024
34430cb
Merge branch 'mc_105_muckrock_scraper_enhancements' into mc_122_sourc…
maxachis Dec 25, 2024
c7fc102
Incorporate Muckrock Collectors into Source Collector
maxachis Dec 25, 2024
d8c19fd
Incorporate CKAN Collector into Source Collector
maxachis Dec 26, 2024
bde0ba7
Incorporate CKAN Collector into Source Collector
maxachis Dec 26, 2024
e1a8ad9
Add README files and rearrange some logic.
maxachis Dec 26, 2024
1 change: 1 addition & 0 deletions README.md
@@ -14,6 +14,7 @@ openai-playground | Scripts for accessing the openai API on PDAP's shared accoun
source_collectors| Tools for extracting metadata from different sources, including CKAN data portals and Common Crawler
collector_db | Database for storing data from source collectors
collector_manager | A module which provides a unified interface for interacting with source collectors and relevant data
core | A module which integrates other components, such as collector_manager and collector_db

## How to use

14 changes: 0 additions & 14 deletions collector_db/BatchInfo.py

This file was deleted.

20 changes: 20 additions & 0 deletions collector_db/DTOs/BatchInfo.py
@@ -0,0 +1,20 @@
from datetime import datetime

[flake8 warning] collector_db/DTOs/BatchInfo.py:1:1: D100 Missing docstring in public module
from typing import Optional

from pydantic import BaseModel

from core.enums import BatchStatus


class BatchInfo(BaseModel):

[flake8 warning] collector_db/DTOs/BatchInfo.py:9:1: D101 Missing docstring in public class
strategy: str
status: BatchStatus
parameters: dict
count: int = 0
strategy_success_rate: Optional[float] = None
metadata_success_rate: Optional[float] = None
agency_match_rate: Optional[float] = None
record_type_match_rate: Optional[float] = None
record_category_match_rate: Optional[float] = None
compute_time: Optional[float] = None
date_generated: Optional[datetime] = None
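
As an illustration of how a DTO like `BatchInfo` feeds `insert_batch`, here is a minimal sketch using only the standard library. A dataclass stands in for the pydantic model, `BatchInfoSketch` and `to_batch_columns` are hypothetical names, and the `BatchStatus` members are assumptions whose values are taken from the CHECK constraint in the old `DatabaseClient.py`; the real `core.enums.BatchStatus` may differ.

```python
from dataclasses import dataclass, asdict
from datetime import datetime
from enum import Enum
from typing import Optional


class BatchStatus(Enum):
    # Assumed members; values mirror the old CHECK constraint on batches.status.
    IN_PROCESS = "in-process"
    COMPLETE = "complete"
    ERROR = "error"


@dataclass
class BatchInfoSketch:
    # Dataclass stand-in for the pydantic BatchInfo (subset of its fields).
    strategy: str
    status: BatchStatus
    parameters: dict
    count: int = 0
    compute_time: Optional[float] = None
    date_generated: Optional[datetime] = None


def to_batch_columns(info: BatchInfoSketch) -> dict:
    """Flatten the DTO into column values, storing the enum as its string."""
    columns = asdict(info)
    columns["status"] = info.status.value
    # date_generated is left to the database default, so it is not sent.
    columns.pop("date_generated")
    return columns


info = BatchInfoSketch(
    strategy="example_strategy",
    status=BatchStatus.IN_PROCESS,
    parameters={"pages": 2},
)
columns = to_batch_columns(info)
print(columns)
```

Passing the enum's `.value` rather than the enum itself keeps the stored string compatible with a string-typed `status` column, which is what the explicit field mapping in `insert_batch` also does.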
8 changes: 8 additions & 0 deletions collector_db/DTOs/DuplicateInfo.py
@@ -0,0 +1,8 @@
from pydantic import BaseModel

[flake8 warning] collector_db/DTOs/DuplicateInfo.py:1:1: D100 Missing docstring in public module


class DuplicateInfo(BaseModel):

[flake8 warning] collector_db/DTOs/DuplicateInfo.py:4:1: D101 Missing docstring in public class
source_url: str
original_url_id: int
duplicate_metadata: dict
original_metadata: dict

[flake8 warning] collector_db/DTOs/DuplicateInfo.py:8:28: W292 no newline at end of file
9 changes: 9 additions & 0 deletions collector_db/DTOs/InsertURLsInfo.py
@@ -0,0 +1,9 @@
from pydantic import BaseModel

[flake8 warning] collector_db/DTOs/InsertURLsInfo.py:1:1: D100 Missing docstring in public module

from collector_db.DTOs.DuplicateInfo import DuplicateInfo
from collector_db.DTOs.URLMapping import URLMapping


class InsertURLsInfo(BaseModel):

[flake8 warning] collector_db/DTOs/InsertURLsInfo.py:7:1: D101 Missing docstring in public class
url_mappings: list[URLMapping]
duplicates: list[DuplicateInfo]

[flake8 warning] collector_db/DTOs/InsertURLsInfo.py:9:36: W292 no newline at end of file
1 change: 1 addition & 0 deletions collector_db/DTOs/README.md
@@ -0,0 +1 @@
This directory contains data transfer objects (DTOs) for the Source Collector Database.
13 changes: 13 additions & 0 deletions collector_db/DTOs/URLInfo.py
@@ -0,0 +1,13 @@
from typing import Optional

[flake8 warning] collector_db/DTOs/URLInfo.py:1:1: D100 Missing docstring in public module

from pydantic import BaseModel

from collector_manager.enums import URLOutcome


class URLInfo(BaseModel):

[flake8 warning] collector_db/DTOs/URLInfo.py:8:1: D101 Missing docstring in public class
id: Optional[int] = None
batch_id: Optional[int] = None
url: str
url_metadata: Optional[dict] = None
outcome: URLOutcome = URLOutcome.PENDING
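
The optional `id` and `batch_id` fields suggest that a `URLInfo` is created before it has a database identity, with `batch_id` filled in later (as `insert_urls` does). A minimal sketch of that lifecycle, assuming a dataclass stand-in for the pydantic model: `URLInfoSketch` and `assign_batch` are hypothetical names, the string values of `URLOutcome` are assumptions, and the required `url` field is moved first because dataclasses need non-default fields to come before defaulted ones.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class URLOutcome(Enum):
    # Only PENDING appears in the diff; values and SUBMITTED are assumed.
    PENDING = "pending"
    SUBMITTED = "submitted"


@dataclass
class URLInfoSketch:
    # Dataclass stand-in for the pydantic URLInfo.
    url: str
    id: Optional[int] = None
    batch_id: Optional[int] = None
    url_metadata: Optional[dict] = None
    outcome: URLOutcome = URLOutcome.PENDING


def assign_batch(url_infos: list[URLInfoSketch], batch_id: int) -> list[URLInfoSketch]:
    """Attach every collected URL to the batch before insertion."""
    for info in url_infos:
        info.batch_id = batch_id
    return url_infos


collected = [
    URLInfoSketch(url="https://example.com/a"),
    URLInfoSketch(url="https://example.com/b", url_metadata={"title": "B"}),
]
assigned = assign_batch(collected, batch_id=7)
```

A collector only has to supply `url` (and optionally `url_metadata`); the database layer owns `id` and `batch_id`.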
6 changes: 6 additions & 0 deletions collector_db/DTOs/URLMapping.py
@@ -0,0 +1,6 @@
from pydantic import BaseModel

[flake8 warning] collector_db/DTOs/URLMapping.py:1:1: D100 Missing docstring in public module


class URLMapping(BaseModel):

[flake8 warning] collector_db/DTOs/URLMapping.py:4:1: D101 Missing docstring in public class
url: str
url_id: int
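
The lookups in `DatabaseClient` (`get_batch_by_id`, `get_url_info_by_url`) are annotated as returning `Optional[...]`, yet they appear to build the DTO directly from the result of `.first()`, which is `None` when no row matches. One way to make the `Optional` return real is a None-safe conversion. This is a sketch only: `sqlite3` stands in for SQLAlchemy, and `URLMappingSketch` and `row_to_dto` are hypothetical names.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class URLMappingSketch:
    # Dataclass stand-in for the pydantic URLMapping.
    url: str
    url_id: int


def row_to_dto(row: Optional[sqlite3.Row]) -> Optional[URLMappingSketch]:
    """Return None for a missing row; otherwise copy only the DTO's fields."""
    if row is None:
        return None
    return URLMappingSketch(url=row["url"], url_id=row["id"])


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT UNIQUE)")
conn.execute("INSERT INTO urls (url) VALUES (?)", ("https://example.com/a",))

query = "SELECT id, url FROM urls WHERE url = ?"
found = row_to_dto(conn.execute(query, ("https://example.com/a",)).fetchone())
missing = row_to_dto(conn.execute(query, ("https://example.com/zzz",)).fetchone())
```

Copying named fields explicitly also avoids passing ORM bookkeeping attributes into the DTO constructor.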
File renamed without changes.
133 changes: 76 additions & 57 deletions collector_db/DatabaseClient.py
@@ -1,60 +1,21 @@
from functools import wraps

from sqlalchemy import create_engine, Column, Integer, String, Float, Text, JSON, ForeignKey, CheckConstraint, TIMESTAMP, UniqueConstraint
from sqlalchemy.orm import declarative_base, sessionmaker, relationship
from typing import Optional, Dict, Any, List
from sqlalchemy import create_engine
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import sessionmaker
from typing import Optional, List

from collector_db.BatchInfo import BatchInfo
from collector_db.URLInfo import URLInfo
from collector_db.DTOs.BatchInfo import BatchInfo
from collector_db.DTOs.DuplicateInfo import DuplicateInfo
from collector_db.DTOs.InsertURLsInfo import InsertURLsInfo
from collector_db.DTOs.URLMapping import URLMapping
from collector_db.DTOs.URLInfo import URLInfo
from collector_db.models import Base, Batch, URL
from core.enums import BatchStatus

# Base class for SQLAlchemy ORM models
Base = declarative_base()

# SQLAlchemy ORM models
class Batch(Base):
__tablename__ = 'batches'

id = Column(Integer, primary_key=True)
strategy = Column(String, nullable=False)
status = Column(String, CheckConstraint("status IN ('in-process', 'complete', 'error')"), nullable=False)
count = Column(Integer, nullable=False)
date_generated = Column(TIMESTAMP, nullable=False, server_default="CURRENT_TIMESTAMP")
strategy_success_rate = Column(Float)
metadata_success_rate = Column(Float)
agency_match_rate = Column(Float)
record_type_match_rate = Column(Float)
record_category_match_rate = Column(Float)
compute_time = Column(Integer)
parameters = Column(JSON)

urls = relationship("URL", back_populates="batch")
missings = relationship("Missing", back_populates="batch")


class URL(Base):
__tablename__ = 'urls'

id = Column(Integer, primary_key=True)
batch_id = Column(Integer, ForeignKey('batches.id'), nullable=False)
url = Column(Text, unique=True)
url_metadata = Column(JSON)
outcome = Column(String)
created_at = Column(TIMESTAMP, nullable=False, server_default="CURRENT_TIMESTAMP")

batch = relationship("Batch", back_populates="urls")


class Missing(Base):
__tablename__ = 'missing'

id = Column(Integer, primary_key=True)
place_id = Column(Integer, nullable=False)
record_type = Column(String, nullable=False)
batch_id = Column(Integer, ForeignKey('batches.id'))
strategy_used = Column(Text, nullable=False)
date_searched = Column(TIMESTAMP, nullable=False, server_default="CURRENT_TIMESTAMP")

batch = relationship("Batch", back_populates="missings")
# SQLAlchemy ORM models

[flake8 failure] collector_db/DatabaseClient.py:18:1: E303 too many blank lines (3)


# Database Client
@@ -81,28 +42,86 @@
self.session.close()
self.session = None

return wrapper

@session_manager
def insert_batch(self, batch_info: BatchInfo) -> Batch:
"""Insert a new batch into the database."""
def insert_batch(self, batch_info: BatchInfo) -> int:
"""Insert a new batch into the database and return its ID."""
batch = Batch(
**batch_info.model_dump()
strategy=batch_info.strategy,
status=batch_info.status.value,
parameters=batch_info.parameters,
count=batch_info.count,
compute_time=batch_info.compute_time,
strategy_success_rate=batch_info.strategy_success_rate,
metadata_success_rate=batch_info.metadata_success_rate,
agency_match_rate=batch_info.agency_match_rate,
record_type_match_rate=batch_info.record_type_match_rate,
record_category_match_rate=batch_info.record_category_match_rate,
)
self.session.add(batch)
return batch
self.session.commit()
self.session.refresh(batch)
return batch.id

@session_manager
def update_batch_post_collection(

[flake8 warning] collector_db/DatabaseClient.py:68:1: D102 Missing docstring in public method
self,
batch_id: int,
url_count: int,
batch_status: BatchStatus,
compute_time: float = None,
):
batch = self.session.query(Batch).filter_by(id=batch_id).first()
batch.count = url_count
batch.status = batch_status.value
batch.compute_time = compute_time

@session_manager
def get_batch_by_id(self, batch_id: int) -> Optional[BatchInfo]:
"""Retrieve a batch by ID."""
batch = self.session.query(Batch).filter_by(id=batch_id).first()
return BatchInfo(**batch.__dict__)

def insert_urls(self, url_infos: List[URLInfo], batch_id: int) -> InsertURLsInfo:

[flake8 warning] collector_db/DatabaseClient.py:86:1: D102 Missing docstring in public method
url_mappings = []
duplicates = []
for url_info in url_infos:
url_info.batch_id = batch_id
try:
url_id = self.insert_url(url_info)
url_mappings.append(URLMapping(url_id=url_id, url=url_info.url))
except IntegrityError:
orig_url_info = self.get_url_info_by_url(url_info.url)
duplicates.append(DuplicateInfo(
source_url=url_info.url,
original_url_id=orig_url_info.id,
duplicate_metadata=url_info.url_metadata,
original_metadata=orig_url_info.url_metadata
))

return InsertURLsInfo(url_mappings=url_mappings, duplicates=duplicates)


@session_manager
def insert_url(self, url_info: URLInfo):
def get_url_info_by_url(self, url: str) -> Optional[URLInfo]:

[flake8 warning] collector_db/DatabaseClient.py:107:1: D102 Missing docstring in public method
url = self.session.query(URL).filter_by(url=url).first()
return URLInfo(**url.__dict__)

@session_manager
def insert_url(self, url_info: URLInfo) -> int:
"""Insert a new URL into the database."""
url_entry = URL(
**url_info.model_dump()
batch_id=url_info.batch_id,
url=url_info.url,
url_metadata=url_info.url_metadata,
outcome=url_info.outcome.value
)
self.session.add(url_entry)
self.session.commit()
self.session.refresh(url_entry)
return url_entry.id


@session_manager
def get_urls_by_batch(self, batch_id: int) -> List[URLInfo]:
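
The duplicate handling in `insert_urls` relies on the `UNIQUE` constraint on `urls.url`: an insert that violates it raises `IntegrityError`, and the existing row is then looked up and reported as the original. The same pattern can be sketched with the standard library. Here `sqlite3` stands in for SQLAlchemy, plain dicts stand in for `URLMapping` and `DuplicateInfo`, and the table is a reduced, assumed version of the `urls` table.

```python
import sqlite3


def insert_urls(conn: sqlite3.Connection, urls: list[str], batch_id: int) -> dict:
    """Insert each URL; on a UNIQUE violation, record it as a duplicate."""
    url_mappings = []
    duplicates = []
    for url in urls:
        try:
            cursor = conn.execute(
                "INSERT INTO urls (batch_id, url) VALUES (?, ?)", (batch_id, url)
            )
            conn.commit()
            url_mappings.append({"url": url, "url_id": cursor.lastrowid})
        except sqlite3.IntegrityError:
            # Discard the failed statement, then find the row that already holds this URL.
            conn.rollback()
            original_id = conn.execute(
                "SELECT id FROM urls WHERE url = ?", (url,)
            ).fetchone()[0]
            duplicates.append({"source_url": url, "original_url_id": original_id})
    return {"url_mappings": url_mappings, "duplicates": duplicates}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE urls (id INTEGER PRIMARY KEY, batch_id INTEGER NOT NULL, url TEXT UNIQUE)"
)
result = insert_urls(
    conn,
    ["https://example.com/a", "https://example.com/b", "https://example.com/a"],
    batch_id=1,
)
```

Committing per URL keeps one failed insert from discarding earlier successful ones, at the cost of one transaction per row; a batch that is mostly new URLs might prefer a single bulk insert with conflict handling instead.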
1 change: 1 addition & 0 deletions collector_db/README.md
@@ -0,0 +1 @@
The collector database stores collector data and associated metadata. It comprises the database structure itself and the interfaces and helper functions for interacting with it.
8 changes: 0 additions & 8 deletions collector_db/URLInfo.py

This file was deleted.

75 changes: 75 additions & 0 deletions collector_db/models.py
@@ -0,0 +1,75 @@
"""
SQLAlchemy ORM models
"""
from sqlalchemy import func, Column, Integer, String, CheckConstraint, TIMESTAMP, Float, JSON, ForeignKey, Text
from sqlalchemy.orm import declarative_base, relationship

from core.enums import BatchStatus
from util.helper_functions import get_enum_values

# Base class for SQLAlchemy ORM models
Base = declarative_base()

status_check_string = ", ".join([f"'{status}'" for status in get_enum_values(BatchStatus)])

CURRENT_TIME_SERVER_DEFAULT = func.now()


class Batch(Base):

[flake8 warning] collector_db/models.py:18:1: D101 Missing docstring in public class
__tablename__ = 'batches'

id = Column(Integer, primary_key=True)
strategy = Column(String, nullable=False)
# Gives the status of the batch
status = Column(String, CheckConstraint(f"status IN ({status_check_string})"), nullable=False)
# The number of URLs in the batch
# TODO: Add means to update after execution
count = Column(Integer, nullable=False)
date_generated = Column(TIMESTAMP, nullable=False, server_default=CURRENT_TIME_SERVER_DEFAULT)
# How often URLs ended up approved in the database
strategy_success_rate = Column(Float)
# Percentage of metadata identified by models
metadata_success_rate = Column(Float)
# Rate of matching to agencies
agency_match_rate = Column(Float)
# Rate of matching to record types
record_type_match_rate = Column(Float)
# Rate of matching to record categories
record_category_match_rate = Column(Float)
# Time taken to generate the batch
# TODO: Add means to update after execution
compute_time = Column(Float)
# The parameters used to generate the batch
parameters = Column(JSON)

urls = relationship("URL", back_populates="batch")
missings = relationship("Missing", back_populates="batch")


class URL(Base):

[flake8 warning] collector_db/models.py:49:1: D101 Missing docstring in public class
__tablename__ = 'urls'

id = Column(Integer, primary_key=True)
# The batch this URL is associated with
batch_id = Column(Integer, ForeignKey('batches.id'), nullable=False)
url = Column(Text, unique=True)
# The metadata associated with the URL
url_metadata = Column(JSON)
# The outcome of the URL: submitted, human_labeling, rejected, duplicate, etc.
outcome = Column(String)
created_at = Column(TIMESTAMP, nullable=False, server_default=CURRENT_TIME_SERVER_DEFAULT)

batch = relationship("Batch", back_populates="urls")


class Missing(Base):

[flake8 warning] collector_db/models.py:65:1: D101 Missing docstring in public class
__tablename__ = 'missing'

id = Column(Integer, primary_key=True)
place_id = Column(Integer, nullable=False)
record_type = Column(String, nullable=False)
batch_id = Column(Integer, ForeignKey('batches.id'))
strategy_used = Column(Text, nullable=False)
date_searched = Column(TIMESTAMP, nullable=False, server_default=CURRENT_TIME_SERVER_DEFAULT)

batch = relationship("Batch", back_populates="missings")
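
Deriving `status_check_string` from `BatchStatus` keeps the CHECK constraint in step with the enum instead of hard-coding the allowed strings. The sketch below shows the idea against SQLite. The body of `get_enum_values` is a hypothetical stand-in for `util.helper_functions.get_enum_values`, which is not shown in this diff, and the `BatchStatus` members are assumed from the old hard-coded constraint.

```python
import sqlite3
from enum import Enum


class BatchStatus(Enum):
    # Assumed members; values mirror the old hard-coded CHECK constraint.
    IN_PROCESS = "in-process"
    COMPLETE = "complete"
    ERROR = "error"


def get_enum_values(enum_cls) -> list:
    # Hypothetical implementation of the helper imported in models.py.
    return [member.value for member in enum_cls]


status_check_string = ", ".join(
    f"'{status}'" for status in get_enum_values(BatchStatus)
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE batches ("
    "id INTEGER PRIMARY KEY, "
    f"status TEXT NOT NULL CHECK (status IN ({status_check_string})))"
)
conn.execute("INSERT INTO batches (status) VALUES (?)", (BatchStatus.COMPLETE.value,))

try:
    conn.execute("INSERT INTO batches (status) VALUES (?)", ("unknown",))
    rejected = False
except sqlite3.IntegrityError:
    # A value outside the enum violates the CHECK constraint.
    rejected = True
```

Interpolating enum values into DDL is reasonable here because the values are code-defined constants; the same approach would not be safe for user-supplied strings.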