1 change: 1 addition & 0 deletions README.md
@@ -12,6 +12,7 @@ hugging_face | Utilities for interacting with our machine learning space at [Hug
identification_pipeline.py | The core python script uniting this modular pipeline. More details below.
openai-playground | Scripts for accessing the openai API on PDAP's shared account
source_collectors| Tools for extracting metadata from different sources, including CKAN data portals and Common Crawler
collector_db | Database for storing data from source collectors

## How to use

14 changes: 14 additions & 0 deletions collector_db/BatchInfo.py
@@ -0,0 +1,14 @@
"""Pydantic model describing a batch of collected URLs."""
from typing import Optional

from pydantic import BaseModel


class BatchInfo(BaseModel):
    """Summary statistics and parameters for a single collection batch."""

    strategy: str
    status: str
    count: int = 0
    # Optional[...] lets these fields accept the NULLs stored in the database.
    strategy_success_rate: Optional[float] = None
    metadata_success_rate: Optional[float] = None
    agency_match_rate: Optional[float] = None
    record_type_match_rate: Optional[float] = None
    record_category_match_rate: Optional[float] = None
    compute_time: Optional[int] = None
    parameters: Optional[dict] = None
120 changes: 120 additions & 0 deletions collector_db/DatabaseClient.py
@@ -0,0 +1,120 @@
"""SQLAlchemy ORM models and a small client for the collector database."""
from functools import wraps
from typing import List, Optional

from sqlalchemy import create_engine, Column, Integer, String, Float, Text, JSON, ForeignKey, CheckConstraint, TIMESTAMP, text
from sqlalchemy.orm import declarative_base, sessionmaker, relationship

from collector_db.BatchInfo import BatchInfo
from collector_db.URLInfo import URLInfo

# Base class for SQLAlchemy ORM models
Base = declarative_base()


# SQLAlchemy ORM models
class Batch(Base):
    """A collection run and its summary statistics."""

    __tablename__ = 'batches'

    id = Column(Integer, primary_key=True)
    strategy = Column(String, nullable=False)
    status = Column(String, CheckConstraint("status IN ('in-process', 'complete', 'error')"), nullable=False)
    count = Column(Integer, nullable=False)
    # text() renders DEFAULT CURRENT_TIMESTAMP; a plain string would be
    # quoted and stored as the literal 'CURRENT_TIMESTAMP'.
    date_generated = Column(TIMESTAMP, nullable=False, server_default=text("CURRENT_TIMESTAMP"))
    strategy_success_rate = Column(Float)
    metadata_success_rate = Column(Float)
    agency_match_rate = Column(Float)
    record_type_match_rate = Column(Float)
    record_category_match_rate = Column(Float)
    compute_time = Column(Integer)
    parameters = Column(JSON)

    urls = relationship("URL", back_populates="batch")
    missings = relationship("Missing", back_populates="batch")


class URL(Base):
    """A URL collected as part of a batch."""

    __tablename__ = 'urls'

    id = Column(Integer, primary_key=True)
    batch_id = Column(Integer, ForeignKey('batches.id'), nullable=False)
    url = Column(Text, unique=True)
    url_metadata = Column(JSON)
    outcome = Column(String)
    created_at = Column(TIMESTAMP, nullable=False, server_default=text("CURRENT_TIMESTAMP"))

    batch = relationship("Batch", back_populates="urls")


class Missing(Base):
    """A place and record type for which a search found nothing."""

    __tablename__ = 'missing'

    id = Column(Integer, primary_key=True)
    place_id = Column(Integer, nullable=False)
    record_type = Column(String, nullable=False)
    batch_id = Column(Integer, ForeignKey('batches.id'))
    strategy_used = Column(Text, nullable=False)
    date_searched = Column(TIMESTAMP, nullable=False, server_default=text("CURRENT_TIMESTAMP"))

    batch = relationship("Batch", back_populates="missings")
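
Review note on the timestamp columns above: SQLAlchemy renders a plain string passed to `server_default` as a quoted literal, whereas a `text(...)` construct is emitted verbatim. The sketch below uses only the standard-library `sqlite3` module to show what the two resulting DDL forms do. The table names `quoted` and `keyword` are made up for the illustration and are not part of this PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Quoted default: every row receives the literal string 'CURRENT_TIMESTAMP'.
conn.execute(
    "CREATE TABLE quoted (id INTEGER PRIMARY KEY, "
    "created TEXT DEFAULT 'CURRENT_TIMESTAMP')"
)
# Keyword default: SQLite evaluates CURRENT_TIMESTAMP at insert time.
conn.execute(
    "CREATE TABLE keyword (id INTEGER PRIMARY KEY, "
    "created TEXT DEFAULT CURRENT_TIMESTAMP)"
)

conn.execute("INSERT INTO quoted DEFAULT VALUES")
conn.execute("INSERT INTO keyword DEFAULT VALUES")

quoted_value = conn.execute("SELECT created FROM quoted").fetchone()[0]
keyword_value = conn.execute("SELECT created FROM keyword").fetchone()[0]
conn.close()

print(quoted_value)   # the literal text CURRENT_TIMESTAMP
print(keyword_value)  # a UTC timestamp in 'YYYY-MM-DD HH:MM:SS' form
```

Only the second form gives `date_generated`, `created_at`, and `date_searched` a real timestamp.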


# Database Client
class DatabaseClient:
    """Thin wrapper around a SQLAlchemy engine that opens one session per call."""

    def __init__(self, db_url: str = "sqlite:///database.db"):
        """Initialize the DatabaseClient."""
        self.engine = create_engine(db_url, echo=True)
        Base.metadata.create_all(self.engine)
        # expire_on_commit=False keeps already-loaded attributes (such as the
        # primary key) readable on objects returned after the session closes.
        self.session_maker = sessionmaker(bind=self.engine, expire_on_commit=False)
        self.session = None

    def session_manager(method):
        """Run the wrapped method in a session that is committed or rolled back."""
        @wraps(method)
        def wrapper(self, *args, **kwargs):
            self.session = self.session_maker()
            try:
                result = method(self, *args, **kwargs)
                self.session.commit()
                return result
            except Exception:
                self.session.rollback()
                raise
            finally:
                self.session.close()
                self.session = None

        # Without this return, every decorated method would be replaced by None.
        return wrapper

    @session_manager
    def insert_batch(self, batch_info: BatchInfo) -> Batch:
        """Insert a new batch into the database."""
        batch = Batch(
            **batch_info.model_dump()
        )
        self.session.add(batch)
        return batch

    @session_manager
    def get_batch_by_id(self, batch_id: int) -> Optional[BatchInfo]:
        """Retrieve a batch by ID, or None if no such batch exists."""
        batch = self.session.query(Batch).filter_by(id=batch_id).first()
        if batch is None:
            return None
        return BatchInfo(**batch.__dict__)

    @session_manager
    def insert_url(self, url_info: URLInfo):
        """Insert a new URL into the database."""
        url_entry = URL(
            **url_info.model_dump()
        )
        self.session.add(url_entry)

    @session_manager
    def get_urls_by_batch(self, batch_id: int) -> List[URLInfo]:
        """Retrieve all URLs associated with a batch."""
        urls = self.session.query(URL).filter_by(batch_id=batch_id).all()
        return [URLInfo(**url.__dict__) for url in urls]

    @session_manager
    def is_duplicate_url(self, url: str) -> bool:
        """Return True if the URL is already stored in the database."""
        result = self.session.query(URL).filter_by(url=url).first()
        return result is not None


if __name__ == "__main__":
    client = DatabaseClient()
    print("Database client initialized.")
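
Review note on `session_manager`: the decorator opens a session per call, commits on success, rolls back on any exception, and always closes. It only works if the decorator returns `wrapper`. The sketch below exercises that flow with no database. `FakeSession` and `Client` are hypothetical stand-ins invented for this illustration; they are not part of the PR and only record which session methods are called.

```python
from functools import wraps


class FakeSession:
    """Hypothetical stand-in that logs calls instead of touching a database."""

    def __init__(self, log):
        self.log = log

    def commit(self):
        self.log.append("commit")

    def rollback(self):
        self.log.append("rollback")

    def close(self):
        self.log.append("close")


def session_manager(method):
    """Open a session for the call, then commit or roll back, then close."""
    @wraps(method)
    def wrapper(self, *args, **kwargs):
        self.session = self.session_maker()
        try:
            result = method(self, *args, **kwargs)
            self.session.commit()
            return result
        except Exception:
            self.session.rollback()
            raise
        finally:
            self.session.close()
            self.session = None

    return wrapper


class Client:
    """Hypothetical client whose session_maker hands out FakeSession objects."""

    def __init__(self):
        self.log = []
        self.session = None

    def session_maker(self):
        return FakeSession(self.log)

    @session_manager
    def ok(self):
        return "done"

    @session_manager
    def boom(self):
        raise ValueError("simulated failure")


client = Client()
result = client.ok()
success_log = list(client.log)  # ['commit', 'close']

client.log.clear()
try:
    client.boom()
except ValueError:
    pass
failure_log = list(client.log)  # ['rollback', 'close']
```

Because `self.session` is shared instance state, this pattern assumes a client is not used from several threads at once.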
8 changes: 8 additions & 0 deletions collector_db/URLInfo.py
@@ -0,0 +1,8 @@
"""Pydantic model describing a collected URL."""
from pydantic import BaseModel


class URLInfo(BaseModel):
    """A URL, its metadata, and its outcome within a batch."""

    batch_id: int
    url: str
    url_metadata: dict
    outcome: str
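
Review note on duplicates: each `URLInfo` ends up in the `urls` table, whose `url` column is declared unique, and `is_duplicate_url` offers a check before inserting. The sketch below uses the standard-library `sqlite3` module rather than the PR's SQLAlchemy client to show how the two mechanisms relate. The helper names mirror the PR for readability, but these standalone functions are assumptions written for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT UNIQUE)")


def is_duplicate_url(url):
    """Return True if the URL is already stored."""
    row = conn.execute("SELECT 1 FROM urls WHERE url = ?", (url,)).fetchone()
    return row is not None


def insert_url(url):
    """Return True if inserted, False if the unique constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO urls (url) VALUES (?)", (url,))
        return True
    except sqlite3.IntegrityError:
        return False


first = insert_url("https://example.com/a")   # inserted
second = insert_url("https://example.com/a")  # rejected by UNIQUE
```

The explicit check allows a friendlier code path, while the constraint remains the backstop if two writers race between the check and the insert.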
Empty file added collector_db/__init__.py
Empty file.
Binary file added collector_db/database.db
Binary file not shown.
2 changes: 2 additions & 0 deletions requirements.txt
@@ -24,3 +24,5 @@ requests_html>=0.10.0
lxml~=5.1.0
pyppeteer>=2.0.0
beautifulsoup4>=4.12.3

sqlalchemy~=2.0.36