@maxachis
Collaborator

#133

This PR incorporates several changes, including:

  • Logic for pipelining URLs into Hugging Face
  • Logic for retrieving HTML data for URLs
  • Logic for repeatedly setting up "cycles" that continually retrieve the above data where needed
  • Logic for dumping production data and setting up a test database from the most up-to-date version of the database (for testing against live data)
  • Bug fixes and QoL improvements for the HTML tag collector
  • Revisions to the Root URL Cache so that it persists in the database rather than in a JSON file that is deleted when a container stops
  • Creation of a /url endpoint for retrieving information about URLs without filtering by batch
  • Creation of /annotation/url endpoints for annotating URLs for relevancy
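
To illustrate the Root URL Cache change, here is a minimal sketch of a cache backed by a database table instead of a JSON file. It is not the code in this PR: the class name `RootURLCache`, the table name `root_url_cache`, its columns, and the use of `sqlite3` as a stand-in database are all assumptions made for the example.

```python
import sqlite3
from typing import Optional


class RootURLCache:
    """Hypothetical cache mapping a root URL to its page title, stored in a table."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        # Table and column names are assumptions for this sketch.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS root_url_cache ("
            "url TEXT PRIMARY KEY, page_title TEXT NOT NULL)"
        )
        self.conn.commit()

    def get(self, url: str) -> Optional[str]:
        """Return the cached title for url, or None if it is not cached."""
        row = self.conn.execute(
            "SELECT page_title FROM root_url_cache WHERE url = ?", (url,)
        ).fetchone()
        return row[0] if row else None

    def put(self, url: str, page_title: str) -> None:
        """Insert or overwrite the cached title for url."""
        self.conn.execute(
            "INSERT OR REPLACE INTO root_url_cache (url, page_title) VALUES (?, ?)",
            (url, page_title),
        )
        self.conn.commit()
```

Because the entries live in the database, a new `RootURLCache` built on the same database sees values written by an earlier instance, which a JSON file inside a stopped container would not guarantee.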

* Create URL Metadata Table
* Convert batch `status`, `strategy` columns to enums
* Convert URL `status` column to enum
* Add new migration tests
* Add database structure tests
* Update tests
- The previously used requests_html library is no longer maintained and was causing bugs
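
As a rough illustration of what converting a free-text `status` column to an enum buys, the sketch below uses Python's standard `enum` module. The class name `URLStatus`, its members, and the helper `parse_status` are hypothetical; the real enum values are not listed in this PR description.

```python
from enum import Enum


class URLStatus(Enum):
    # Hypothetical values; the PR description does not list the real ones.
    PENDING = "pending"
    SUBMITTED = "submitted"
    ERROR = "error"


def parse_status(raw: str) -> URLStatus:
    """Convert a stored string to a URLStatus.

    Raises ValueError for any string that is not a defined status.
    """
    return URLStatus(raw)
```

With a plain string column, a misspelled status would be stored silently; with an enum, the invalid value is rejected at the boundary.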
@gitguardian

gitguardian bot commented Jan 19, 2025

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
| --- | --- | --- | --- | --- |
| 15222185 | Triggered | Generic Password | 974ca99 | local_database/DataDumper/docker-compose.yml |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely, following best practices for secret storage.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
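
One common way to act on step 2 is to read the secret from the environment instead of committing it. The following is a minimal sketch under that assumption; the variable name `DB_PASSWORD` and the helper `get_db_password` are hypothetical and are not taken from this repository.

```python
import os


def get_db_password(env_var: str = "DB_PASSWORD") -> str:
    """Return the database password from the environment.

    The variable name is an assumption for this sketch. Failing fast when it
    is missing avoids silently falling back to a hardcoded default.
    """
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return value
```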

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@maxachis maxachis merged commit e39d687 into dev Jan 23, 2025
4 checks passed
@maxachis maxachis deleted the mc_133_huggingface_pipeline branch April 17, 2025 19:40
3 participants