Seed registry with 3,234 domains from UCPChecker.com #2

Open
homototus wants to merge 1 commit into main from feat/seed-and-cleanup

@homototus

Owner

Summary

  • Seed registry.json with 3,234 merchant domains from UCPChecker.com (CC-BY 4.0)
  • All domains added as status: "pending" — our crawler will independently verify each on next run
  • Add scripts/seed_from_ucpchecker.py for reproducible seeding
  • Fix: move urljoin to module-level import in verify.py
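A minimal sketch of what the `urljoin` fix in the last bullet might look like, assuming `verify.py` builds the profile URL from a bare domain. The helper name `well_known_url` and the `https://` scheme are assumptions for illustration, not code taken from this PR:

```python
# Module-level import: resolved once when the module loads,
# instead of on every call from inside a function body.
from urllib.parse import urljoin

WELL_KNOWN_PATH = "/.well-known/ucp"  # path named in this PR


def well_known_url(domain: str) -> str:
    """Hypothetical helper: build the profile URL for a bare domain."""
    # Assumption: nodes are always fetched over HTTPS.
    base = f"https://{domain.strip().lower().rstrip('/')}/"
    return urljoin(base, WELL_KNOWN_PATH)
```

Because the path starts with `/`, `urljoin` replaces any path on the base and keeps only the scheme and host.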

How seeding works

  1. Fetch all 65 pages of UCPChecker's public directory with a 2s delay between requests (65 requests total, under their 100 req/hr limit)
  2. Extract domain names only (no proprietary data copied)
  3. Add each as a pending node in registry.json
  4. Our crawler independently verifies via /.well-known/ucp
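Steps 2 and 3 above could be sketched roughly as follows. This is not the actual `scripts/seed_from_ucpchecker.py`: the registry layout (a top-level `"nodes"` mapping keyed by domain), the `source` field, and both function names are assumptions made for illustration. Only the `status: "pending"` value comes from this PR.

```python
from urllib.parse import urlparse


def normalize_domain(raw: str) -> str:
    """Reduce a URL or hostname to a lowercase bare domain ('' if none)."""
    raw = raw.strip()
    # urlparse only recognises the host when a '//' separator is present.
    parsed = urlparse(raw if "//" in raw else f"//{raw}")
    return parsed.hostname or ""  # hostname is lowercased and drops any port


def add_pending(registry: dict, domains: list[str]) -> int:
    """Add unseen domains as pending nodes; return how many were added."""
    nodes = registry.setdefault("nodes", {})  # assumed registry layout
    added = 0
    for raw in domains:
        domain = normalize_domain(raw)
        # Never overwrite an existing node, so earlier verification results survive.
        if domain and domain not in nodes:
            nodes[domain] = {"status": "pending", "source": "ucpchecker.com"}
            added += 1
    return added
```

Keeping the merge idempotent means the seeding script can be re-run without duplicating nodes or resetting ones the crawler has already verified.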

After merge

The next workflow run (~6h cron or manual trigger) will verify all 3,236 nodes. Expect many to become verified — UCPChecker reports ~2,800 with valid profiles.

Data attribution

UCPChecker is credited in README under "Data Sources". Their data license is CC-BY 4.0.

Seeding script fetches merchant domains from UCPChecker's public
directory (65 pages, 2s delay between requests, well under their
100 req/hour limit). Only domain names are extracted — all node
data is verified independently by our crawler.

Also fixes: move urljoin to module-level import in verify.py.

UCPChecker data is CC-BY 4.0 licensed and credited in README.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>