Track seen URLs at all the domains we monitor #173

Open
Mr0grog opened this issue Feb 7, 2025 · 0 comments

Comments

@Mr0grog
Member

Mr0grog commented Feb 7, 2025

We’d like to know when new pages are added to agency websites or old pages are removed from them, but that information is hard to find! At the moment, it’s impractical to do a full crawl of all the websites we monitor (that would be like half of End-of-Term Archive’s job!). There might be narrower ways to do regular crawls along those lines, but that work is both high effort and highly speculative.

On the other hand, we do have a nice source of rough data: all the links on the several thousand pages we already track regularly. We should build an index of all known page URLs at sites we monitor, based on all the links in the current versions of all pages. We can then re-run that calculation and compare the two lists of links whenever a new capture is imported or just on a regular basis, maybe as part of generating weekly task sheets, when we are already looking at every changed page’s links. This will obviously have a lot of false positives (new links to already-existing pages) and miss lots of new pages (those that aren’t directly linked from the pages we monitor), but it’s still likely to be pretty good, and at least much simpler to experiment with than trying to crawl entire sites.
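To make the index-and-compare idea concrete, here is a rough sketch in Python. It is not existing project code: the function names (`build_url_index`, `compare_indexes`), the input shape (a dict mapping each tracked page URL to the hrefs found in its current version), and the `example.gov` domain are all assumptions made up for illustration. It also normalizes only lightly (resolving relative links and dropping fragments); real-world matching would probably need more.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def build_url_index(pages, monitored_domains):
    """Collect every linked URL that lives on a monitored domain.

    `pages` is a hypothetical input shape: {tracked_page_url: [href, ...]},
    where the hrefs come from the current version of each tracked page.
    """
    index = set()
    for page_url, hrefs in pages.items():
        for href in hrefs:
            # Resolve relative links against the page and drop any #fragment.
            absolute, _fragment = urldefrag(urljoin(page_url, href))
            parts = urlsplit(absolute)
            if parts.scheme not in ("http", "https"):
                continue  # skip mailto:, javascript:, etc.
            host = (parts.hostname or "").lower()
            # Keep the URL if its host is a monitored domain or a subdomain of one.
            if any(host == d or host.endswith("." + d) for d in monitored_domains):
                index.add(absolute)
    return index


def compare_indexes(previous, current):
    """Return URLs that appeared or disappeared between two index runs."""
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }


# Example with made-up data for two runs of the calculation.
domains = {"example.gov"}
last_week = build_url_index(
    {"https://www.example.gov/": ["/about", "/data#top", "https://other.org/x"]},
    domains,
)
this_week = build_url_index(
    {"https://www.example.gov/": ["/about", "/news/2025", "mailto:a@example.gov"]},
    domains,
)
changes = compare_indexes(last_week, this_week)
```

Keeping the index as a plain set of strings makes the comparison a simple set difference. As noted above, an "added" URL only means the link is new to the pages we track, not necessarily that the page itself is new.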

See also https://edgi.slack.com/archives/CFA6LE5GX/p1738883150312579

@Mr0grog Mr0grog moved this to Inbox in Web Monitoring Feb 17, 2025
@Mr0grog Mr0grog moved this from Inbox to Backlog in Web Monitoring Feb 17, 2025
@Mr0grog Mr0grog moved this from Backlog to Prioritized in Web Monitoring Feb 17, 2025
@Mr0grog Mr0grog mentioned this issue Feb 19, 2025
24 tasks