Track seen URLs at all the domains we monitor #173

Mr0grog · 2025-02-07T08:06:34Z

We’d like to know when new pages are added to, or old pages removed from, agency websites, but finding that info is hard! At the moment, it’s impractical to do a full crawl of all the websites we monitor (that would be like half of End-of-Term Archive’s job!). There might be narrower ways to do regular, targeted crawls, but that work is both high effort and highly speculative.

On the other hand, we do have a nice source of rough data: all the links on the several thousand pages we already track regularly. We should build an index of all known page URLs at sites we monitor based on all the links in the current versions of all pages. We can then re-run that calculation and compare the two lists of links whenever a new capture is imported or just on a regular basis (maybe as part of generating weekly task sheets, when we are already looking at every changed page’s links). This will obviously have a lot of false positives (new links to already-existing pages) and miss lots of new pages (those that aren’t directly linked from pages we monitor), but it’s still likely to be pretty good. At the very least, it’s much simpler to experiment with than trying to crawl entire sites.
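
A rough sketch of the index-and-compare idea, in Python. Everything here is hypothetical and for illustration only: the function names, the `page_links` input shape (a mapping of tracked page URL to the raw `href` values found in its current version), and the normalization rules are assumptions, not existing web-monitoring code.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def normalize(url: str) -> str:
    """Drop the fragment and lowercase the host so trivial variants match.

    Assumption: real normalization rules would probably need to be stricter.
    """
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)
    return parts._replace(netloc=parts.netloc.lower()).geturl()


def build_url_index(page_links, monitored_domains):
    """Collect every linked URL on a monitored domain (or a subdomain of one).

    page_links: {tracked page URL: [raw hrefs in its current version]}
    monitored_domains: set of bare domains, e.g. {"epa.gov"}
    """
    index = set()
    for page_url, hrefs in page_links.items():
        for href in hrefs:
            # Resolve relative links against the page they were found on.
            absolute = normalize(urljoin(page_url, href))
            parts = urlsplit(absolute)
            if parts.scheme not in ("http", "https"):
                continue  # skip mailto:, javascript:, etc.
            host = parts.hostname or ""
            if any(host == d or host.endswith("." + d) for d in monitored_domains):
                index.add(absolute)
    return index


def diff_indexes(previous, current):
    """Compare two index snapshots and report candidate new/removed URLs."""
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }


# Example: two snapshots of the links on one tracked page.
last_week = build_url_index(
    {"https://www.epa.gov/air": ["/air/quality", "https://example.com/x", "#top"]},
    {"epa.gov"},
)
this_week = build_url_index(
    {"https://www.epa.gov/air": ["/air/quality#data", "/air/new-rule"]},
    {"epa.gov"},
)
changes = diff_indexes(last_week, this_week)
```

As noted above, "added" entries are only candidates: a URL can show up because an existing page was newly linked, and "removed" only means no tracked page links to it anymore.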

See also https://edgi.slack.com/archives/CFA6LE5GX/p1738883150312579

Mr0grog added [priority-★☆☆] idea labels Feb 7, 2025

Mr0grog moved this to Inbox in Web Monitoring Feb 17, 2025

Mr0grog added this to Web Monitoring Feb 17, 2025

Mr0grog moved this from Inbox to Backlog in Web Monitoring Feb 17, 2025

Mr0grog moved this from Backlog to Prioritized in Web Monitoring Feb 17, 2025

Mr0grog mentioned this issue Feb 19, 2025

2025 Q1 Roadmap #174

Open

24 tasks
