We’d like to know when new pages are added or old pages removed from agency websites, but finding that info is hard! At the moment, it’s impractical to do a full crawl of all the websites we monitor (that would be like half of the End-of-Term Archive’s job!). There might be narrower ways to do regular crawls along those lines, but that work would be both high-effort and highly speculative.
On the other hand, we do have a nice source of rough data: all the links on the several thousand pages we already track regularly. We could build an index of all known page URLs at the sites we monitor, based on the links in the current version of every tracked page. We could then re-run that calculation and compare the two lists of URLs, either whenever a new capture is imported or just on a regular schedule (maybe as part of generating weekly task sheets, when we are already looking at every changed page’s links). This will obviously produce a lot of false positives (new links to already-existing pages) and miss lots of new pages (ones that aren’t directly linked from pages we monitor), but it’s still likely to be pretty useful, and it’s much simpler to experiment with than trying to crawl entire sites.
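A minimal sketch of how the index-and-diff could work, assuming we can get the HTML of the latest capture of each tracked page. All the function names and data shapes here are hypothetical, not part of any existing web-monitoring API:

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def normalize(url: str) -> str:
    """Drop query strings and fragments so trivially different links collapse."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path.rstrip('/'), '', ''))


def page_links(page_url: str, html: str, monitored_domains: set[str]) -> set[str]:
    """Normalized targets of all links on one page that point at sites we monitor."""
    soup = BeautifulSoup(html, 'html.parser')
    found = set()
    for anchor in soup.find_all('a', href=True):
        absolute = urljoin(page_url, anchor['href'])  # resolve relative links
        if urlsplit(absolute).netloc.lower() in monitored_domains:
            found.add(normalize(absolute))
    return found


def build_index(current_captures: dict[str, str], monitored_domains: set[str]) -> set[str]:
    """Union of known URLs across the current versions of all tracked pages.

    `current_captures` maps each tracked page's URL to its latest HTML.
    """
    index = set()
    for page_url, html in current_captures.items():
        index |= page_links(page_url, html, monitored_domains)
    return index


# Comparing this run's index against the previous one is just set arithmetic:
#   new_urls  = current_index - previous_index   # candidate new pages
#   gone_urls = previous_index - current_index   # candidate removed pages
```

The diff itself is cheap set arithmetic, so the expensive part is only the link extraction, which we could piggyback on whatever already parses each changed page.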
See also https://edgi.slack.com/archives/CFA6LE5GX/p1738883150312579