This is a pilot program to explore how to archive URL references found in CVE Records. This repository and README are early-stage.
Link rot happens, in some cases intentionally and in some cases fairly quickly after a CVE Record is published. CVE is a valuable, historic, and reasonably comprehensive public data archive. CVE has outlasted many of the original sources of published vulnerability information. It is valuable to archive these sources (public URLs and their content).
An Automation Working Group (AWG) summary of discussion and requirements, a must-read.
Slides called CVE Reference Investigations that document some of the extent of the link rot problem, plus a threat vector involving CVE ID typosquatting.
Other slides outlining a somewhat more “in-house” solution (which is not the current plan, but things could change).
A flow chart, not necessarily accurate.
The docs/ directory.
No need to do everything at once, which may even be unwise, as we’ll learn along the way.
ArchiveBox for local collection. This collection is not served or shared in Phase 1, so only the project team and Secretariat are likely to have access. ArchiveBox uses the Django development web server, which we should probably not expose to the internet.
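As a rough illustration of how references could be gathered for local collection, the sketch below pulls reference URLs out of a CVE Record in CVE JSON 5.x form (`containers.cna.references` and `containers.adp[].references`). The function name `reference_urls` and the sample record are hypothetical and are not part of this repository.

```python
def reference_urls(record: dict) -> list[str]:
    """Collect unique reference URLs from a CVE JSON 5.x record, in order."""
    containers = record.get("containers", {})
    # The CNA container is a single object; ADP containers form a list.
    blocks = [containers.get("cna", {})] + list(containers.get("adp", []))
    seen = set()
    urls = []
    for block in blocks:
        for ref in block.get("references", []):
            url = ref.get("url")
            if url and url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


# Hypothetical, heavily trimmed record used only for illustration.
sample = {
    "containers": {
        "cna": {
            "references": [
                {"url": "https://example.com/advisory"},
                {"url": "https://example.org/patch"},
            ]
        },
        "adp": [
            {"references": [{"url": "https://example.com/advisory"}]}
        ],
    }
}

if __name__ == "__main__":
    # One URL per line, a form that `archivebox add` can read from stdin.
    for url in reference_urls(sample):
        print(url)
```

Deduplicating before handing URLs to the archiver keeps repeated references (for example, the same advisory listed by both the CNA and an ADP) from being fetched twice.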
We could also submit references to the Internet Archive Wayback Machine. This can be “fire and forget” or “be nicer and check dead links and check already submitted and recent-enough URLs before submitting.” The Wayback Machine has features to manage duplicate and "overly young" references.
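The "be nicer" variant might look something like the following sketch. It assumes the Wayback Machine availability API (`https://archive.org/wayback/available`) and the public Save Page Now URL form (`https://web.archive.org/save/<url>`). The 30-day "recent enough" threshold and all function names are assumptions chosen for illustration, not project decisions. A production version would likely need rate limiting and possibly the authenticated Save Page Now API.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timedelta, timezone

AVAILABILITY_API = "https://archive.org/wayback/available"
SAVE_ENDPOINT = "https://web.archive.org/save/"


def latest_snapshot_time(availability: dict):
    """Return the closest snapshot's time (UTC), or None if there is none."""
    closest = availability.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    # Wayback timestamps use the form YYYYMMDDhhmmss.
    stamp = datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S")
    return stamp.replace(tzinfo=timezone.utc)


def should_submit(availability: dict, now: datetime,
                  max_age: timedelta = timedelta(days=30)) -> bool:
    """Submit only if no snapshot exists or the newest one is too old.

    The 30-day default is an assumed threshold, not a project requirement.
    """
    snapshot = latest_snapshot_time(availability)
    return snapshot is None or now - snapshot > max_age


def fetch_availability(url: str, timeout: int = 30) -> dict:
    """Ask the availability API about the snapshot closest to now."""
    query = urllib.parse.urlencode({"url": url})
    with urllib.request.urlopen(f"{AVAILABILITY_API}?{query}",
                                timeout=timeout) as resp:
        return json.load(resp)


def submit(url: str, timeout: int = 120) -> int:
    """Request a capture via Save Page Now; returns the HTTP status code."""
    with urllib.request.urlopen(SAVE_ENDPOINT + url, timeout=timeout) as resp:
        return resp.status
```

A caller would combine these as `if should_submit(fetch_availability(url), datetime.now(timezone.utc)): submit(url)`. Keeping the decision in `should_submit` separate from the network calls makes the policy easy to test and to change.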
Review and reconsider ArchiveBox: continue with it, replace it with a different project, replace it with in-house software, switch to a paid external provider, or stay on-prem. Decide on and implement a way to share the local archive: a public web site, torrents, or access only for registered CNAs/Program members? Consider public services other than the Internet Archive, if such exist.
The future is unclear, but once we have something in place, archiving shouldn’t require a lot of major dynamic changes. Operate, add storage, manage and tweak crawler(s) and external destinations.