Wayback machine query GitHub Action

This action queries Archive.org's Wayback Machine to locate URLs for archived copies of web pages.

The action uses the Wayback Machine Availability API.
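For orientation, here is a minimal sketch of what a query against the Availability API looks like. It is not taken from this action's source. The helper names `buildAvailabilityUrl` and `closestSnapshotUrl` are hypothetical, and the sample response is hand-written for illustration (the `status` value in particular is assumed), modelled on the sample output further down.

```typescript
// Fields of the Availability API response that this sketch relies on.
interface AvailabilityResponse {
  archived_snapshots: {
    closest?: {
      available: boolean;
      url: string;
      timestamp: string;
      status: string;
    };
  };
}

// Build the query URL. The optional timestamp uses the form YYYYMMDDhhmmss;
// a shorter prefix such as "2009" or "20090101" is also accepted by the API.
function buildAvailabilityUrl(url: string, timestamp?: string): string {
  let query = `https://archive.org/wayback/available?url=${encodeURIComponent(url)}`;
  if (timestamp) {
    query += `&timestamp=${encodeURIComponent(timestamp)}`;
  }
  return query;
}

// Return the snapshot URL, or undefined when the API reports no snapshot.
function closestSnapshotUrl(response: AvailabilityResponse): string | undefined {
  const closest = response.archived_snapshots.closest;
  return closest && closest.available ? closest.url : undefined;
}

const exampleQuery = buildAvailabilityUrl("http://www.sqlserver.org.au/", "20090101");
// exampleQuery is
// "https://archive.org/wayback/available?url=http%3A%2F%2Fwww.sqlserver.org.au%2F&timestamp=20090101"

// Illustrative response only, not captured from a live request.
const exampleResponse: AvailabilityResponse = {
  archived_snapshots: {
    closest: {
      available: true,
      url: "http://web.archive.org/web/20090217192805/http://www.sqlserver.org.au:80/",
      timestamp: "20090217192805",
      status: "200",
    },
  },
};
const exampleSnapshot = closestSnapshotUrl(exampleResponse);
```

When the API has no snapshot for a URL, `archived_snapshots` comes back as an empty object, which is the case the `undefined` return covers.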

The input file is assumed to be in the JSON format produced by lychee (for example, as generated by the lycheeverse/lychee-action GitHub action).

If the replacements-path input property is set, the results are written to that path as a JSON file. The data includes a list (missing) of any URLs that did not have snapshots, and a list (replacements) of pairs, each containing the original URL and a corresponding Wayback Machine URL. You could use this data to do a find/replace in the source document(s).

The missing and replacements values are also set as output properties.

If a timestamp is supplied to the Wayback Machine API, the snapshot closest to that date is returned. The timestamp is obtained by applying a regular expression to the input URL. The regular expression must define at least a named group called year, and may optionally define named groups month and day. See the tests for an example.
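To illustrate the named-group requirement, here is a minimal sketch of turning a regex match into a timestamp. It is not the action's actual implementation. The function name `extractTimestamp` and the example pattern are hypothetical, JavaScript-style named groups (`(?<year>...)`) are assumed, and passing only the parts that matched (rather than padding missing ones) is also an assumption.

```typescript
// Build a Wayback Machine timestamp (YYYY, YYYYMM or YYYYMMDD) from the
// named groups year, month and day. Returns undefined when the pattern
// does not match or does not capture a year.
function extractTimestamp(input: string, pattern: RegExp): string | undefined {
  const groups = pattern.exec(input)?.groups;
  if (!groups || !groups.year) {
    return undefined;
  }
  let timestamp = groups.year;
  if (groups.month) {
    timestamp += groups.month.padStart(2, "0");
    // A day is only meaningful when a month was captured too.
    if (groups.day) {
      timestamp += groups.day.padStart(2, "0");
    }
  }
  return timestamp;
}

// Hypothetical pattern for URLs that carry a date in their path.
// It is built from a string, as a regex supplied through a workflow input would be.
const datePattern = new RegExp("archive/(?<year>\\d{4})/(?<month>\\d{2})/(?<day>\\d{2})/");

const exampleTimestamp = extractTimestamp(
  "http://weblogs.asp.net/astopford/archive/2006/12/07/mbunit-2-3-rtm.aspx",
  datePattern
);
// exampleTimestamp is "20061207"
```

A pattern that only defines year, such as `(?<year>\d{4})`, would yield `"2006"` for the same input.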

Motivation

I was looking at adding automated accessibility checking to my blog. In doing this, I discovered I'd accrued a lot of broken links over the years, and these seemed to be affecting the accessibility scanning tool. Adding lycheeverse/lychee-action was useful, but it generated a long list of broken links, and I figured there must be a way to automate fixing many of them.

Usage

- uses: flcdrg/wayback-machine-query-action@v1
  id: wayback
  with:
    source-path:        # path to input JSON file
    replacements-path:  # (Optional) path to output JSON file
    timestamp-regex:    # (Optional) regex to extract a timestamp from the input URL

Sample input file

{
  "error_map": {
    "dist/2006/12/mbunit-23-rtm.html": [
      {
        "url": "http://weblogs.asp.net/astopford/archive/2006/12/07/mbunit-2-3-rtm.aspx",
        "status": {
          "text": "Network error",
          "code": 404
        }
      }
    ],
    "dist/2005/09/dr-neil-touring-australia.html": [
      {
        "url": "http://blogs.msdn.com/charles_sterling/archive/2005/09/22/472684.aspx",
        "status": {
          "text": "Network error",
          "code": 404
        }
      },
      {
        "url": "http://www.dotnetsolutions.com.au/xp.aspx",
        "status": {
          "text": "Network error",
          "code": 404
        }
      }
    ]
  }
}
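As a rough sketch of how this structure can be consumed, the following collects the distinct broken URLs from an `error_map`. The type and function names (`LycheeEntry`, `LycheeReport`, `collectBrokenUrls`) are hypothetical and only cover the fields shown in the sample above. It is not taken from this action's source.

```typescript
// Only the fields that appear in the sample input are modelled here.
interface LycheeEntry {
  url: string;
  status: { text: string; code?: number };
}

interface LycheeReport {
  error_map: Record<string, LycheeEntry[]>;
}

// Gather each broken URL once, even if several source files link to it.
function collectBrokenUrls(report: LycheeReport): string[] {
  const urls = new Set<string>();
  for (const entries of Object.values(report.error_map)) {
    for (const entry of entries) {
      urls.add(entry.url);
    }
  }
  return Array.from(urls);
}

// A trimmed copy of the sample input above.
const sampleReport: LycheeReport = {
  error_map: {
    "dist/2006/12/mbunit-23-rtm.html": [
      {
        url: "http://weblogs.asp.net/astopford/archive/2006/12/07/mbunit-2-3-rtm.aspx",
        status: { text: "Network error", code: 404 },
      },
    ],
    "dist/2005/09/dr-neil-touring-australia.html": [
      {
        url: "http://www.dotnetsolutions.com.au/xp.aspx",
        status: { text: "Network error", code: 404 },
      },
    ],
  },
};

const brokenUrls = collectBrokenUrls(sampleReport);
// brokenUrls holds the two URLs, in the order they appear in the report
```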

Sample output file

{
  "missing": [
    "http://www.sqlsnapshots.com/SQLSnapshotsMP3Feed.xml"
  ],
  "replacements": [
    {
      "find": "http://www.microsoft.com/windowsserver2003/technologies/networking/ipsec/default.mspx#EGAA",
      "replace": "http://web.archive.org/web/20050404133316/http://www.microsoft.com:80/windowsserver2003/technologies/networking/ipsec/default.mspx"
    },
    {
      "find": "http://www.sqlserver.org.au/",
      "replace": "http://web.archive.org/web/20090217192805/http://www.sqlserver.org.au:80/"
    },
    {
      "find": "http://www.microsoft.com/resources/documentation/windowsnt/4/server/reskit/en-us/reskt4u4/rku4list.mspx?mfr=true",
      "replace": "http://web.archive.org/web/20091206231445/http://www.microsoft.com:80/resources/documentation/windowsnt/4/server/reskit/en-us/reskt4u4/rku4list.mspx?mfr=true"
    }
  ]
}
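The find/replace step is left to you. One possible approach, sketched here under stated assumptions, is a single pass over the document text so that a replacement is never rewritten by a later entry. The names `Replacement`, `escapeRegExp` and `applyReplacements` are hypothetical and not part of this action.

```typescript
interface Replacement {
  find: string;
  replace: string;
}

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace every occurrence of each "find" URL with its "replace" URL in one pass.
function applyReplacements(text: string, replacements: Replacement[]): string {
  if (replacements.length === 0) {
    return text;
  }
  const lookup = new Map<string, string>();
  for (const { find, replace } of replacements) {
    lookup.set(find, replace);
  }
  // Longest URLs first, so a URL that is a prefix of another does not win the match.
  const alternation = Array.from(lookup.keys())
    .sort((a, b) => b.length - a.length)
    .map(escapeRegExp)
    .join("|");
  // The callback returns the replacement literally, so "$" sequences are not interpreted.
  return text.replace(new RegExp(alternation, "g"), (match) => lookup.get(match) ?? match);
}

const updatedHtml = applyReplacements(
  '<a href="http://www.sqlserver.org.au/">SQL Server User Group</a>',
  [
    {
      find: "http://www.sqlserver.org.au/",
      replace: "http://web.archive.org/web/20090217192805/http://www.sqlserver.org.au:80/",
    },
  ]
);
// updatedHtml is
// '<a href="http://web.archive.org/web/20090217192805/http://www.sqlserver.org.au:80/">SQL Server User Group</a>'
```

This does plain substring matching, so a `find` URL that is a prefix of some other, unlisted URL in the document would also be rewritten. Matching on the full attribute value is safer if that matters for your content.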

Breaking changes

Lychee 0.18.0 changed its JSON schema: fail_map was renamed to error_map. Version 4 of this action was updated to follow this change, so earlier versions of the action expect fail_map.