Keep technology detections up to date

WebPageTest integrates with Wappalyzer to get the list of detection rules and run them through the detection engine during each test. The results are parsed from the HAR and written to the `technologies` dataset in the Dataflow pipeline. IIUC the biggest challenge is keeping the engine up to date because it needs to be reimplemented for WebPageTest's environment; it's less of an issue to keep the detection rules up to date for each technology since it's a simple JSON schema. Still, WebPageTest needs to periodically check for updates and stay in sync.

When technology detections are outdated or broken, several HTTP Archive dependencies are affected. Many Web Almanac chapters segment by technologies, like JS, CSS, CMS, and Ecommerce. Additionally, the Core Web Vitals Technology Report is a direct visualization layer on top of the output of the detections, so any bugs would be immediately visible there.

Similar to HTTPArchive/data-pipeline#30, the WebPageTest repo can use automation like GitHub Actions and/or dependabot to keep the rules in sync. But the engine will be much harder because it requires manual integration. At a minimum we need to know when the engine is out of date, using something like a file watcher on the engine's [source code](https://github.com/AliasIO/wappalyzer/tree/master/src). I think it'd be worth connecting with @AliasIO and @tkadlec to brainstorm more reliable ways to keep Wappalyzer and WebPageTest in sync.

There's also more we can do on the HTTP Archive side to try to catch any anomalies late in the pipeline. While it'd be too late to fix broken detections, this should hopefully alert us to the bugs so that we can get them fixed before the next crawl. One thing we can do is look at a subset of individual pages with known technologies, and assert that they're detected correctly month after month. We could also look at the adoption rates in aggregate and flag anything anomalous like a steep rise or drop. The individual page approach has the benefit of being able to alert us ASAP before the crawl is even complete, but it does require some manual curation and upkeep. Not only can these approaches catch bugs arising from version skew across projects, but they can also help catch bugs in the rules/engine itself.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Keep technology detections up to date #70

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Keep technology detections up to date #70

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions