[Feature]: Automatically run QA after crawl #2337

rien333 · 2025-01-27T11:04:01Z

What change would you like to see?

Currently, users need to manually start the quality assurance after a crawl has finished.

There's something to be said for creating a toggle that allows users to auto-run the QA as part of a "workflow" (i.e. after a crawl has finished). Some key benefits include:

Benefit 1: Enforcing good practices

Even though the feature is still in beta, QA has already helped us to identify and fix crawl mismatches that otherwise may have flown under the radar. As such, "we" — the archive that employs me — would like to run QA for basically all of our crawls, with the possible exception of test runs/experiments. My initial idea was to make running QA part of our internal webarchiving ruleset. However, simply channeling everyone into "doing the right thing™" by having an option to auto-enable QA runs after crawls seems way less prone to human error.

Benefit 2: Reducing false positives

As this blog post by the UK Web Archive team also argues, QA is somewhat time sensitive:

the [QA] comparison with the live website should be done soon after crawling has completed, otherwise you may be conducting a comparison on a URL where the content has changed significantly to that which was initially crawled.

However, some workflows may take hours. Making sure that QA starts soon after the crawl has finished therefore requires a lot of babysitting; it would be more time efficient if this process could be started automatically. Starting QA late also risks reporting crawl mismatches that are not true mismatches, that is, false positives.

UI Design

Whether or not this option needs to be part of the UI is debatable. Personally, I would be okay with a toggle in a Helm chart. I also think that whether QA auto-starts by default should be configurable in some way, since not everyone may have a need for this.
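To make the request concrete, here is a rough sketch of the behaviour I have in mind. Everything in it is hypothetical and for illustration only: `Workflow`, `auto_run_qa`, `on_crawl_finished`, the `"complete"` state name, and the `start_qa` callback are made-up names, not Browsertrix's actual API or data model. The flag could just as well be fed from a deployment-level value such as a Helm chart setting.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Workflow:
    """Hypothetical workflow record; not Browsertrix's real data model."""
    name: str
    auto_run_qa: bool = False  # off by default, since not everyone needs it


def on_crawl_finished(
    workflow: Workflow,
    crawl_id: str,
    crawl_state: str,
    start_qa: Callable[[str], None],
) -> bool:
    """Start a QA run right after a crawl finishes, if the toggle is on.

    Returns True when a QA run was started.
    """
    # Only finished crawls are analysed; "complete" is an assumed state name.
    if crawl_state != "complete":
        return False
    if not workflow.auto_run_qa:
        return False
    start_qa(crawl_id)
    return True


# Usage: collect the crawl IDs for which QA would have been started.
started: List[str] = []
on_crawl_finished(Workflow("news", auto_run_qa=True), "crawl-1", "complete", started.append)
on_crawl_finished(Workflow("experiment"), "crawl-2", "complete", started.append)
print(started)  # ['crawl-1']
```

With the toggle left off for test runs and experiments, nobody has to remember to start QA by hand for the regular crawls.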

Context

See also point 8 of #2336, and the heading "Quality Assurance" in a recent blog post by the UK Web Archive team.

Shrinks99 · 2025-01-28T04:22:45Z

QA has already helped us to identify and fix crawl mismatches that otherwise may have flown under the radar.

Hey! This is incredibly nice to read. We're all excited when we hear about wins from our QA system, and we spent a long time trying to get it right :)

I hear you, and having a toggleable option for "run QA when workflow completes" does seem like a good idea for all the reasons you mention. However, I would like us to first address running QA on a subset of pages, to deal with some of the other issues you mention (some crawls are big, and running QA on every page may have diminishing returns versus the compute cost). I would absolutely want to see this as a feature within the UI, as it is something you would want to configure on a per-workflow basis (with the option to set it for all workflows using an org's default template, of course).
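As a rough illustration of how those two pieces could fit together, here is a minimal sketch. It assumes a per-workflow setting that can be left unset to inherit the org default, and a simple seeded random sample to cap the number of pages analysed. The names `resolve_auto_qa` and `pick_qa_pages` and the sampling strategy are hypothetical, not how Browsertrix works today.

```python
import random
from typing import List, Optional


def resolve_auto_qa(workflow_setting: Optional[bool], org_default: bool) -> bool:
    """A per-workflow value wins; None means 'inherit the org default'."""
    return org_default if workflow_setting is None else workflow_setting


def pick_qa_pages(page_urls: List[str], max_pages: int, seed: int = 0) -> List[str]:
    """Pick at most max_pages pages for QA to cap compute cost on big crawls."""
    if max_pages <= 0:
        return []
    if len(page_urls) <= max_pages:
        return list(page_urls)
    # Seeded RNG so the same crawl yields the same sample on re-runs.
    rng = random.Random(seed)
    return rng.sample(page_urls, max_pages)


# Usage: a 1000-page crawl whose workflow inherits an org default of "on".
pages = [f"https://example.com/page/{i}" for i in range(1000)]
if resolve_auto_qa(workflow_setting=None, org_default=True):
    subset = pick_qa_pages(pages, max_pages=50)
    print(len(subset))  # 50
```

A plain random sample is only one option; picking seed pages or the most-linked pages first would be equally valid ways to choose the subset.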

As this blog post by the UK Web Archive team also argues, QA is somewhat time sensitive

I'll also take this opportunity to issue a correction, as the UK Web Archive blog is mistaken on this point. Running QA does not compare the pages against their live versions; rather, it compares the result of the replay against the screenshots taken and text extracted at the time of crawling. You can run QA long after the fact and it should deliver largely the same results! Any changes should be due to improvements (or regressions) in ReplayWeb.page, or in Browsertrix Crawler and the version of the browser it is using (if changed).
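To illustrate the idea for the text side of that comparison, here is a minimal sketch. It is a stand-in only: `text_match_score` is a hypothetical helper, and the use of Python's `difflib.SequenceMatcher` is an assumption made for the example, not the metric the QA system actually computes. The point is that both inputs come from the archive (text stored at crawl time and text extracted from the replay), so the live site is never consulted.

```python
from difflib import SequenceMatcher


def text_match_score(crawl_text: str, replay_text: str) -> float:
    """Similarity in [0, 1] between text extracted at crawl time and at replay."""
    # Normalise whitespace so layout-only differences do not count as mismatches.
    a = " ".join(crawl_text.split())
    b = " ".join(replay_text.split())
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


# Text saved when the page was crawled; the live page plays no role here.
stored = "Breaking news: archive launched"

print(text_match_score(stored, "Breaking  news:\narchive launched"))  # 1.0
print(text_match_score(stored, ""))  # 0.0
```

Because the stored text does not change after the crawl, running this a day or a year later gives the same score unless the replay itself changes.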

rien333 · 2025-01-28T08:48:29Z

Running QA does not compare the pages against their live versions, rather it compares the result of the replay against the screenshots taken and text extracted at the time of crawling

Ah darn. I vaguely recall reading it worked this way, but I must have promptly forgotten. Sorry for the noise.

rien333 added the enhancement New feature or request label Jan 27, 2025

github-project-automation bot added this to Webrecorder Projects Jan 27, 2025

github-project-automation bot moved this to Triage in Webrecorder Projects Jan 27, 2025
