[Feature]: Automatically run QA after crawl #2337

Open
rien333 opened this issue Jan 27, 2025 · 2 comments
Labels
enhancement New feature or request

Comments


rien333 commented Jan 27, 2025

What change would you like to see?

Currently, users need to manually start the quality assurance after a crawl has finished.

There's something to be said for creating a toggle that allows users to auto-run the QA as part of a "workflow" (i.e. after a crawl has finished). Some key benefits include:

Benefit 1: Enforcing good practices

Even though the feature is still in beta, QA has already helped us to identify and fix crawl mismatches that otherwise may have flown under the radar. As such, "we" (the archive that employs me) would like to run QA for basically all of our crawls, with the possible exception of test runs and experiments. My initial idea was to make running QA part of our internal web archiving ruleset. However, simply channeling everyone into "doing the right thing™" by offering an option to automatically run QA after crawls seems far less prone to human error.

Benefit 2: Reducing false positives

As this blog post by the UK Web Archive team also argues, QA is somewhat time sensitive:

the [QA] comparison with the live website should be done soon after crawling has completed, otherwise you may be conducting a comparison on a URL where the content has changed significantly to that which was initially crawled.

However, some workflows may take hours. Ensuring that QA starts soon after the crawl has finished then requires a lot of babysitting; it would be more time-efficient if QA could be started automatically. Starting QA late also risks reporting crawl mismatches that are not true mismatches, that is, false positives.
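
For illustration, the babysitting described above amounts to a polling loop like the one below. This is only a minimal sketch and not Browsertrix's API: `get_crawl_state` and `start_qa` are hypothetical callables that someone would have to implement against their own deployment, and the state names are assumptions.

```python
import time
from typing import Callable

# Assumed state names; a real deployment may report different ones.
FINISHED_STATES = {"complete"}
FAILED_STATES = {"failed", "canceled"}


def run_qa_when_finished(
    get_crawl_state: Callable[[], str],   # hypothetical: returns the crawl's current state
    start_qa: Callable[[], None],         # hypothetical: kicks off a QA run
    poll_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the crawl ends; start QA only if it finished successfully.

    Returns True if QA was started, False if the crawl failed or was canceled.
    """
    while True:
        state = get_crawl_state()
        if state in FINISHED_STATES:
            start_qa()
            return True
        if state in FAILED_STATES:
            return False
        # Crawl still in progress: wait before checking again.
        sleep(poll_seconds)
```

A built-in "run QA when the workflow completes" option would make an external loop like this unnecessary.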

UI Design

Whether or not this option needs to be part of the UI is debatable. Personally, I would be okay with a toggle in a Helm chart. I also think that auto-starting QA should be configurable in some way, since not everyone may need it.

Context

See also point 8 of #2336, and the heading "Quality Assurance" in a recent blog post by the UK Web Archive team.

Member

Shrinks99 commented Jan 28, 2025

QA has already helped us to identify and fix crawl mismatches that otherwise may have flown under the radar.

Hey! This is incredibly nice to read, we're all excited when we hear about wins from our QA system! We spent a long time trying to get it right :)

I hear you, and having a toggleable option for "run QA when workflow completes" does seem like a good idea for all the reasons you mention. However, I would like us to first address running QA on a subset of pages, to deal with some of the other issues you mention (some crawls are big, and full QA may have diminishing returns relative to the compute cost). I would absolutely want to see this as a feature within the UI, as it is something you would want to configure on a per-workflow basis (with the option to set it for all workflows using an org's default template, of course).

As this blog post by the UK Web Archive team also argues, QA is somewhat time sensitive

I'll also take this opportunity to issue a correction as the UK Web Archive blog is mistaken on this point. Running QA does not compare the pages against their live versions, rather it compares the result of the replay against the screenshots taken and text extracted at the time of crawling. You can run QA long after the fact and it should deliver largely the same results! Any changes should be due to improvements (or regressions) from ReplayWeb.page, or Browsertrix Crawler and the version of the browser it is using (if changed).
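
To make the distinction concrete, here is a minimal sketch of a text comparison that only uses data captured at crawl time. It is not Browsertrix's actual implementation: the function names, the use of Python's `difflib`, and the threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def text_match_score(crawl_text: str, replay_text: str) -> float:
    """Similarity in [0, 1] between text extracted at crawl time and on replay."""
    # Normalise whitespace so layout-only differences do not count as mismatches.
    a = " ".join(crawl_text.split())
    b = " ".join(replay_text.split())
    return SequenceMatcher(None, a, b).ratio()


def is_mismatch(crawl_text: str, replay_text: str, threshold: float = 0.9) -> bool:
    """Flag a page when the replayed text drifts too far from the crawl-time text.

    The 0.9 threshold is an arbitrary illustrative value.
    """
    return text_match_score(crawl_text, replay_text) < threshold
```

Neither input depends on the live site, so the score does not change with how long after the crawl the comparison runs, as long as the replay and crawler versions stay the same.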

Author

rien333 commented Jan 28, 2025

Running QA does not compare the pages against their live versions, rather it compares the result of the replay against the screenshots taken and text extracted at the time of crawling

Ah darn. I vaguely recall reading that it worked this way, but I must have promptly forgotten. Sorry for the noise.

Projects
Status: Triage