Retry Improvements + Rate Limit Support #758

Open
ikreymer opened this issue Feb 7, 2025 · 0 comments
ikreymer commented Feb 7, 2025

Following up on #132 (and also #392, #360), we need a more sophisticated retry strategy, including deciding what to do with rate-limiting status codes.
We already have --failOnInvalidStatus, --maxPageRetries, --failOnFailedSeed, and --failOnFailedLimit, and will probably need to add a few more flags.

This is getting slightly messy, but hopefully there's a clear path to figure this out.

There are a few options to consider:

  • Which status codes should be counted as page failures, for purposes of ending the crawl?
  • Which status codes should result in retrying the page?
  • Should capture of pages with invalid status codes be skipped when they will be retried?
  • Which status codes should result in slowing down the crawl / adding a delay before loading those pages again, if retrying?
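These decisions could be sketched as a single classification step per response. The shape below is purely illustrative (the field names, the default retry codes, and the `classifyStatus` helper are assumptions, not existing crawler behavior):

```typescript
// Hypothetical decision for a single page response (not crawler API).
interface StatusDecision {
  failed: boolean;   // counts toward fail limits (e.g. --failOnFailedLimit)
  retry: boolean;    // eligible for another attempt
  slowDown: boolean; // add delay / backoff before retrying
}

// Assumed defaults: retry on 429/503, slow down on the same codes.
function classifyStatus(
  status: number,
  retryCodes: Set<number> = new Set([429, 503]),
): StatusDecision {
  const failed = status >= 400;
  const retry = failed && retryCodes.has(status);
  const slowDown = status === 429 || status === 503;
  return { failed, retry, slowDown };
}
```

Keeping the retryable set configurable (here via the `retryCodes` parameter) mirrors how a `--retryStatusCodes` flag could feed into this logic.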

It's probably useful to list the various use cases:

  • The crawler should treat 4xx and 5xx responses as failed, possibly with the option to customize which status codes are included.
  • The crawler should fail the crawl if a certain number of pages have failed or if any of the seeds have failed.
  • The crawler should retry failed pages a certain number of times, possibly customizing which status codes are eligible for retries.
  • The crawler should not write any data for pages that are being retried, until the final retry.
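The last two use cases (retry a bounded number of times, only write data on the final attempt) can be sketched as a simple loop. This is a synchronous simplification with hypothetical `fetchPage`/`capture` stand-ins, not the crawler's actual page-loading code:

```typescript
// Illustrative sketch: retry up to maxRetries times, deferring WARC capture
// until an attempt succeeds or until the final attempt.
function loadWithRetries(
  url: string,
  maxRetries: number,                  // cf. --maxPageRetries
  fetchPage: (url: string) => number,  // returns status code (stand-in)
  capture: (url: string) => void,      // writes data to WARC (stand-in)
): boolean {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const status = fetchPage(url);
    if (status < 400) {
      capture(url); // success: always capture
      return true;
    }
    if (attempt === maxRetries) {
      capture(url); // final attempt: record the failed response
      return false;
    }
    // otherwise: skip capture and retry
  }
  return false;
}
```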

With this in mind, we should probably add at least:

  • A --retryStatusCodes flag indicating which status codes will be retried.
  • Is there a need to also specify --invalidStatusCodes separately from --retryStatusCodes? Leaning against it.
  • Is there a need to also specify whether failed pages that are being retried should be captured to WARC? Also leaning against it, since retries are part of the capture process.
  • How to handle rate limiting, e.g. adding exponential backoff via pageExtraDelay for certain status codes (429, 503, maybe 403), possibly using the Retry-After header, if available (from Slow down + retry on HTTP 429 errors #392)
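The rate-limiting point could look something like the following: honor Retry-After when the server sends it (as either seconds or an HTTP-date, per the header's spec), otherwise fall back to capped exponential backoff. The function name and defaults are assumptions for illustration:

```typescript
// Sketch: compute a delay before retrying a rate-limited page.
function backoffDelayMs(
  attempt: number,                  // 0-based retry attempt
  retryAfterHeader: string | null,  // raw Retry-After header value, if any
  baseMs = 1000,
  maxMs = 60_000,
): number {
  if (retryAfterHeader) {
    // Retry-After as delay-seconds
    const seconds = Number(retryAfterHeader);
    if (!Number.isNaN(seconds)) {
      return Math.min(seconds * 1000, maxMs);
    }
    // Retry-After as an HTTP-date
    const date = Date.parse(retryAfterHeader);
    if (!Number.isNaN(date)) {
      return Math.min(Math.max(date - Date.now(), 0), maxMs);
    }
  }
  // No usable header: exponential backoff, capped
  return Math.min(baseMs * 2 ** attempt, maxMs);
}
```

Capping the delay keeps a misbehaving server (or a huge Retry-After value) from stalling a worker indefinitely, which matters when the same backoff feeds into something like pageExtraDelay.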
@ikreymer ikreymer changed the title Retry + Rate Limit Improvements Retry Improvements + Rate Limit Support Feb 7, 2025