Following up on #132 (and also #392, #360), we need a more sophisticated retry strategy, including deciding what to do with rate-limiting status codes.
We already have --failOnInvalidStatus, --maxPageRetries, --failOnFailedSeed, and --failOnFailedLimit, and we probably need to add a few more flags.
This is getting slightly messy, but hopefully there's a clear path to figure this out.
There are a few options to consider:
- Which status codes should be counted as page failures, for purposes of ending the crawl?
- Which status codes should result in retrying the page?
- Should capture of pages with invalid status codes be skipped when they will be retried?
- Which status codes should result in slowing down the crawl / adding a delay before those pages are loaded again on retry?
It's probably useful to list the various use cases:
- The crawler should treat 4xx and 5xx responses as failed, possibly with customization of which status codes are included.
- The crawler should fail the crawl if a certain number of pages have failed or if any of the seeds have failed.
- The crawler should retry failed pages a certain number of times, possibly with customization of which status codes are eligible for retries.
- The crawler should not write any data for pages that are being retried, until the final retry.
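The use cases above can be sketched as two small predicates. This is a hypothetical sketch, not the crawler's actual API: the function names (isFailure, shouldRetry) and the default retry set are assumptions for illustration.

```typescript
// Illustrative default: retry on rate limiting and transient server errors.
// (An actual --retryStatusCodes flag could override this set.)
const DEFAULT_RETRY_CODES = new Set([429, 500, 502, 503, 504]);

// Treat all 4xx and 5xx responses as page failures.
function isFailure(status: number): boolean {
  return status >= 400;
}

// A failed page is retried only if retries remain and its status code
// is in the configured retry set.
function shouldRetry(
  status: number,
  retriesLeft: number,
  retryCodes: Set<number> = DEFAULT_RETRY_CODES,
): boolean {
  return retriesLeft > 0 && retryCodes.has(status);
}
```

Keeping the "failed" test separate from the "retryable" test means a page can count toward --failOnFailedLimit even when its status code is not worth retrying (e.g. a plain 404).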
With this in mind, we should probably add at least:
- A --retryStatusCodes flag which indicates which status codes will be retried.
Is there a need to also specify an --invalidStatusCodes list separate from --retryStatusCodes? Leaning against it.
Is there a need to also specify whether failed pages that are being retried should be captured to WARC? Sort of leaning against it as well, since retries are part of the capture process.
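For the proposed --retryStatusCodes flag, one question is what value syntax to accept. A sketch, assuming a comma-separated list with optional ranges (the range syntax "500-504" is an assumption, not a decided format):

```typescript
// Hypothetical parser for a --retryStatusCodes value like "429,500-504".
function parseStatusCodes(spec: string): Set<number> {
  const codes = new Set<number>();
  for (const part of spec.split(",")) {
    // Each part is either a single code ("429") or a range ("500-504")
    const [lo, hi] = part.split("-").map((s) => parseInt(s.trim(), 10));
    for (let c = lo; c <= (hi ?? lo); c++) {
      codes.add(c);
    }
  }
  return codes;
}
```

Supporting ranges would keep command lines short for the common "retry all 5xx" case (--retryStatusCodes 500-599).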
How should rate limiting be handled? E.g. add exponential backoff via pageExtraDelay for certain status codes, like 429, 503, and maybe 403. Possibly using Retry-After, if available (from Slow down + retry on HTTP 429 errors #392).
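The backoff idea might look like the following. This is a sketch under assumptions: the function name and defaults are illustrative, and only the delta-seconds form of Retry-After is handled (the header can also carry an HTTP-date, omitted here).

```typescript
// Compute a delay (in seconds) before retrying a rate-limited page.
// Prefers the server's Retry-After hint; otherwise uses capped
// exponential backoff: base * 2^attempt, up to maxSecs.
function backoffDelaySecs(
  attempt: number,
  retryAfter?: string,
  baseSecs = 10,
  maxSecs = 300,
): number {
  const hinted = retryAfter !== undefined ? parseInt(retryAfter, 10) : NaN;
  if (!Number.isNaN(hinted)) {
    // Honor Retry-After, but never wait longer than the cap
    return Math.min(hinted, maxSecs);
  }
  return Math.min(baseSecs * 2 ** attempt, maxSecs);
}
```

Capping the delay keeps a misbehaving server from stalling the crawl indefinitely, while still respecting reasonable Retry-After hints.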