
Add backoff strategy to auto retry of failing CI jobs #289

Open
erikmd opened this issue Jul 12, 2023 · 7 comments
Labels
bug Something isn't working

Comments

erikmd (Member) commented Jul 12, 2023

Hi @Zimmi48,

FYI, despite the line https://github.com/coq/bot/blob/master/src/actions.ml#L545,
across the 49 jobs of docker-mathcomp https://gitlab.inria.fr/math-comp/docker-mathcomp/-/pipelines/833107/
there are always a dozen jobs that fail and that we have to retry by hand;
cf. the list of all jobs https://gitlab.inria.fr/math-comp/docker-mathcomp/-/pipelines/833107/builds

Three questions:

  1. can you verify that the rule || test "unexpected status code .*: 401 Unauthorized" has indeed been used according to the logs, e.g. after this job deploy_1_2.0.0-coq-8.16 https://gitlab.inria.fr/math-comp/docker-mathcomp/-/jobs/3279705
  2. if yes, what happens if coqbot has reached the retry threshold and we manually restart the job in GitLab CI? → I'd expect the retry count to be reinitialized, so that coqbot can retry anew as if no failure had occurred beforehand.
  3. in any case, I guess we could implement some exponential backoff strategy, maybe this algorithm?

With kind regards

Zimmi48 (Member) commented Jul 12, 2023

I can indeed see a bunch of lines like this in the log:

Jul 11 21:28:30Z Failed job 3279753 of project 44938.
Jul 11 21:28:30Z Failure reason: runner_system_failure
Jul 11 21:28:30Z Runner failure reported by GitLab CI... Checking whether to retry the job.
Jul 11 21:28:30Z The job has been retried less than three times before (number of retries = 0). Retrying...

Then:

Jul 11 21:39:28Z Connectivity issue... Checking whether to retry the job.
Jul 11 21:39:29Z The job has been retried less than three times before (number of retries = 1). Retrying...
Jul 11 21:39:29Z Response code: 201.

Then:

Jul 11 22:13:39Z Connectivity issue... Checking whether to retry the job.
Jul 11 22:13:39Z The job has been retried less than three times before (number of retries = 2). Retrying...
Jul 11 22:13:40Z Response code: 201.

Then:

Jul 11 22:43:37Z Connectivity issue... Checking whether to retry the job.
Jul 11 22:43:38Z The job has been retried 3 times before. Not retrying.

if yes, what happens if coqbot has reached the retry threshold and we manually restart the job in GitLab CI? → I'd expect the retry count to be reinitialized, so that coqbot can retry anew as if no failure had occurred beforehand.

The retry threshold is not reinitialized because it is a count that is provided by GitLab. It doesn't distinguish who did the retry. However, adopting an exponential backoff strategy based on the number of retries should be quite doable. In this case, we may be able to increase the maximum number of retries.
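For illustration, here is a minimal sketch of what "backoff based on the number of retries" could mean (hypothetical names and formula, not the bot's actual code): the sleep time is derived directly from GitLab's retry count, giving 5s, 25s and 125s for retry counts 0, 1 and 2.

(* Hypothetical sketch: derive the sleep time from GitLab's retry count. *)
let backoff_delay ~num_retries = 5. *. (5. ** float_of_int num_retries)

let () =
  List.iter
    (fun n ->
      Printf.printf "retry %d -> wait %.0fs\n" n (backoff_delay ~num_retries:n) )
    [0; 1; 2]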

I don't think that it is necessary to write an algorithm as complex as the one proposed by Google though. We do have some exponential backoff strategies in place in other parts of the code already:

bot/src/actions.ml

Lines 2594 to 2621 in 77b3898

let rec adjust_milestone ~bot_info ~issue ~sleep_time =
  (* We implement an exponential backoff strategy to try again after
     5, 25, and 125 seconds, if the issue was closed by a commit not
     yet associated to a pull request or if we couldn't find the close
     event. *)
  GitHub_queries.get_issue_closer_info ~bot_info issue
  >>= function
  | Ok (ClosedByPullRequest result) ->
      GitHub_mutations.reflect_pull_request_milestone ~bot_info result
  | Ok ClosedByCommit when Float.(sleep_time > 200.) ->
      Lwt_io.print "Closed by commit not associated to any pull request.\n"
  | Ok NoCloseEvent when Float.(sleep_time > 200.) ->
      Lwt_io.printf "Error: no close event after 200 seconds.\n"
  | Ok (ClosedByCommit | NoCloseEvent) ->
      (* May be worth trying again later. *)
      Lwt_io.printf
        "Closed by commit not yet associated to any pull request or no close \
         event yet...\n\
        \ Trying again in %f seconds.\n"
        sleep_time
      >>= (fun () -> Lwt_unix.sleep sleep_time)
      >>= fun () ->
      adjust_milestone ~issue ~sleep_time:(sleep_time *. 5.) ~bot_info
  | Ok ClosedByOther ->
      (* Not worth trying again *)
      Lwt_io.print "Not closed by pull request or commit.\n"
  | Error err ->
      Lwt_io.print (f "Error: %s\n" err)

erikmd (Member, Author) commented Jul 12, 2023

Thanks @Zimmi48 for your feedback! 👍👍

The retry threshold is not reinitialized because it is a count that is provided by GitLab.

OK, but then I don't see why it would be impossible to implement the feature I was suggesting:

assuming max_number_of_retry = 3 (retry after 5s, 25s, 125s, then stop) and count := 1 initially, one could have a predicate that makes it possible to retry manually and relaunch a new series of exponential backoff, e.g.:

let should_auto_retry count =
  (count - 1) mod (max_number_of_retry + 1) < max_number_of_retry

as a result:

failed with count | should_auto_retry?
                1 | true
                2 | true
                3 | true
                4 | false (manual retry needed)
                5 | true
                6 | true
                7 | true
                8 | false
                … | …

IMO, it would be great to be able to combine this modular-arithmetic idea with the (5s, 25s, 125s) backoff strategy you mention.
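For instance, a sketch of such a combination (only an illustration: the delay formula and the assumption that count is the retry count reported by GitLab, starting at 1, are not coqbot's actual code):

(* Sketch: count starts at 1 for the first failure of a job. *)
let max_number_of_retry = 3

let should_auto_retry count =
  (count - 1) mod (max_number_of_retry + 1) < max_number_of_retry

(* The position of count within the current series (0, 1 or 2 when an
   automatic retry is allowed) selects the delay: 5s, 25s or 125s. *)
let backoff_delay count =
  let pos = (count - 1) mod (max_number_of_retry + 1) in
  5. *. (5. ** float_of_int pos)

let () =
  for count = 1 to 8 do
    if should_auto_retry count then
      Printf.printf "count %d: retry after %.0fs\n" count (backoff_delay count)
    else Printf.printf "count %d: manual retry needed\n" count
  done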

WDYT?

Zimmi48 (Member) commented Jul 12, 2023

Relying on modular arithmetic for this purpose is really smart 😮
Yes, I think what you propose would work!

erikmd (Member, Author) commented Jul 12, 2023

Thanks @Zimmi48 !

BTW, while we are at it: even if we are sure there will be no "looping",
maybe it would be safer to add another bound, to ensure that coqbot's automatic retries won't exceed a reasonable total.

e.g. if max_number_of_retry = 3 and max_number_of_retry_series = 4,

let should_auto_retry count =
  let c = count - 1 in
  let m = max_number_of_retry + 1 in
  let q, r = c / m, c mod m in
  r < max_number_of_retry && q < max_number_of_retry_series

but YMMV!
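For illustration (again just a sketch of the idea, not coqbot's code), with these two bounds the predicate allows four series of three automatic retries and then refuses to auto-retry for good:

(* Sketch: print the decision for the first retry counts. *)
let max_number_of_retry = 3
let max_number_of_retry_series = 4

let should_auto_retry count =
  let c = count - 1 in
  let m = max_number_of_retry + 1 in
  let q, r = c / m, c mod m in
  r < max_number_of_retry && q < max_number_of_retry_series

let () =
  for count = 1 to 18 do
    Printf.printf "count %2d -> %b\n" count (should_auto_retry count)
  done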

silene commented Sep 1, 2023

Copied from Zulip:

It is great that you have exponential backoff, but you are missing an important piece. You need to make the sleep length random. For example, if your current backoff is 25 seconds, you should not sleep for 25 seconds; you should sleep between 25 and 25*2=50 seconds. Otherwise, your 49 failing jobs will all wake up at the exact same time, and thus they will fail again.
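A sketch of what the randomized sleep could look like (illustrative only, using the standard library's Random module; not coqbot's code), drawing the actual delay uniformly from the interval [base, 2*base):

(* Full jitter over one doubling interval: with a base delay of 25s the
   actual sleep is drawn uniformly between 25s and 50s. *)
let jittered_delay base = base +. Random.float base

let () =
  Random.self_init () ;
  List.iter
    (fun base ->
      Printf.printf "base %.0fs -> sleep %.1fs\n" base (jittered_delay base) )
    [5.; 25.; 125.]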

Zimmi48 changed the title from "gitlab.inria.fr/math-comp/docker-mathcomp: the current backoff does not seem enough" to "Add backoff strategy to auto retry of failing CI jobs" on Feb 5, 2024
erikmd (Member, Author) commented Jul 2, 2024

FTR, today (2024-07-02) in a 1-job pipeline, gitlab.inria.fr failed with:

https://gitlab.inria.fr/math-comp/docker-mathcomp/-/jobs/4509247

(...)
$ pwd
/builds/2mk6rsew/0/math-comp/docker-mathcomp
$ /usr/bin/env bash -e -c ' # collapsed multi-line command
bash
Error response from daemon: Get "https://registry-1.docker.io/v2/": received unexpected HTTP status: 502 Bad Gateway

I restarted the job manually.

erikmd (Member, Author) commented Jul 13, 2024

FYI @Zimmi48: @palmskog and I got new 502 errors at docker pull time, with the message quoted above ↑

I believe this form of the error message is not covered in

bot/src/actions.ml

Lines 510 to 521 in 53ef64c

else if
  test "transfer closed with outstanding read data remaining"
  || test "HTTP request sent, awaiting response... 50[0-9]"
  || test "The requested URL returned error: 502"
  (*|| test "[Tt]he remote end hung up unexpectedly"*)
  (* Can happen with (actual) issues with overlays. *)
  || test "error: unable to download 'https://cache.nixos.org/"
  || test "fatal: unable to access .* Couldn't connect to server"
  || test "fatal: unable to access .* Could not resolve host"
  || test "Resolving .* failed: Temporary failure in name resolution"
  || test "unexpected status code .*: 401 Unauthorized"
then Retry "Connectivity issue"

do you think we should add
|| test "received unexpected HTTP status: 502"
or
|| test "received unexpected HTTP status: 502.*"
in coq/bot?
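To double-check that either pattern would match the observed error line, here is a quick standalone test using OCaml's Str library (an illustration only; the bot's test helper may use a different regexp engine and syntax):

(* Check the two candidate patterns against the Docker error message quoted
   above. Requires linking the str library. *)
let () =
  let line =
    "Error response from daemon: Get \"https://registry-1.docker.io/v2/\": \
     received unexpected HTTP status: 502 Bad Gateway"
  in
  let matches pattern =
    try
      ignore (Str.search_forward (Str.regexp pattern) line 0) ;
      true
    with Not_found -> false
  in
  Printf.printf "pattern 1: %b\n" (matches "received unexpected HTTP status: 502") ;
  Printf.printf "pattern 2: %b\n" (matches "received unexpected HTTP status: 502.*")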

Edit: I opened a PR with a more general regexp.
