
Add backoff strategy to auto retry of failing CI jobs #289

Open
erikmd opened this issue Jul 12, 2023 · 7 comments
Labels
bug Something isn't working

Comments

erikmd (Member) commented Jul 12, 2023

Hi @Zimmi48,

FYI, despite the line https://github.com/coq/bot/blob/master/src/actions.ml#L545,
across the 49 jobs of docker-mathcomp https://gitlab.inria.fr/math-comp/docker-mathcomp/-/pipelines/833107/
there are always a dozen jobs that fail and that we have to retry by hand;
cf. the list of all jobs https://gitlab.inria.fr/math-comp/docker-mathcomp/-/pipelines/833107/builds

Three questions:

  1. can you verify that the rule || test "unexpected status code .*: 401 Unauthorized" has indeed been used according to the logs, e.g. after this job deploy_1_2.0.0-coq-8.16 https://gitlab.inria.fr/math-comp/docker-mathcomp/-/jobs/3279705
  2. if yes, what happens if coqbot has reached the retry threshold and we manually restart the job in GitLab CI? → I'd expect the retry count to be reinitialized, so that coqbot can retry anew as if no failure had occurred beforehand.
  3. in any case, I guess we could implement some exponential backoff strategy, maybe this algorithm?

With kind regards

Zimmi48 (Member) commented Jul 12, 2023

I can indeed see a bunch of lines like this in the log:

Jul 11 21:28:30Z Failed job 3279753 of project 44938.
Jul 11 21:28:30Z Failure reason: runner_system_failure
Jul 11 21:28:30Z Runner failure reported by GitLab CI... Checking whether to retry the job.
Jul 11 21:28:30Z The job has been retried less than three times before (number of retries = 0). Retrying...

Then:

Jul 11 21:39:28Z Connectivity issue... Checking whether to retry the job.
Jul 11 21:39:29Z The job has been retried less than three times before (number of retries = 1). Retrying...
Jul 11 21:39:29Z Response code: 201.

Then:

Jul 11 22:13:39Z Connectivity issue... Checking whether to retry the job.
Jul 11 22:13:39Z The job has been retried less than three times before (number of retries = 2). Retrying...
Jul 11 22:13:40Z Response code: 201.

Then:

Jul 11 22:43:37Z Connectivity issue... Checking whether to retry the job.
Jul 11 22:43:38Z The job has been retried 3 times before. Not retrying.

if yes, what happens if coqbot has reached the retry threshold and we manually restart the job in GitLab CI? → I'd expect the retry count to be reinitialized, so that coqbot can retry anew as if no failure had occurred beforehand.

The retry threshold is not reinitialized because it is a count that is provided by GitLab. It doesn't distinguish who did the retry. However, adopting an exponential backoff strategy based on the number of retries should be quite doable. In this case, we may be able to increase the maximum number of retries.
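For illustration, here is a minimal sketch of what "backoff based on the number of retries" could mean (hypothetical names and formula, not the bot's actual code): the sleep time is derived directly from GitLab's retry count, giving 5s, 25s and 125s for retry counts 0, 1 and 2.

(* Hypothetical sketch: derive the sleep time from GitLab's retry count. *)
let backoff_delay ~num_retries = 5. *. (5. ** float_of_int num_retries)

let () =
  List.iter
    (fun n ->
      Printf.printf "retry %d -> wait %.0fs\n" n (backoff_delay ~num_retries:n) )
    [0; 1; 2]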

I don't think that it is necessary to write an algorithm as complex as the one proposed by Google though. We do have some exponential backoff strategies in place in other parts of the code already:

bot/src/actions.ml

Lines 2594 to 2621 in 77b3898

let rec adjust_milestone ~bot_info ~issue ~sleep_time =
  (* We implement an exponential backoff strategy to try again after
     5, 25, and 125 seconds, if the issue was closed by a commit not
     yet associated to a pull request or if we couldn't find the close
     event. *)
  GitHub_queries.get_issue_closer_info ~bot_info issue
  >>= function
  | Ok (ClosedByPullRequest result) ->
      GitHub_mutations.reflect_pull_request_milestone ~bot_info result
  | Ok ClosedByCommit when Float.(sleep_time > 200.) ->
      Lwt_io.print "Closed by commit not associated to any pull request.\n"
  | Ok NoCloseEvent when Float.(sleep_time > 200.) ->
      Lwt_io.printf "Error: no close event after 200 seconds.\n"
  | Ok (ClosedByCommit | NoCloseEvent) ->
      (* May be worth trying again later. *)
      Lwt_io.printf
        "Closed by commit not yet associated to any pull request or no close \
         event yet...\n\
        \ Trying again in %f seconds.\n"
        sleep_time
      >>= (fun () -> Lwt_unix.sleep sleep_time)
      >>= fun () ->
      adjust_milestone ~issue ~sleep_time:(sleep_time *. 5.) ~bot_info
  | Ok ClosedByOther ->
      (* Not worth trying again *)
      Lwt_io.print "Not closed by pull request or commit.\n"
  | Error err ->
      Lwt_io.print (f "Error: %s\n" err)

erikmd (Member, Author) commented Jul 12, 2023

Thanks @Zimmi48 for your feedback! 👍👍

The retry threshold is not reinitialized because it is a count that is provided by GitLab.

OK, but then I don't see why it would be impossible to implement the feature I was suggesting:

assuming max_number_of_retry = 3 (retry after 5s, 25s, 125s, then stop) and count := 1 initially, one could have a predicate that makes it possible to retry manually and relaunch a new series of exponential backoff, e.g.:

let should_auto_retry count =
  (count - 1) mod (max_number_of_retry + 1) < max_number_of_retry

as a result:

failed with count | should_auto_retry?
                1 | true
                2 | true
                3 | true
                4 | false (manual retry needed)
                5 | true
                6 | true
                7 | true
                8 | false
                … | …

IMO, it would be great to be able to combine this modular-arithmetic idea with the (5s, 25s, 125s) backoff strategy you mention.
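For instance, a sketch of such a combination (only an illustration: the delay formula and the assumption that count is the retry count reported by GitLab, starting at 1, are not coqbot's actual code):

(* Sketch: count starts at 1 for the first failure of a job. *)
let max_number_of_retry = 3

let should_auto_retry count =
  (count - 1) mod (max_number_of_retry + 1) < max_number_of_retry

(* The position of count within the current series (0, 1 or 2 when an
   automatic retry is allowed) selects the delay: 5s, 25s or 125s. *)
let backoff_delay count =
  let pos = (count - 1) mod (max_number_of_retry + 1) in
  5. *. (5. ** float_of_int pos)

let () =
  for count = 1 to 8 do
    if should_auto_retry count then
      Printf.printf "count %d: retry after %.0fs\n" count (backoff_delay count)
    else Printf.printf "count %d: manual retry needed\n" count
  done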

WDYT?

Zimmi48 (Member) commented Jul 12, 2023

Relying on modular arithmetic for this purpose is really smart 😮
Yes, I think what you propose would work!

erikmd (Member, Author) commented Jul 12, 2023

Thanks @Zimmi48 !

BTW, while we are at it: even if we are sure there will be no "looping",
maybe it would be safer to add another bound, to ensure that coqbot's automatic retries won't exceed a reasonable total.

e.g. if max_number_of_retry = 3 and max_number_of_retry_series = 4,

let should_auto_retry count =
  let c = count - 1 in
  let m = max_number_of_retry + 1 in
  let q, r = c / m, c mod m in
  r < max_number_of_retry && q < max_number_of_retry_series

but YMMV!
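For illustration (again just a sketch of the idea, not coqbot's code), with these two bounds the predicate allows four series of three automatic retries and then refuses to auto-retry for good:

(* Sketch: print the decision for the first retry counts. *)
let max_number_of_retry = 3
let max_number_of_retry_series = 4

let should_auto_retry count =
  let c = count - 1 in
  let m = max_number_of_retry + 1 in
  let q, r = c / m, c mod m in
  r < max_number_of_retry && q < max_number_of_retry_series

let () =
  for count = 1 to 18 do
    Printf.printf "count %2d -> %b\n" count (should_auto_retry count)
  done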

silene commented Sep 1, 2023

Copied from Zulip:

It is great that you have exponential backoff, but you are missing an important piece. You need to make the sleep length random. For example, if your current backoff is 25 seconds, you should not sleep for 25 seconds; you should sleep between 25 and 25*2=50 seconds. Otherwise, your 49 failing jobs will all wake up at the exact same time, and thus they will fail again.
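A sketch of what the randomized sleep could look like (illustrative only, using the standard library's Random module; not coqbot's code), drawing the actual delay uniformly from the interval [base, 2*base):

(* Full jitter over one doubling interval: with a base delay of 25s the
   actual sleep is drawn uniformly between 25s and 50s. *)
let jittered_delay base = base +. Random.float base

let () =
  Random.self_init () ;
  List.iter
    (fun base ->
      Printf.printf "base %.0fs -> sleep %.1fs\n" base (jittered_delay base) )
    [5.; 25.; 125.]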

Zimmi48 changed the title from "gitlab.inria.fr/math-comp/docker-mathcomp: the current backoff does not seem enough" to "Add backoff strategy to auto retry of failing CI jobs" on Feb 5, 2024
erikmd (Member, Author) commented Jul 2, 2024

FTR, today (2024-07-02) in a 1-job pipeline, gitlab.inria.fr failed with:

https://gitlab.inria.fr/math-comp/docker-mathcomp/-/jobs/4509247

(...)
$ pwd
/builds/2mk6rsew/0/math-comp/docker-mathcomp
$ /usr/bin/env bash -e -c ' # collapsed multi-line command
bash
Error response from daemon: Get "https://registry-1.docker.io/v2/": received unexpected HTTP status: 502 Bad Gateway

I restarted the job manually.

erikmd (Member, Author) commented Jul 13, 2024

FYI @Zimmi48: @palmskog and I got new 502 errors at docker pull time, with the message quoted above ↑

I believe this form of the error message is not covered in

bot/src/actions.ml

Lines 510 to 521 in 53ef64c

else if
  test "transfer closed with outstanding read data remaining"
  || test "HTTP request sent, awaiting response... 50[0-9]"
  || test "The requested URL returned error: 502"
  (*|| test "[Tt]he remote end hung up unexpectedly"*)
  (* Can happen with (actual) issues with overlays. *)
  || test "error: unable to download 'https://cache.nixos.org/"
  || test "fatal: unable to access .* Couldn't connect to server"
  || test "fatal: unable to access .* Could not resolve host"
  || test "Resolving .* failed: Temporary failure in name resolution"
  || test "unexpected status code .*: 401 Unauthorized"
then Retry "Connectivity issue"

do you think we should add
|| test "received unexpected HTTP status: 502"
or
|| test "received unexpected HTTP status: 502.*"
in coq/bot?
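To double-check that either pattern would match the observed error line, here is a quick standalone test using OCaml's Str library (an illustration only; the bot's test helper may use a different regexp engine and syntax):

(* Check the two candidate patterns against the Docker error message quoted
   above. Requires linking the str library. *)
let () =
  let line =
    "Error response from daemon: Get \"https://registry-1.docker.io/v2/\": \
     received unexpected HTTP status: 502 Bad Gateway"
  in
  let matches pattern =
    try
      ignore (Str.search_forward (Str.regexp pattern) line 0) ;
      true
    with Not_found -> false
  in
  Printf.printf "pattern 1: %b\n" (matches "received unexpected HTTP status: 502") ;
  Printf.printf "pattern 2: %b\n" (matches "received unexpected HTTP status: 502.*")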

Edit: I opened a PR with a more general regexp.
