
api: retry the getDirectory request on DNS errors #1280

Open
wants to merge 1 commit into master

Conversation

@andir commented Nov 3, 2020

In some situations, even though you have proper start-up ordering, DNS might briefly be unavailable when the lego units are started.
This is especially critical as systemd doesn't enforce ordering after targets (e.g. nss-lookup.target) when configuration is being changed and both the local DNS resolver and lego are restarted at the same time.

@ldez self-requested a review November 3, 2020 21:34
@ldez (Member) commented Nov 3, 2020

Hello,

The getDirectory call is not related to a DNS server but to the ACME server, so I don't understand your PR.

It seems to me that you're just trying to put a delay before launching lego.

@andir (Author) commented Nov 3, 2020

Hello,

> The getDirectory call is not related to a DNS server but to the ACME server, so I don't understand your PR.
>
> It seems to me that you're just trying to put a delay before launching lego.

This is about the local DNS resolver on the machine, which is queried as part of the request to caDirURL. This is required because the application has to resolve the DNS name of the Let's Encrypt API endpoint to one IP address (or a list of them).

I am explicitly not trying to add a fixed delay before the startup of lego. What I am trying to cope with here is a transient error from the local DNS server. In that case a local recursor might not yet be able to answer the query (in time), as it might still be busy binding to sockets, resolving the root zone, …. The query then fails (in one way or another).

During system startup the unit ordering (Before, After, WantedBy, …) of systemd is sufficient to bring services up in the right order. While the system is running, however, there is apparently no way to enforce a defined order of unit reloads/restarts, so the local recursive DNS server might be reloaded at the very moment you are trying to deploy new certificates. This happens when you deploy a new set of configuration files and systemd units where requesting new certificates is a prerequisite to reloading the webserver.

Gracefully handling this transient error as proposed in this PR looks like the most reasonable way to mitigate it at this point in time.

We've been running into this issue on NixOS whenever we reload multiple system services at the same time. lego would fail with a message such as the following:

Nov 03 21:14:51 bertha systemd[1]: Starting Renew ACME certificate for host.tld...
Nov 03 21:14:52 bertha acme-host.tld-start[11959]: 2020/11/03 21:14:52 Could not create client: get directory at 'https://acme-v02.api.letsencrypt.org/directory': Get "https://acme-v02.api.letsencrypt.org/directory": dial tcp: lookup acme-v02.api.letsencrypt.org: no such host
Nov 03 21:14:52 bertha systemd[1]: acme-epsilon.host.tld.service: Main process exited, code=exited, status=1/FAILURE
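
(For illustration only - this is not the actual diff of this PR. A minimal, self-contained sketch of the behaviour described above, assuming the backoff package lego already depends on: retry a plain GET of the directory URL while the failure is a DNS lookup error. The function name, timings, and import path are made up.)

```go
package acmeretry

import (
	"errors"
	"log"
	"net"
	"net/http"
	"time"

	"github.com/cenkalti/backoff/v4" // module path assumed; lego's actual version may differ
)

// getDirectoryWithRetry fetches the ACME directory and retries transient
// DNS failures (e.g. the local resolver is still starting up) with an
// exponential backoff. Any other error aborts immediately.
func getDirectoryWithRetry(client *http.Client, caDirURL string) (*http.Response, error) {
	bo := backoff.NewExponentialBackOff()
	bo.InitialInterval = 200 * time.Millisecond
	bo.MaxInterval = 5 * time.Second
	bo.MaxElapsedTime = 20 * time.Second

	var resp *http.Response

	operation := func() error {
		var err error
		resp, err = client.Get(caDirURL)
		if err != nil {
			var dnsErr *net.DNSError
			if errors.As(err, &dnsErr) {
				// "no such host", lookup timeouts, ...: worth retrying.
				return err
			}
			// Everything else is treated as permanent.
			return backoff.Permanent(err)
		}
		return nil
	}

	notify := func(err error, d time.Duration) {
		log.Printf("retrying directory request in %s due to: %v", d, err)
	}

	err := backoff.RetryNotify(operation, bo, notify)
	return resp, err
}
```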

@ldez (Member) commented Nov 3, 2020

Wouldn't it be more interesting if you added this validation before launching lego?

The package api is the core ACME API; I am very reluctant to add this kind of validation within this package.
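
(Purely to illustrate that alternative - not something lego ships: a small pre-flight probe that could run before lego is started, e.g. from a wrapper script or an ExecStartPre-style hook. The host name and timings are made up.)

```go
package main

import (
	"context"
	"log"
	"net"
	"time"
)

// waitForDNS blocks until host resolves or the timeout expires.
func waitForDNS(host string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	resolver := &net.Resolver{}
	for {
		_, err := resolver.LookupHost(ctx, host)
		if err == nil {
			return nil
		}
		if ctx.Err() != nil {
			// Overall timeout reached; give up and report the last error.
			return err
		}
		time.Sleep(500 * time.Millisecond)
	}
}

func main() {
	if err := waitForDNS("acme-v02.api.letsencrypt.org", 30*time.Second); err != nil {
		log.Fatalf("DNS not ready: %v", err)
	}
	// DNS answers; it is now reasonably safe to start lego.
}
```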

@andir (Author) commented Nov 3, 2020

That would probably be one potential workaround for the issue at hand. My reasoning for having it in here is that software should be resilient against flaky network connections in general.

Grepping through the code and seeing the example below made me believe that more graceful error handling would fit here, as it has been done previously already:

lego/acme/api/api.go, lines 80 to 117 in 77f171f:

```go
func (a *Core) retrievablePost(uri string, content []byte, response interface{}) (*http.Response, error) {
	// during tests, allow to support ~90% of bad nonce with a minimum of attempts.
	bo := backoff.NewExponentialBackOff()
	bo.InitialInterval = 200 * time.Millisecond
	bo.MaxInterval = 5 * time.Second
	bo.MaxElapsedTime = 20 * time.Second

	ctx, cancel := context.WithCancel(context.Background())

	var resp *http.Response

	operation := func() error {
		var err error
		resp, err = a.signedPost(uri, content, response)
		if err != nil {
			// Retry if the nonce was invalidated
			var e *acme.NonceError
			if errors.As(err, &e) {
				return err
			}

			cancel()
			return err
		}

		return nil
	}

	notify := func(err error, duration time.Duration) {
		log.Infof("retry due to: %v", err)
	}

	err := backoff.RetryNotify(operation, backoff.WithContext(bo, ctx), notify)
	if err != nil {
		return resp, err
	}

	return resp, nil
}
```

A check before executing the application would help with issues present during startup, but having the proposed changes in the code would make the entire code base more robust, regardless of any previously run one-off checks.

@ldez (Member) commented Nov 3, 2020

retrievablePost is related to nonce errors: these kinds of retries are expected by the ACME server (in Pebble there is an option specifically to validate this behavior), so it's not comparable to your PR.

You added this retry only because it is currently the first call that is made, but if I change which call is made first, this modification will not work anymore, so it doesn't give me a feeling of improved stability.

I understand your issue, but I still think that your proposal is not the right way to do that.

@andir (Author) commented Nov 3, 2020

Ok, I guess applying retry logic to all requests could work then? Where would be the best place in the code to handle this in your opinion?
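
(To make the question concrete - and again only as an illustrative sketch, not something either side endorsed here - retrying every request on DNS errors could in principle live at the HTTP transport level rather than at each call site:)

```go
package retrytransport

import (
	"errors"
	"net"
	"net/http"
	"time"
)

// dnsRetryTransport retries a request a few times when it fails with a
// DNS lookup error. Note: requests with a body would need Request.GetBody
// to be replayable; the ACME directory fetch is a plain GET, so that is
// not an issue for that call.
type dnsRetryTransport struct {
	next     http.RoundTripper
	attempts int
	delay    time.Duration
}

func (t *dnsRetryTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	var resp *http.Response
	var err error

	for i := 0; i < t.attempts; i++ {
		resp, err = t.next.RoundTrip(req)
		if err == nil {
			return resp, nil
		}

		var dnsErr *net.DNSError
		if !errors.As(err, &dnsErr) {
			return nil, err // not a DNS failure: fail immediately
		}
		time.Sleep(t.delay)
	}
	return nil, err
}

// newRetryingClient wraps the default transport with the retry behaviour.
func newRetryingClient() *http.Client {
	return &http.Client{
		Transport: &dnsRetryTransport{
			next:     http.DefaultTransport,
			attempts: 5,
			delay:    time.Second,
		},
	}
}
```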

@ldez (Member) commented Nov 3, 2020

> Ok, I guess applying retry logic to all requests could work then?

I didn't mean to imply that at all.

lego is mainly used as a lib, and it's also a CLI tool; I don't think that the lego lib has to handle this kind of issue.

I think that lego doesn't have to handle that, or maybe only in the CLI context.

@flokli commented Nov 4, 2020

Hm, having some sort of retry behaviour in libraries that deal with network access doesn't seem to be too controversial - see https://godoc.org/cloud.google.com/go/storage:

> All of the methods of this package use exponential backoff to retry calls that fail with certain errors, as described in https://cloud.google.com/storage/docs/exponential-backoff. Retrying continues indefinitely unless the controlling context is canceled or the client is closed. See context.WithTimeout and context.WithCancel.

I'm not entirely set on the "we'll retry indefinitely unless the controlling context is cancelled" part - some people might not be fully aware of contexts and may rely on things eventually erroring out - but at least the "we abstract some network flakiness away from users of the library, so consumers don't need to implement it themselves" part doesn't sound too bad to me to add into lego the library.
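
(For reference, the pattern quoted above looks roughly like this with the backoff package lego already uses: retry indefinitely, but stop as soon as the caller's context is cancelled or times out. Names, parameters, and the import path are illustrative.)

```go
package ctxretry

import (
	"context"
	"net/http"

	"github.com/cenkalti/backoff/v4" // module path assumed
)

// getWithContextBackoff retries the request with exponential backoff until
// it succeeds or ctx is cancelled / times out; it never gives up on its own.
func getWithContextBackoff(ctx context.Context, client *http.Client, url string) (*http.Response, error) {
	bo := backoff.NewExponentialBackOff()
	bo.MaxElapsedTime = 0 // 0 means "never stop" in this package

	var resp *http.Response

	operation := func() error {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return backoff.Permanent(err)
		}
		resp, err = client.Do(req)
		return err
	}

	// WithContext makes the retry loop stop as soon as ctx is done.
	err := backoff.Retry(operation, backoff.WithContext(bo, ctx))
	return resp, err
}
```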

@ldez (Member) commented Nov 4, 2020

I used the word "context" to talk about a scope/context, not a Go Context; I don't want to add a timeout or deadline.

@flokli commented Nov 4, 2020

Yes, I understood.

> I don't want to add a timeout or deadline.

Just to make sure I correctly understand that part - so you would prefer an indefinite exponential backoff in the library part over some hardcoded/configurable timeouts?

Or do you still not want any backoff and retry logic in the library, and every consumer of the library (including the lego CLI) should write that logic by themselves?

@ldez (Member) commented Nov 4, 2020

> Just to make sure I correctly understand that part - so you would prefer an indefinite exponential backoff in the library part over some hardcoded/configurable timeouts?

You're simplifying what I'm saying; this is not what I mean.

> Or do you still not want any backoff and retry logic in the library, and every consumer of the library (including the lego CLI) should write that logic by themselves?

I'm talking within the scope of this PR: knowing whether DNS is ready before making queries.
Based on that, a discussion about a global retry system on the core API is a bit off-topic.
