api: retry the getDirectory request on DNS errors #1280
In some situations, even though you have proper start-up ordering, DNS might briefly be unavailable when the lego units are started. This is especially critical because systemd doesn't enforce ordering after targets (e.g. the nss-lookup.target) when configuration is being changed and both the local DNS resolver and lego are being restarted at the same time.
Hello, it seems to me that you're just trying to put a delay before launching lego. |
This is about a local DNS resolver on the machine that is being queried as part of the request to […]. I am explicitly not trying to add a fixed delay before the startup of lego. What I am trying to cope with here is a transient error with the local DNS server: a local recursor might not yet be able to answer the query (in time) because it is still busy binding to sockets, resolving the root zone, …. The query then fails (in one way or another). During system startup, systemd's unit ordering (Before, After, WantedBy, …) is sufficient to have services come up in the right order. While the system is running, there is apparently no way to get a defined order of unit reloading/restarting, so the local recursive DNS server might be reloaded at the very moment you are trying to deploy new certificates. This happens when you deploy a new set of configuration files and the systemd units that request new certificates are a prerequisite to reloading the webserver. Gracefully handling this transient error as proposed in this PR looks like the most reasonable way to mitigate it at this point in time. We've been running into this issue on NixOS whenever we reload multiple system services at the same time.
|
Wouldn't it be more interesting if you added this validation before launching lego? The package |
That would probably be one potential workaround for the issue at hand. My reasoning for having it in here is that software should be resilient against flaky network connections in general. Grepping through the code and seeing the given example (below) made me believe that more graceful error handling would fit, as it has been done previously already: Lines 80 to 117 in 77f171f
A check before executing the application would help with issues present during startup, but having the proposed changes in the code would make the entire code more stable, regardless of any previously run one-off checks. |
You added this retry only because it is currently the first call that is made, but if I change the first call this modification will not work anymore, so it doesn't give me a feeling of improved stability. I understand your issue, but I still think that your proposal is not the right way to do that. |
Ok, I guess applying retry logic to all requests could work then? Where would be the best place in the code to handle this in your opinion? |
I didn't mean to imply that at all. lego is mainly used as a lib, and it's also a CLI tool. I don't think that the lego lib has to handle this kind of issue, or maybe only in the CLI context. |
Hm, having some sort of retry behaviour in libraries when they deal with network access seems to be not too controversial - see https://godoc.org/cloud.google.com/go/storage:
I'm not entirely set on the "we'll retry indefinitely unless a context with a timeout is passed" part - some people might not be fully aware of contexts and might rely on things eventually erroring - but at least the "we abstract some network flakiness away from users of a library, so all consumers don't need to implement this by themselves" part doesn't sound too bad to me to add to lego the library. |
I use the word "context" to talk about a scope/context, not a Go |
Yes, I understood.
Just to make sure I correctly understand that part - so you would prefer an indefinite exponential backoff in the library part over some hardcoded/configurable timeouts? Or do you still not want any backoff and retry logic in the library, so that every consumer of the library (including the lego CLI) should write that logic by themselves? |
You're simplifying what I'm saying, this is not what I mean.
I'm talking in the scope of this PR: to know whether DNS is ready before making queries. |