-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
expose issuer certificate TTL as a prometheus metric #13615
base: main
Are you sure you want to change the base?
Conversation
cc: @whickman :) |
(IMO it would be somewhat preferable to expose this as a prometheus metric, but to put it mildly I found the internal metrics story here opaque. If someone wanted to hold my hand a bit, I'd happily add it. But in the meantime, the log line should suffice for the basic case.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @n-oden , this looks good to me 👍
For adding the metric, you can leverage the promauto
package, like in this example in the Destination controller:
https://github.com/linkerd/linkerd2/blob/edge-25.1.2/controller/api/destination/watcher/cluster_store.go#L102-L105
You could add a new field in the Service
struct tracking the cert expiry (that would get updated in loadCredentials
) and wire the metric in the NewService
function.
e075e98
to
e50f621
Compare
@alpeb thank you very much for the pointer! I've taken a stab at it. I may have over-thought the initialization logic; let me know. :) |
972a481
to
bc69026
Compare
bc69026
to
0a9dcbc
Compare
@alpeb for some reason trying to use promauto failed in the test suite because it attempted to register the gauge function twice; I've replaced it with a more traditional use of prometheus.Register(). The failing tests seem to be a github actions issue; I cannot reproduce them locally:
|
0a9dcbc
to
02e1111
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, yeah, better to be more explicit than relying on magical APIs 😉
Can you however issue a warning log instead of panic'ing? It's not critical if the metric cannot be registered, so better to get an informative error rather than failing the whole controller.
Another thing I believe we don't have metrics for is the trust root certs. If you're up for it, you could include that here or in a separate PR. The trust root might be a bundle of certs, so we could track the ttl for the cert that is the closest to expiry...
N.B.: Tests are green, failures were just flakiness.
02e1111
to
0fffba0
Compare
I've added a metric for the trust anchor TTL but beware that I couldn't figure out any plausible way to do this that didn't involve the deprecated Subjects() method -- looking at the CertPool documentation I'm not sure how one would go about doing this without touching that method. |
0fffba0
to
b797f70
Compare
9bf1010
to
a7b45bc
Compare
Reading through golang/go#46287 it looks like the intent is for tls.CertPool to be entirely opaque and the answer is "use the system verifier", which... I'm not sure how that is even supposed to apply to a case where you're essentially building your own CA locally? Anyway I tagged it |
a7b45bc
to
db98e3c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We require the full certs, not just their subjects. Testing with a plain vanilla linkerd instance shows that can't be parsed:
time="2025-02-06T14:30:07Z" level=error msg="could not parse trust anchor certificate: x509: malformed tbs certificate"
time="2025-02-06T14:30:07Z" level=warning msg="Could not parse any trust anchor certs; cannot get TTL"
It seems CertPool doesn't expose a public API to retrieve its certs 🤷♂️
We could have this service receive the raw certs instead of CertPool, but this was just a nice-to-have and not worth the refactoring. So I think it's fine to leave this out and just add the issuer cert stuff... Thanks anyways for digging into this 🙂
db98e3c
to
7939d37
Compare
Oh well, it was worth a shot. :) |
Add a prometheus gauge function in the identity package that exposes the current TTL in seconds of the issuer certificate. When a new issuer certificate is loaded, log its NotAfter time in unix epoch format, along with the current process wall clock time. This addresses linkerd#11215 Signed-off-by: Nathan J. Mehl <[email protected]>
0a67b6d
to
131c890
Compare
Hey @alpeb, hope you had a good weekend. I think this is basically good to go, if you concur? |
Problem: There is currently no simple way to monitor the expiration time of the issuer certificate in use by linkerd; a surprising omission considering that issuer cert expiration will almost certainly cause visible cluster issues.
Solution:
Fixes: #11215