expose issuer certificate TTL as a prometheus metric #13615

Open
n-oden wants to merge 1 commit into main from log-issuer-expiry-time


n-oden
Contributor

@n-oden n-oden commented Jan 30, 2025

Problem: There is currently no simple way to monitor the expiration time of the issuer certificate in use by linkerd. This is a surprising omission, since an expired issuer certificate will almost certainly cause visible cluster issues.

Solution:

  • When a new issuer certificate is loaded, log its NotAfter time in unix epoch format, along with the current process wall clock time. The two timestamps are passed in via the logrus Fields pattern, allowing operators to easily pull these numbers from pod logs.
  • Register a prometheus gauge function metric to expose the TTL for monitoring

Fixes: #11215

@n-oden n-oden requested a review from a team as a code owner January 30, 2025 19:10
@n-oden
Contributor Author

n-oden commented Jan 30, 2025

cc: @whickman :)

@n-oden
Contributor Author

n-oden commented Jan 30, 2025

(IMO it would be somewhat preferable to expose this as a prometheus metric, but to put it mildly I found the internal metrics story here opaque. If someone wanted to hold my hand a bit, I'd happily add it. But in the meantime, the log line should suffice for the basic case.)

Member

@alpeb alpeb left a comment

Thanks @n-oden , this looks good to me 👍
For adding the metric, you can leverage the promauto package, like in this example in the Destination controller:
https://github.com/linkerd/linkerd2/blob/edge-25.1.2/controller/api/destination/watcher/cluster_store.go#L102-L105
You could add a new field in the Service struct tracking the cert expiry (that would get updated in loadCredentials) and wire the metric in the NewService function.

@n-oden n-oden force-pushed the log-issuer-expiry-time branch from e075e98 to e50f621 Compare February 4, 2025 18:30
@n-oden
Contributor Author

n-oden commented Feb 4, 2025

@alpeb thank you very much for the pointer! I've taken a stab at it. I may have over-thought the initialization logic; let me know. :)

@n-oden n-oden force-pushed the log-issuer-expiry-time branch 4 times, most recently from 972a481 to bc69026 Compare February 4, 2025 20:50
@n-oden n-oden changed the title from "Log issuer certificate expiry" to "expose issuer certificate TTL as a prometheus metrics" Feb 4, 2025
@n-oden n-oden changed the title from "expose issuer certificate TTL as a prometheus metrics" to "expose issuer certificate TTL as a prometheus metric" Feb 4, 2025
@n-oden n-oden force-pushed the log-issuer-expiry-time branch from bc69026 to 0a9dcbc Compare February 4, 2025 21:00
@n-oden
Contributor Author

n-oden commented Feb 4, 2025

@alpeb for some reason trying to use promauto failed in the test suite because it attempted to register the gauge function twice; I've replaced it with a more traditional use of prometheus.Register().

The failing tests seem to be a github actions issue; I cannot reproduce them locally:

=== Skipped
=== SKIP: viz/cmd TestRequestTapByResourceFromAPI/Should_return_error_if_stream_returned_error (0.00s)
    --- SKIP: TestRequestTapByResourceFromAPI/Should_return_error_if_stream_returned_error (0.00s)

DONE 1063 tests, 1 skipped in 94.401s

@n-oden n-oden requested a review from alpeb February 4, 2025 21:16
@n-oden n-oden force-pushed the log-issuer-expiry-time branch from 0a9dcbc to 02e1111 Compare February 4, 2025 21:22
Member

@alpeb alpeb left a comment

Thanks, yeah, better to be more explicit than relying on magical APIs 😉

Could you, however, log a warning instead of panicking? It's not critical if the metric cannot be registered, so it's better to get an informative error than to fail the whole controller.

Another thing I believe we don't have metrics for is the trust root certs. If you're up for it, you could include that here or in a separate PR. The trust root might be a bundle of certs, so we could track the ttl for the cert that is the closest to expiry...

N.B.: Tests are green; the failures were just flakiness.

pkg/identity/service.go: two outdated review threads (resolved)
@n-oden n-oden force-pushed the log-issuer-expiry-time branch from 02e1111 to 0fffba0 Compare February 5, 2025 17:00
@n-oden
Contributor Author

n-oden commented Feb 5, 2025

I've added a metric for the trust anchor TTL but beware that I couldn't figure out any plausible way to do this that didn't involve the deprecated Subjects() method -- looking at the CertPool documentation I'm not sure how one would go about doing this without touching that method.

@n-oden n-oden force-pushed the log-issuer-expiry-time branch from 0fffba0 to b797f70 Compare February 5, 2025 17:01
@n-oden n-oden requested a review from alpeb February 5, 2025 17:09
@n-oden n-oden force-pushed the log-issuer-expiry-time branch 3 times, most recently from 9bf1010 to a7b45bc Compare February 5, 2025 18:16
@n-oden
Contributor Author

n-oden commented Feb 5, 2025

Reading through golang/go#46287, it looks like the intent is for x509.CertPool to be entirely opaque, and the answer is "use the system verifier". I'm not sure how that is supposed to apply when you're essentially building your own CA locally. Anyway, I've tagged it //nolint for now, but this will obviously break badly if/when the deprecated method is finally removed, so it's your call whether to leave this in.

@n-oden n-oden force-pushed the log-issuer-expiry-time branch from a7b45bc to db98e3c Compare February 6, 2025 05:56
Member

@alpeb alpeb left a comment

We require the full certs, not just their subjects. Testing with a plain vanilla linkerd instance shows that the subjects can't be parsed as certificates:

time="2025-02-06T14:30:07Z" level=error msg="could not parse trust anchor certificate: x509: malformed tbs certificate"
time="2025-02-06T14:30:07Z" level=warning msg="Could not parse any trust anchor certs; cannot get TTL"

It seems CertPool doesn't expose a public API to retrieve its certs 🤷‍♂️
We could have this service receive the raw certs instead of CertPool, but this was just a nice-to-have and not worth the refactoring. So I think it's fine to leave this out and just add the issuer cert stuff... Thanks anyways for digging into this 🙂

@n-oden n-oden force-pushed the log-issuer-expiry-time branch from db98e3c to 7939d37 Compare February 6, 2025 15:10
@n-oden
Contributor Author

n-oden commented Feb 6, 2025

Oh well, it was worth a shot. :)

@n-oden n-oden requested a review from alpeb February 6, 2025 15:10
Add a prometheus gauge function in the identity package that exposes
the current TTL in seconds of the issuer certificate.

When a new issuer certificate is loaded, log its NotAfter time
in unix epoch format, along with the current process wall clock time.

This addresses linkerd#11215

Signed-off-by: Nathan J. Mehl <[email protected]>
@n-oden n-oden force-pushed the log-issuer-expiry-time branch from 0a67b6d to 131c890 Compare February 10, 2025 16:19
@n-oden
Contributor Author

n-oden commented Feb 10, 2025

Hey @alpeb, hope you had a good weekend. I think this is basically good to go, if you concur?

Development

Successfully merging this pull request may close these issues.

Metrics for certificate expiry
2 participants