Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix KubeClientCertificateExpiration alerts #941

Merged

Conversation

7840vz
Copy link
Contributor

@7840vz 7840vz commented Jun 10, 2024

  1. Fix aggregation for on(job) to become (job, cluster, instance). Otherwise, It would be enough to have just single instance with certificate expiration problem, and it would set all apiservers to 'firing' (false positive!).

  2. Also, change aggregation by (le) to without(service,endpoint...), dropping only useless labels, but keeping external labels (like environment etc) intact. Otherwise they get dropped.

  3. Change order of metrics in expression: apiserver_client_certificate_expiration_seconds_bucket metric comes first so actual expiration date is shown as result in Grafana->Explore queries, not apiserver_client_certificate_expiration_seconds_count value (which is quite useless). This make it easier to troubleshoot.

1) Change aggregation `by (le)` to `without(service,endpoint...)`, dropping only useless labels, but keeping external labels (like environment etc) intact. Otherwise they get dropped.
2) Change order of metrics in expression: `apiserver_client_certificate_expiration_seconds_bucket` metric comes first so actual expiration date is shown as result in Grafana->Explore queries, not `apiserver_client_certificate_expiration_seconds_count` value (which is quite useless). This make it easier to troubleshoot.
3) Finally, fix aggregation for `on(job)` to become `(job, cluster, instance)`. Otherwise, It would be enough to have just single instance with certificate expiration problem, and it would set all apiservers to 'firing' (false positive!).
Copy link

github-actions bot commented Sep 4, 2024

This PR has been automatically marked as stale because it has not had any activity in the past 30 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed in 7 days if there is no new activity.
Thank you for your contributions!

@github-actions github-actions bot added the stale label Sep 4, 2024
@skl
Copy link
Collaborator

skl commented Sep 4, 2024

commenting to keep this open a little longer, looks like a genuine issue

@skl skl added bug Something isn't working keepalive Use to prevent automatic closing labels Sep 4, 2024
@skl skl removed the stale label Oct 14, 2024
@skl
Copy link
Collaborator

skl commented Oct 29, 2024

@7840vz would you have a moment to address the conflicts?

@7840vz 7840vz requested a review from povilasv as a code owner November 1, 2024 08:29
@7840vz
Copy link
Contributor Author

7840vz commented Nov 1, 2024

resolved conflicts

@7840vz
Copy link
Contributor Author

7840vz commented Nov 1, 2024

could you check this one as well pls? #942

@lorenzofelletti
Copy link
Contributor

It would be nice to have also the cluster indicated in the alert description (here, and here):

description: 'A client certificate used to authenticate to kubernetes apiserver is expiring in less than %s on cluster {{ $labels.%s }}.' % [(utils.humanizeSeconds($._config.certExpirationWarningSeconds)), $._config.clusterLabel],

@skl
Copy link
Collaborator

skl commented Nov 7, 2024

@lorenzofelletti Having the cluster labels in the description is definitely desirable but not everyone has the cluster label, so might need some logic in the descriptions to check if they have multiCluster enabled or not (I think some alerts do this iirc). I think that can be addressed as a separate issue as there's probably a whole bunch of alerts that would benefit from that.

@skl skl merged commit 3830dfd into kubernetes-monitoring:master Nov 7, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working keepalive Use to prevent automatic closing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants