feat(monitor): operator Prometheus metrics with mTLS #4558
rene-dekker merged 8 commits into tigera:master
Conversation
```go
rules = append(rules,
	monitoringv1.Rule{
		Alert: "TLSCertExpiringWarning",
		Expr:  intstr.FromString("tigera_operator_tls_certificate_expiry_timestamp_seconds - time() < 29 * 24 * 3600"),
```
I picked 29 days, because we automatically renew our own certs 30d before expiry. I wanted to exclude that day to generate fewer alerts.
```go
// metricsEnabled returns true when the operator metrics endpoint is enabled.
func metricsEnabled() bool {
	return strings.EqualFold(os.Getenv("METRICS_ENABLED"), "true")
}
```
Technically a breaking change, since someone might be using metrics today without this set?
But probably not a big deal - easy fix.
Yes, my rationale was that it is better to be explicit about enablement; it can be reverted if desired. I don't think there are many Prometheus users affected, since there were no metrics other than the out-of-the-box ones before this PR.
```go
// dynamicCertLoader dynamically loads TLS certificates from Kubernetes secrets
// for the metrics endpoint. The monitor controller creates the server cert, and
// the client CA is loaded from the Prometheus client TLS secret.
type dynamicCertLoader struct {
```
I wonder, do we need all of this complexity?
Couldn't we just use an optional secret file mount, and have kubelet auto-load changes to the mounted paths?
I agree it is a better approach, even though it will cause a restart of the container. From my reading, the kubelet may take up to a minute to trigger the restart, so for the first little bit after deploying the operator you would not have metrics, but for overall simplicity it is better. Will do, thanks.
Add operator metrics endpoint with configurable mTLS via METRICS_SCHEME, METRICS_HOST, and METRICS_PORT env vars. The monitor controller creates a server cert, Service, and ServiceMonitor for Prometheus integration. Client auth trusts the tigera-ca-private CA rather than individual leaf certs. Includes a custom Prometheus collector for operator status gauges. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the implicit metrics-enabled detection (METRICS_HOST/PORT set) with an explicit METRICS_ENABLED=true env var. Default METRICS_HOST to 0.0.0.0 and METRICS_PORT to 8484 when enabled. Log a helpful message when mTLS is enabled but the server certificate is not yet available. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use port 9484 instead of 8484 to reduce the chance of conflicts on host-networked nodes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Set SecureServing: true so controller-runtime actually serves TLS instead of plain HTTP when METRICS_SCHEME=https. Add egress rule in calicoSystemPrometheusPolicy to allow Prometheus to reach the operator metrics Service. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gate the tigera-operator-tls keypair creation on METRICS_SCHEME=https rather than METRICS_ENABLED=true. The Service and ServiceMonitor are still created whenever metrics are enabled, but the TLS certificate is only needed when mTLS is active. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add METRICS_CLIENT_AUTH and TLS_MIN_VERSION env vars for configurable mTLS client auth and TLS version (aligned with calico-private)
- Extract metrics helpers into cmd/metrics.go
- Replace dynamic cert loader with file-based TLS loading via volume mounts
- Move MetricsEnabled/MetricsTLSEnabled to pkg/common for shared use
- Export certificate annotation constants from certificatemanagement package
- Rename OperatorMetricsSecretName to OperatorTLSSecretName
- Fix inconsistent method naming (serviceOperatorMetrics)
- Simplify conditionLabel to strings.ToLower
- Add debug log for unparseable cert expiry annotations
- Skip license API calls on OSS clusters (check CRD at startup)
- Remove legacy "team: network-operators" label from all ServiceMonitors
- Build ./cmd/ package instead of single file

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move MetricsEnabled/MetricsTLSEnabled to pkg/common/metrics.go
- Move ParseTLSVersion and ParseClientAuthType to pkg/tls/tls.go
- Fix stale comment on license error path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove duplicate prometheus/client_golang dependency entry after rebasing on latest master. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- METRICS_HOST, METRICS_PORT, and METRICS_SCHEME env vars
- METRICS_SCHEME=https: server cert from tigera-operator-tls, client auth trusts tigera-ca-private CA
- operator_installation_status and operator_tigera_status gauges

Test plan
- make build passes
- make ut UT_DIR=./pkg/controller/metrics — 10 tests pass
- make ut UT_DIR=./pkg/render/monitor — 20 tests pass
- make ut UT_DIR=./pkg/controller/monitor — 17 tests pass
- With METRICS_HOST/METRICS_PORT set, verify metrics scraped
- With METRICS_SCHEME=https, verify mTLS handshake with Prometheus

🤖 Generated with Claude Code
Example alerts: (screenshot)

Example metrics: (screenshot)