feat(monitor): operator Prometheus metrics with mTLS #4558

Merged
rene-dekker merged 8 commits into tigera:master from rene-dekker:EV-6493
Apr 7, 2026
@rene-dekker rene-dekker commented Mar 16, 2026

Summary

  • Add configurable Prometheus metrics endpoint to the operator via METRICS_HOST, METRICS_PORT, and METRICS_SCHEME env vars
  • mTLS support when METRICS_SCHEME=https: server cert from tigera-operator-tls, client auth trusts tigera-ca-private CA
  • Monitor controller creates Service, ServiceMonitor, and server TLS cert for automatic Prometheus discovery
  • Custom Prometheus collector exposes operator_installation_status and operator_tigera_status gauges

Test plan

  • make build passes
  • make ut UT_DIR=./pkg/controller/metrics — 10 tests pass
  • make ut UT_DIR=./pkg/render/monitor — 20 tests pass
  • make ut UT_DIR=./pkg/controller/monitor — 17 tests pass
  • Manual: deploy with METRICS_HOST/METRICS_PORT set, verify metrics scraped
  • Manual: deploy with METRICS_SCHEME=https, verify mTLS handshake with Prometheus

🤖 Generated with Claude Code

Example alerts: (image omitted)

Example metrics:

$ curl --cert client.crt --key client.key --cacert ca.crt https://tigera-operator-metrics.tigera-operator:9484/metrics | grep tigera_operator_tls_certificate
# HELP tigera_operator_tls_certificate_expiry_timestamp_seconds Unix timestamp of certificate expiry for operator-managed TLS secrets.
# TYPE tigera_operator_tls_certificate_expiry_timestamp_seconds gauge
tigera_operator_tls_certificate_expiry_timestamp_seconds{issuer="byo-signer",name="calico-apiserver-certs",namespace="calico-system"} 1.774828114e+09
tigera_operator_tls_certificate_expiry_timestamp_seconds{issuer="tigera-operator-signer",name="calico-apiserver-certs",namespace="tigera-operator"} 1.844609326e+09

$ curl --cert client.crt --key client.key --cacert ca.crt https://tigera-operator-metrics.tigera-operator:9484/metrics | grep tigera_operator_component_status
tigera_operator_component_status{component="apiserver",condition="available"} 1
tigera_operator_component_status{component="apiserver",condition="degraded"} 0
tigera_operator_component_status{component="apiserver",condition="progressing"} 0
tigera_operator_component_status{component="calico",condition="available"} 1
tigera_operator_component_status{component="calico",condition="degraded"} 0
tigera_operator_component_status{component="calico",condition="progressing"} 0
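
The component status output above has one sample per (component, condition) pair, with the condition name lowercased. A minimal sketch of how such lines could be produced, assuming a simplified `condition` struct and a `statusSamples` helper that are both hypothetical and not taken from the operator code:

```go
package main

import (
	"fmt"
	"strings"
)

// condition is a hypothetical, simplified stand-in for a TigeraStatus
// condition; the real API type is not reproduced here.
type condition struct {
	Type   string // e.g. "Available", "Degraded", "Progressing"
	Active bool   // true when the condition currently holds
}

// statusSamples renders one exposition-format sample per condition,
// lowercasing the condition type to form the label value.
func statusSamples(component string, conds []condition) []string {
	lines := make([]string, 0, len(conds))
	for _, c := range conds {
		value := 0
		if c.Active {
			value = 1
		}
		lines = append(lines, fmt.Sprintf(
			"tigera_operator_component_status{component=%q,condition=%q} %d",
			component, strings.ToLower(c.Type), value))
	}
	return lines
}

func main() {
	conds := []condition{
		{Type: "Available", Active: true},
		{Type: "Degraded"},
		{Type: "Progressing"},
	}
	for _, line := range statusSamples("apiserver", conds) {
		fmt.Println(line)
	}
}
```

The sketch formats text directly to stay dependency-free; a real collector would normally go through the Prometheus client library rather than building lines by hand.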


$ curl --cert client.crt --key client.key --cacert ca.crt https://tigera-operator-metrics.tigera-operator:9484/metrics | grep tigera_operator_license
tigera_operator_license_expiry_timestamp_seconds{package="Enterprise"} 2.051337599e+09
# HELP tigera_operator_license_valid Whether the Tigera license is valid (including grace period). 1 = valid, 0 = invalid.
tigera_operator_license_valid{package="Enterprise"} 1
Adds new Prometheus metrics to the operator for monitoring TLS certificate expiry, license expiry, and TigeraStatus. When METRICS_SCHEME is set to https, the operator creates its own TLS secret, which you can replace with your own, as with all TLS secrets in the operator namespace.
Breaking: To enable metrics, you must now set the METRICS_ENABLED environment variable to true.
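
The expiry gauges in the example output are Unix timestamps printed in scientific notation, which is hard to read at a glance. A small sketch of converting such a sample value into a time; the helper name `parseUnixGauge` is made up for illustration:

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"time"
)

// parseUnixGauge converts a gauge value holding Unix seconds
// (for example "2.051337599e+09") into a UTC time.
func parseUnixGauge(value string) (time.Time, error) {
	secs, err := strconv.ParseFloat(value, 64)
	if err != nil {
		return time.Time{}, fmt.Errorf("parsing gauge value %q: %w", value, err)
	}
	if math.IsNaN(secs) || math.IsInf(secs, 0) {
		return time.Time{}, fmt.Errorf("gauge value %q is not a finite number", value)
	}
	// Split into whole seconds and the fractional remainder.
	whole, frac := math.Modf(secs)
	return time.Unix(int64(whole), int64(frac*1e9)).UTC(), nil
}

func main() {
	// Value copied from the license expiry sample in the output above.
	expiry, err := parseUnixGauge("2.051337599e+09")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("license expires at", expiry.Format(time.RFC3339))
}
```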

Comment thread pkg/render/monitor/monitor.go Outdated
rules = append(rules,
monitoringv1.Rule{
Alert: "TLSCertExpiringWarning",
Expr: intstr.FromString("tigera_operator_tls_certificate_expiry_timestamp_seconds - time() < 29 * 24 * 3600"),
Member Author

I picked 29 days, because we automatically renew our own certs 30d before expiry. I wanted to exclude that day to generate fewer alerts.

@danudey danudey modified the milestones: v1.42.0, v1.43.0 Mar 20, 2026
@rene-dekker rene-dekker marked this pull request as ready for review March 27, 2026 18:19
@rene-dekker rene-dekker requested a review from a team as a code owner March 27, 2026 18:19
Comment thread cmd/main.go Outdated
Comment thread cmd/main.go Outdated

// metricsEnabled returns true when the operator metrics endpoint is enabled.
func metricsEnabled() bool {
return strings.EqualFold(os.Getenv("METRICS_ENABLED"), "true")
Member

Technically a breaking change, since someone might be using metrics today without this set?

But probably not a big deal - easy fix.

Member Author

Yes. My rationale was that it is better to make enablement explicit; it can be reverted if desired. I doubt there are many Prometheus users of this endpoint, since before this PR there were no metrics other than the out-of-the-box ones.

Comment thread cmd/main.go Outdated
Comment thread cmd/main.go Outdated
// dynamicCertLoader dynamically loads TLS certificates from Kubernetes secrets
// for the metrics endpoint. The monitor controller creates the server cert, and
// the client CA is loaded from the Prometheus client TLS secret.
type dynamicCertLoader struct {
Member

I wonder, do we need all of this complexity?

Couldn't we just use an optional secret file mount, and have kubelet auto-load changes to the mounted paths?

Member Author

I agree it is the better approach, even though it will cause a restart of the container. Based on my reading, the kubelet may take up to a minute to trigger the restart, so there would be no metrics for a short time after deploying the operator. For overall simplicity I still think it is better. Will do, thanks.
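
For reference, the file-based alternative discussed in this thread can be sketched with the standard library alone. This is only an illustration under stated assumptions: the function name `newMTLSConfig`, the mount paths, and the fixed TLS 1.2 minimum are all hypothetical and not taken from the PR.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"os"
)

// newMTLSConfig builds a server-side TLS config from files on disk:
// a server keypair plus a CA bundle used to verify client certificates.
func newMTLSConfig(certFile, keyFile, clientCAFile string) (*tls.Config, error) {
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, fmt.Errorf("loading server keypair: %w", err)
	}
	caPEM, err := os.ReadFile(clientCAFile)
	if err != nil {
		return nil, fmt.Errorf("reading client CA bundle: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("client CA bundle contains no valid PEM certificates")
	}
	return &tls.Config{
		Certificates: []tls.Certificate{cert},
		ClientCAs:    pool,
		// Require a client certificate that chains to the CA bundle (mTLS).
		ClientAuth: tls.RequireAndVerifyClientCert,
		MinVersion: tls.VersionTLS12, // assumed default for this sketch
	}, nil
}

func main() {
	// Hypothetical mount paths; the real volume layout may differ.
	_, err := newMTLSConfig(
		"/etc/metrics-tls/tls.crt",
		"/etc/metrics-tls/tls.key",
		"/etc/metrics-ca/ca.crt",
	)
	if err != nil {
		fmt.Println("metrics TLS not ready:", err)
		return
	}
	fmt.Println("metrics TLS config loaded")
}
```

Because the files are read once at startup, picking up a rotated certificate in this sketch depends on the process being restarted, which matches the trade-off described above.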

Comment thread pkg/controller/metrics/collectors.go Outdated
Comment thread pkg/controller/metrics/collectors.go
Comment thread pkg/controller/metrics/collectors.go Outdated
Comment thread pkg/controller/metrics/collectors.go
Comment thread pkg/controller/monitor/monitor_controller.go Outdated
Comment thread pkg/render/monitor/monitor.go Outdated
Comment thread pkg/render/monitor/monitor.go
Comment thread pkg/render/monitor/monitor.go Outdated
Comment thread cmd/metrics.go Outdated
Comment thread cmd/metrics.go Outdated
Comment thread pkg/controller/metrics/collectors.go
Comment thread pkg/common/operator_namespace.go Outdated
Comment thread cmd/metrics.go Outdated
Comment thread pkg/controller/metrics/collectors.go Outdated
rene-dekker and others added 4 commits April 7, 2026 15:37
Add operator metrics endpoint with configurable mTLS via METRICS_SCHEME,
METRICS_HOST, and METRICS_PORT env vars. The monitor controller creates
a server cert, Service, and ServiceMonitor for Prometheus integration.
Client auth trusts the tigera-ca-private CA rather than individual leaf
certs. Includes a custom Prometheus collector for operator status gauges.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace the implicit metrics-enabled detection (METRICS_HOST/PORT set)
with an explicit METRICS_ENABLED=true env var. Default METRICS_HOST to
0.0.0.0 and METRICS_PORT to 8484 when enabled. Log a helpful message
when mTLS is enabled but the server certificate is not yet available.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Use port 9484 instead of 8484 to reduce the chance of conflicts on
host-networked nodes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Set SecureServing: true so controller-runtime actually serves TLS
instead of plain HTTP when METRICS_SCHEME=https. Add egress rule in
calicoSystemPrometheusPolicy to allow Prometheus to reach the operator
metrics Service.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
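
Taken together, the commits above describe an explicit METRICS_ENABLED switch, a default host of 0.0.0.0, a default port of 9484, and TLS when METRICS_SCHEME=https. A minimal sketch of that resolution logic follows; the `metricsConfig` type and `loadMetricsConfig` function are hypothetical, and the case-insensitive match on METRICS_SCHEME is an assumption.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strings"
)

// metricsConfig is a hypothetical container for the resolved settings.
type metricsConfig struct {
	Enabled bool
	Addr    string // host:port to bind, empty when disabled
	TLS     bool   // true when METRICS_SCHEME is https
}

// loadMetricsConfig resolves the settings from environment variables.
// getenv is injected so the logic can be exercised without touching the
// process environment.
func loadMetricsConfig(getenv func(string) string) metricsConfig {
	cfg := metricsConfig{Enabled: strings.EqualFold(getenv("METRICS_ENABLED"), "true")}
	if !cfg.Enabled {
		return cfg
	}
	host := getenv("METRICS_HOST")
	if host == "" {
		host = "0.0.0.0" // default host named in the commit message
	}
	port := getenv("METRICS_PORT")
	if port == "" {
		port = "9484" // default port after the 8484 -> 9484 change
	}
	cfg.Addr = net.JoinHostPort(host, port)
	// Assumption: the scheme is compared case-insensitively.
	cfg.TLS = strings.EqualFold(getenv("METRICS_SCHEME"), "https")
	return cfg
}

func main() {
	cfg := loadMetricsConfig(os.Getenv)
	fmt.Printf("enabled=%t addr=%q tls=%t\n", cfg.Enabled, cfg.Addr, cfg.TLS)
}
```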
rene-dekker and others added 4 commits April 7, 2026 15:37
Gate the tigera-operator-tls keypair creation on METRICS_SCHEME=https
rather than METRICS_ENABLED=true. The Service and ServiceMonitor are
still created whenever metrics are enabled, but the TLS certificate
is only needed when mTLS is active.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add METRICS_CLIENT_AUTH and TLS_MIN_VERSION env vars for configurable
  mTLS client auth and TLS version (aligned with calico-private)
- Extract metrics helpers into cmd/metrics.go
- Replace dynamic cert loader with file-based TLS loading via volume mounts
- Move MetricsEnabled/MetricsTLSEnabled to pkg/common for shared use
- Export certificate annotation constants from certificatemanagement package
- Rename OperatorMetricsSecretName to OperatorTLSSecretName
- Fix inconsistent method naming (serviceOperatorMetrics)
- Simplify conditionLabel to strings.ToLower
- Add debug log for unparseable cert expiry annotations
- Skip license API calls on OSS clusters (check CRD at startup)
- Remove legacy "team: network-operators" label from all ServiceMonitors
- Build ./cmd/ package instead of single file

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Move MetricsEnabled/MetricsTLSEnabled to pkg/common/metrics.go
- Move ParseTLSVersion and ParseClientAuthType to pkg/tls/tls.go
- Fix stale comment on license error path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove duplicate prometheus/client_golang dependency entry
after rebasing on latest master.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rene-dekker rene-dekker merged commit 2690b5e into tigera:master Apr 7, 2026
6 of 7 checks passed
@rene-dekker rene-dekker deleted the EV-6493 branch April 7, 2026 23:04
4 participants