---
title: metrics-collection-profiles
authors:
  - JoaoBraveCoding
  - rexagod
reviewers:
  - openshift/openshift-team-monitoring
approvers:
  - TBD
api-approvers: "None"
creation-date: 2022-12-06
last-updated: 2025-05-06
tracking-link:
  - https://issues.redhat.com/browse/MON-2483
  - https://issues.redhat.com/browse/MON-3043
  - https://issues.redhat.com/browse/MON-3808
---

# Metrics collection profiles

## Terms

- Monitors: Refers to the CRDs ServiceMonitor, PodMonitor, and Probe from
  Prometheus Operator.
- Users: Refers to end-users of OpenShift who manage an OpenShift installation,
  i.e. cluster-admins.
- Developers: Refers to OpenShift developers who build the platform, i.e. Red
  Hat associates and open-source contributors.
3130## Summary
3231
3332The core OpenShift components ship a large number of metrics. A 4.12-nightly
3433cluster on AWS (3 control plane nodes + 3 worker nodes) currently produces
35- around 350,000 unique timeseries , and adding optional operators increases that
34+ around 350,000 unique time-series , and adding optional operators increases that
3635number. Users have repeatedly asked for a supported method of making Prometheus
3736consume less memory and CPU, either by increasing the scraping interval or by
3837scraping fewer targets.

Nevertheless, users have repeatedly asked for the ability to reduce the amount
of memory consumed by Prometheus, either by lowering the Prometheus scrape
intervals or by modifying monitors.

Users currently cannot control the aforementioned monitors scraped by Prometheus
since some of the metrics collected are essential for other parts of the system
to function properly: recording rules, alerting rules, console dashboards, and
Red Hat Telemetry. Users are also not allowed to tune the interval at which
Prometheus scrapes targets, as this too can have unforeseen results that hinder
the platform: a low scrape interval value may overwhelm the platform Prometheus
instance, while a high interval value may render some of the default alerts
ineffective.

The goal of this proposal is to allow users to pick their desired level of
scraping while limiting the impact this might have on the platform, via
the `minimal` to the `full` profile. More details can be consulted in this

Moreover, through Telemetry, we collect for each cluster the top 3 Prometheus
jobs that generate the biggest number of samples. With this data, we know that
for recent OpenShift versions, the 5 components most often reported as the
biggest producers are: the Kubernetes API servers, the Kubernetes schedulers,
kube-state-metrics, kubelet, and the network daemon.

### User Stories

The goal is to support 2 profiles (how a profile is selected is sketched below):

- `full` (same as today)
- `minimal` (only collect metrics necessary for recording rules, alerts,
  dashboards, HPA, VPA and telemetry)
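
For illustration, here is a minimal sketch of how a cluster admin would select
a profile, assuming the `collectionProfile` field that CMO exposes under
`prometheusK8s` in its configuration ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      # "full" is the default; "minimal" trims collection down to the
      # metrics needed by rules, alerts, dashboards, and telemetry.
      collectionProfile: minimal
```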

When the cluster admin enables the `minimal` profile, the Prometheus
resource would be configured accordingly:

```yaml
spec:
  # ... (remainder of the example elided)
```

Note:
- the `metricRelabelings` section keeps only two metrics, while the rest are
  dropped.
- the metrics in the `keep` section were obtained with the help of a script that
  parsed all Alerts, PrometheusRules and Console dashboards to determine what

not. To aid teams with this effort the monitoring team will provide:
  to utilize all aspects of this feature into their component's workflow.
- an origin/CI test that validates for all Alerts and PrometheusRules that the
  metrics used by them are present in the `keep` expression of the
  monitor for the `minimal` profile (an illustrative profile-scoped monitor is
  sketched below).
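
As a sketch of what an adopting team might ship (the component, namespace, and
metric names here are hypothetical; the `monitoring.openshift.io/collection-profile`
label is what profile-aware selection keys off):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  # Hypothetical component: teams ship this monitor alongside their
  # default ("full") one.
  name: my-operator-minimal
  namespace: openshift-my-operator
  labels:
    monitoring.openshift.io/collection-profile: minimal
spec:
  selector:
    matchLabels:
      app: my-operator
  endpoints:
    - port: metrics
      metricRelabelings:
        # Keep only the metrics referenced by alerts, recording rules, and
        # dashboards; every other series is dropped at scrape time.
        - action: keep
          sourceLabels: [__name__]
          regex: (my_operator_up|my_operator_reconcile_errors_total)
```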

### Risks and Mitigations

- How are monitors supposed to be kept up to date? If a metric that wasn't
  previously used in an alert becomes required, how does the monitor responsible
  for that metric get updated?
  - The origin/CI test mentioned in the previous section will fail if there is a
    resource (Alerts/PrometheusRules/Dashboards) using a metric which is not
    present in the monitor in question.

- What happens if a user provides an invalid value for a metrics collection
  profile?

  - Our current validation strategy with only two profiles is quite linear;
    however, things become more complex and harder to maintain as we
    introduce new profiles to the mix.

### Drawbacks

- Extra CI cycles.

## Design Details

### Open Questions

### Test Plan

- Unit tests in CMO to validate that the correct monitors are being selected.
- E2E tests in CMO to validate that everything works correctly.
- For the `minimal` profile, an origin/CI test to validate that every metric
  used in a resource (Alerts/PrometheusRules/Dashboards) exists in the `keep`
  expression of the minimal monitors.

shouldn't impact operations.
profile out-of-the-box and removes the earlier-imposed
TechPreview gate. PTAL at the section below for more details.

- GA'd in 4.19: https://github.com/openshift/api/pull/2286

#### Tech Preview -> GA

- [Automation to update metrics in collection profiles](https://issues.redhat.com/browse/MON-3106)

#### Removing a deprecated feature

Deprecation, in the scope of collection profiles, is unlikely, as it would
entail either moving all existing ServiceMonitors in a profile into other
profiles or removing them altogether, and this would have to be done by every
team. Either measure would force teams to roll back behaviours they have built
around the feature. As it stands, we do not plan on deprecating the exposed
`full` or `minimal` profiles at all.

### Upgrade / Downgrade Strategy

- Once we backport the new monitor selectors, upgrades and
  downgrades are not expected to present a significant challenge (the shape of
  that selector is sketched below).
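
For context, here is a minimal sketch of what a profile-aware selector on the
platform Prometheus resource could look like; this illustrates the mechanism
under the assumption that selection keys off the collection-profile label, and
is not CMO's literal output:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: openshift-monitoring
spec:
  serviceMonitorSelector:
    matchExpressions:
      # With the minimal profile enabled, only monitors labelled for that
      # profile are selected; the label key mirrors the one used on the
      # ServiceMonitors above.
      - key: monitoring.openshift.io/collection-profile
        operator: In
        values:
          - minimal
```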

## Implementation History

- https://github.com/openshift/cluster-monitoring-operator/pull/1785
- https://github.com/openshift/cluster-monitoring-operator/pull/2030
- https://github.com/openshift/cluster-monitoring-operator/pull/2047
- https://github.com/openshift/origin/pull/28889

## Alternatives

  - After some consideration we decided to abandon this idea since it would only
    work for resources controlled by CVO, which is not the case for the majority
    of ServiceMonitors.
  - Additionally, this requires users to "commit" to one profile throughout the
    cluster lifecycle, which is a bit static for our needs; for more details,
    PTAL at [the reasoning here](https://github.com/openshift/enhancements/pull/1298#issuecomment-1895513051).

### Adopted metrics collection profiles