---
title: metrics-collection-profiles
authors:
  - JoaoBraveCoding
  - rexagod
reviewers:
  - openshift/openshift-team-monitoring
approvers:
  - "@openshift/openshift-team-monitoring"
api-approvers:
  - "@dgrisonnet"
creation-date: 2022-12-06
last-updated: 2025-05-06
tracking-link:
  - https://issues.redhat.com/browse/MON-2483
  - https://issues.redhat.com/browse/MON-3043
  - https://issues.redhat.com/browse/MON-3808
---

# Metrics collection profiles

## Terms

- Monitors: Refers to the CRDs ServiceMonitor, PodMonitor and Probe from
  Prometheus Operator.
- Users: Refers to end-users of OpenShift who manage an OpenShift installation,
  i.e. cluster-admins.
- Developers: Refers to OpenShift developers that build the platform, i.e. Red Hat
  associates and open-source contributors.

## Summary

The core OpenShift components ship a large number of metrics. A 4.12-nightly
cluster on AWS (3 control plane nodes + 3 worker nodes) currently produces
around 350,000 unique time series, and adding optional operators increases that
number. Users have repeatedly asked for a supported method of making Prometheus
consume less memory and CPU, either by increasing the scraping interval or by
scraping fewer targets.
Nevertheless, users have repeatedly asked for the ability to reduce the amount
of memory consumed by Prometheus either by lowering the Prometheus scrape
intervals or by modifying monitors.

Users currently cannot control the aforementioned monitors scraped by Prometheus
since some of the metrics collected are essential for other parts of the system
to function properly: recording rules, alerting rules, console dashboards, and
Red Hat Telemetry. Users are also not allowed to tune the interval at which
Prometheus scrapes targets as this again can have unforeseen results that can
hinder the platform: a low scrape interval value may overwhelm the platform
Prometheus instance while a high interval value may render some of the default
alerts ineffective.

The goal of this proposal is to allow users to pick their desired level of
scraping while limiting the impact this might have on the platform, via
metrics collection profiles.

Moreover, through Telemetry, we collect for each cluster the top 3 Prometheus
jobs that generate the biggest amount of samples. With this data, we know that
for recent OpenShift versions, the 5 components most often reported as the
biggest producers are: the Kubernetes API servers, the Kubernetes schedulers,
kube-state-metrics, kubelet and the network daemon.

### User Stories

The goal is to support 2 profiles:

- `full` (same as today)
- `minimal` (only collect metrics necessary for recording rules, alerts,
  dashboards, HPA, VPA and telemetry)

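How a cluster admin opts into a profile is not shown in this excerpt. As an
illustrative sketch only, assuming the profile is selected through a
`collectionProfile` field under `prometheusK8s` in the `cluster-monitoring-config`
ConfigMap (both names are assumptions, not confirmed by this section), the
user-facing knob could look like:

```yaml
# Illustrative sketch only: opting into the minimal collection profile.
# The `collectionProfile` field name is an assumption for this example.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      collectionProfile: minimal
```
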
When the cluster admin enables the `minimal` profile, the Prometheus
resource would be configured accordingly:
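
The full example is not reproduced in this excerpt. As a rough sketch of the
mechanism, assuming monitors are partitioned with a
`monitoring.openshift.io/collection-profile` label and that the minimal variants
carry a `keep` relabeling (the label name, selector expression, monitor name and
metric names below are all illustrative assumptions):

```yaml
# Rough sketch, not the proposal's actual manifests: a Prometheus selector that
# only picks up monitors belonging to the minimal profile, plus a hypothetical
# minimal ServiceMonitor that keeps two metrics and drops everything else.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: openshift-monitoring
spec:
  serviceMonitorSelector:
    matchExpressions:
      - key: monitoring.openshift.io/collection-profile
        operator: In
        values:
          - minimal
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: node-exporter-minimal
  namespace: openshift-monitoring
  labels:
    monitoring.openshift.io/collection-profile: minimal
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  endpoints:
    - port: https
      interval: 15s
      metricRelabelings:
        # Keep only the metrics needed by rules, alerts, dashboards and
        # telemetry; drop the rest.
        - sourceLabels: [__name__]
          action: keep
          regex: "(node_cpu_seconds_total|node_memory_MemAvailable_bytes)"
```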

Note:
- the `metricRelabelings` section keeps only two metrics, while the rest are
  dropped.
- the metrics in the `keep` section were obtained with the help of a script that
  parsed all Alerts, PrometheusRules and Console dashboards to determine what
  metrics are needed.
  to utilize all aspects of this feature into their component's workflow.
- an origin/CI test that validates for all Alerts and PrometheusRules that the
  metrics used by them are present in the `keep` expression of the
  monitor for the `minimal` profile.

### Risks and Mitigations

- How are monitors supposed to be kept up to date? A metric that wasn't being
  used earlier in an alert is now required; how does the monitor responsible for
  that metric get updated?
  - The origin/CI test mentioned in the previous section will fail if there is a
    resource (Alerts/PrometheusRules/Dashboards) using a metric which is not
    present in the monitor in question.

- What happens if a user provides an invalid value for a metrics collection
  profile?
  - Our current validation strategy with only two profiles is quite linear;
    however, things start becoming more complex and hard to maintain as we
    introduce new profiles to the mix.

### Drawbacks

- Extra CI cycles.

## Design Details

### Open Questions

## Test Plan

- Unit tests in CMO to validate that the correct monitors are being selected.
- E2E tests in CMO to validate that everything works correctly.
- For the `minimal` profile, origin/CI test to validate that every metric used
  in a resource (Alerts/PrometheusRules/Dashboards) exists in the `keep`
  expression of the minimal monitors.

## Graduation Criteria

- Released as TechPreview: the default being `full`, it
  shouldn't impact operations.
profile out-of-the-box and removes the earlier-imposed
TechPreview gate. See the section below for more details.

- GA'd in 4.19: https://github.com/openshift/api/pull/2286

### Dev Preview -> Tech Preview

- [Design scrape profiles in CMO](https://issues.redhat.com/browse/MON-2483)

### Tech Preview -> GA

- [Automation to update metrics in collection profiles](https://issues.redhat.com/browse/MON-3106)
- [Telemetry signal for collection profile usage](https://issues.redhat.com/browse/MON-3231)
- [origin/CI tool to validate collection profiles](https://issues.redhat.com/browse/MON-3105)
- [User facing documentation created in OpenShift-docs](https://issues.redhat.com/browse/OBSDOCS-330)

### Removing a deprecated feature

Deprecation is unlikely in the scope of collection profiles, as it would entail
moving all existing ServiceMonitors in a profile either into other profiles or
out of existence entirely, and this would have to be done by every team. Either
measure would require teams to roll back drastically on behaviours they built
around in the first place. As it stands, we do not plan on deprecating the
exposed `full` or `minimal` profiles at all.

## Upgrade / Downgrade Strategy

- If metrics collection profiles are accepted and released in 4.13, then we
  must backport the new monitors selectors to 4.12. The reason being that when
- Once we backport the new monitors selectors, upgrades and
  downgrades are not expected to present a significant challenge.

## Implementation History

- https://github.com/openshift/cluster-monitoring-operator/pull/1785
- https://github.com/openshift/cluster-monitoring-operator/pull/2030
- https://github.com/openshift/cluster-monitoring-operator/pull/2047
- https://github.com/openshift/origin/pull/28889

## Alternatives (Not Implemented)

- Make CMO inject metric relabelling for all service monitors based on the
  rules being deployed, but this is not a good idea because:
- Recently Azure also added support for metrics collection profiles:
  - [Azure
    Docs](https://learn.microsoft.com/en-us/azure/azure-monitor/essentials/prometheus-metrics-scrape-configuration-minimal)
  - https://github.com/Azure/prometheus-collector
  - In their approach they also have a
    [hardcoded](https://github.com/Azure/prometheus-collector/blob/66ed1a5a27781d7e7e3bb1771b11f1da25ffa79c/otelcollector/configmapparser/tomlparser-default-targets-metrics-keep-list.rb#L28)
    set of metrics that are only consumed when the minimal profile is enabled.
    However, customers are also able to extend this minimal profile with regexes
    to include metrics which might be interesting to them.
- Leverage [installer
  capabilities](https://docs.google.com/document/d/1I-YT7LKKDHSBLB6Hmg0tZ54DWjrAxlVdXxlViShMu-0/edit#heading=h.848jsje80fru)
  - After some consideration we decided to abandon this idea since it would only
    work for resources controlled by CVO, which is not the case for the majority
    of ServiceMonitors.
  - Additionally, this requires users to "commit" to one profile throughout the
    cluster lifecycle, which is a bit static for our needs; for more details,
    see [the reasoning
    here](https://github.com/openshift/enhancements/pull/1298#issuecomment-1895513051).

### Adopted metrics collection profiles

| Monitoring Team | kube-state-metrics | Implemented |
| Monitoring Team | node-exporter | Implemented |
| Monitoring Team | prometheus-adapter | Implemented |

### Topology Considerations

Supported on all topologies that deploy CMO.

#### Hypershift / Hosted Control Planes

N/A

#### Standalone Clusters

N/A

#### Single-node Deployments or MicroShift

N/A

## Support Procedures

- The `full` collection profile is meant to be synonymous with the behavior
  exhibited before this feature was introduced; as such, users can switch back
  to it if other profiles are not working as expected (see the sketch below).
- The aforementioned utilities (e.g., CPV) can be used to help diagnose the
  issue further.

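As a sketch of that revert path, reusing the same assumed `collectionProfile`
field as in the earlier illustrative example (the field name is an assumption,
not confirmed by this excerpt):

```yaml
# Illustrative sketch only: reverting the platform to the default profile.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      collectionProfile: full
```
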
## Version Skew Strategy

The feature depends on the Prometheus Operator, which ships with CMO, as the
provider of the resources it operates on. No version skew is expected.

## Operational Aspects of API Extensions

N/A