---
title: metrics-collection-profiles
authors:
  - JoaoBraveCoding
  - rexagod
reviewers:
  - openshift/openshift-team-monitoring
approvers:
  - TBD
api-approvers: "None"
creation-date: 2022-12-06
last-updated: 2025-05-06
tracking-link:
  - https://issues.redhat.com/browse/MON-2483
  - https://issues.redhat.com/browse/MON-3043
  - https://issues.redhat.com/browse/MON-3808
---

# Metrics collection profiles

## Terms

- Monitors: Refers to the CRDs ServiceMonitor, PodMonitor, and Probe from
  Prometheus Operator.
- Users: Refers to end-users of OpenShift who manage an OpenShift installation,
  i.e. cluster-admins.
- Developers: Refers to OpenShift developers who build the platform, i.e. Red
  Hat associates and open-source contributors.
3130## Summary
3231
3332The core OpenShift components ship a large number of metrics. A 4.12-nightly
3433cluster on AWS (3 control plane nodes + 3 worker nodes) currently produces
35- around 350,000 unique timeseries , and adding optional operators increases that
34+ around 350,000 unique time-series , and adding optional operators increases that
3635number. Users have repeatedly asked for a supported method of making Prometheus
3736consume less memory and CPU, either by increasing the scraping interval or by
3837scraping fewer targets.

Nevertheless, users have repeatedly asked for the ability to reduce the amount
of memory consumed by Prometheus, either by lowering the Prometheus scrape
intervals or by modifying monitors.

Users currently cannot control the aforementioned monitors scraped by Prometheus
since some of the metrics collected are essential for other parts of the system
to function properly: recording rules, alerting rules, console dashboards, and
Red Hat Telemetry. Users are also not allowed to tune the interval at which
Prometheus scrapes targets, as this too can have unforeseen results that hinder
the platform: a low scrape interval value may overwhelm the platform Prometheus
instance, while a high interval value may render some of the default alerts
ineffective.

The goal of this proposal is to allow users to pick their desired level of
scraping while limiting the impact this might have on the platform, via
the `minimal` to the `full` profile. More details can be consulted in this

Moreover, through Telemetry, we collect for each cluster the top 3 Prometheus
jobs that generate the biggest number of samples. With this data, we know that
for recent OpenShift versions, the 5 components most often reported as the
biggest producers are: the Kubernetes API servers, the Kubernetes schedulers,
kube-state-metrics, kubelet, and the network daemon.

### User Stories

The goal is to support 2 profiles (how a profile is selected is sketched below):

- `full` (same as today)
- `minimal` (only collect metrics necessary for recording rules, alerts,
  dashboards, HPA, VPA and telemetry)
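
For illustration, here is a minimal sketch of how a cluster admin would select
a profile, assuming the `collectionProfile` field that CMO exposes under
`prometheusK8s` in its configuration ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      # "full" is the default; "minimal" trims collection down to the
      # metrics needed by rules, alerts, dashboards, and telemetry.
      collectionProfile: minimal
```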

When the cluster admin enables the `minimal` profile, the Prometheus
resource would be configured accordingly:

```yaml
spec:
  # ... (remainder of the example elided)
```

Note:
- the `metricRelabelings` section keeps only two metrics, while the rest are
  dropped.
- the metrics in the `keep` section were obtained with the help of a script that
  parsed all Alerts, PrometheusRules and Console dashboards to determine what

not. To aid teams with this effort the monitoring team will provide:
  to utilize all aspects of this feature into their component's workflow.
- an origin/CI test that validates for all Alerts and PrometheusRules that the
  metrics used by them are present in the `keep` expression of the
  monitor for the `minimal` profile (an illustrative profile-scoped monitor is
  sketched below).
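
As a sketch of what an adopting team might ship (the component, namespace, and
metric names here are hypothetical; the `monitoring.openshift.io/collection-profile`
label is what profile-aware selection keys off):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  # Hypothetical component: teams ship this monitor alongside their
  # default ("full") one.
  name: my-operator-minimal
  namespace: openshift-my-operator
  labels:
    monitoring.openshift.io/collection-profile: minimal
spec:
  selector:
    matchLabels:
      app: my-operator
  endpoints:
    - port: metrics
      metricRelabelings:
        # Keep only the metrics referenced by alerts, recording rules, and
        # dashboards; every other series is dropped at scrape time.
        - action: keep
          sourceLabels: [__name__]
          regex: (my_operator_up|my_operator_reconcile_errors_total)
```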

### Risks and Mitigations

- How are monitors supposed to be kept up to date? If a metric that wasn't
  previously used in an alert becomes required, how does the monitor responsible
  for that metric get updated?
  - The origin/CI test mentioned in the previous section will fail if there is a
    resource (Alerts/PrometheusRules/Dashboards) using a metric which is not
    present in the monitor in question.

- What happens if a user provides an invalid value for a metrics collection
  profile?

  - Our current validation strategy with only two profiles is quite linear;
    however, things become more complex and harder to maintain as we
    introduce new profiles to the mix.

### Drawbacks

- Extra CI cycles.

## Design Details

### Open Questions

### Test Plan

- Unit tests in CMO to validate that the correct monitors are being selected.
- E2E tests in CMO to validate that everything works correctly.
- For the `minimal` profile, an origin/CI test to validate that every metric
  used in a resource (Alerts/PrometheusRules/Dashboards) exists in the `keep`
  expression of the minimal monitors.

shouldn't impact operations.
profile out-of-the-box and removes the earlier-imposed
TechPreview gate. PTAL at the section below for more details.

- GA'd in 4.19: https://github.com/openshift/api/pull/2286

#### Tech Preview -> GA

- [Automation to update metrics in collection profiles](https://issues.redhat.com/browse/MON-3106)

#### Removing a deprecated feature

Deprecation, in the scope of collection profiles, is unlikely, as it would
entail either moving all existing ServiceMonitors in a profile into other
profiles or removing them altogether, and this would have to be done by every
team. Either measure would force teams to roll back behaviours they have built
around the feature. As it stands, we do not plan on deprecating the exposed
`full` or `minimal` profiles at all.

### Upgrade / Downgrade Strategy

- Once we backport the new monitor selectors, upgrades and
  downgrades are not expected to present a significant challenge (the shape of
  that selector is sketched below).
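
For context, here is a minimal sketch of what a profile-aware selector on the
platform Prometheus resource could look like; this illustrates the mechanism
under the assumption that selection keys off the collection-profile label, and
is not CMO's literal output:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: openshift-monitoring
spec:
  serviceMonitorSelector:
    matchExpressions:
      # With the minimal profile enabled, only monitors labelled for that
      # profile are selected; the label key mirrors the one used on the
      # ServiceMonitors above.
      - key: monitoring.openshift.io/collection-profile
        operator: In
        values:
          - minimal
```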

## Implementation History

- https://github.com/openshift/cluster-monitoring-operator/pull/1785
- https://github.com/openshift/cluster-monitoring-operator/pull/2030
- https://github.com/openshift/cluster-monitoring-operator/pull/2047
- https://github.com/openshift/origin/pull/28889

## Alternatives

  - After some consideration we decided to abandon this idea since it would only
    work for resources controlled by CVO, which is not the case for the majority
    of ServiceMonitors.
  - Additionally, this requires users to "commit" to one profile throughout the
    cluster lifecycle, which is a bit static for our needs; for more details,
    PTAL at [the reasoning here](https://github.com/openshift/enhancements/pull/1298#issuecomment-1895513051).

### Adopted metrics collection profiles