@@ -18,11 +18,14 @@ tracking-link:
 
 ## Terms
 
-monitors - refers to the CRDs ServiceMonitor, PodMonitor and Probe from Prometheus Operator;
+monitors - refers to the CRDs ServiceMonitor, PodMonitor and Probe from
+Prometheus Operator;
 
-users - refers to end-users of OpenShift who manage an OpenShift installation i.e cluster-admins;
+users - refers to end-users of OpenShift who manage an OpenShift installation
+i.e cluster-admins;
 
-developers - refers to OpenShift developers that build the platform i.e. RedHat associates and OpenSource contributors;
+developers - refers to OpenShift developers that build the platform i.e. RedHat
+associates and OpenSource contributors;
 
 
 ## Summary
@@ -48,14 +51,14 @@ Nevertheless, users have repeatedly asked for the ability to reduce the amount
 of memory consumed by Prometheus either by lowering the Prometheus scrape
 intervals or by modifying monitors.
 
-Users currently can not control the aforementioned monitors scraped by Prometheus
-since some of the metrics collected are essential for other parts of the system
-to function properly: recording rules, alerting rules, console dashboards, and
-Red Hat Telemetry. Users also are not allowed to tune the interval at which
-Prometheus scrapes targets as this again can have unforeseen results that can
-hinder the platform: a low scrape interval value may overwhelm the platform
-Prometheus instance while a high interval value may render some of the default
-alerts ineffective.
+Users currently can not control the aforementioned monitors scraped by
+Prometheus since some of the metrics collected are essential for other parts of
+the system to function properly: recording rules, alerting rules, console
+dashboards, and Red Hat Telemetry. Users also are not allowed to tune the
+interval at which Prometheus scrapes targets as this again can have unforeseen
+results that can hinder the platform: a low scrape interval value may overwhelm
+the platform Prometheus instance while a high interval value may render some of
+the default alerts ineffective.
 
 The goal of this proposal is to allow users to pick their desired level of
 scraping while limiting the impact this might have on the platform, via
@@ -90,17 +93,17 @@ kubelet and the network daemon.
 - As a developer, I want a supported way to collect a subset of the metrics
   exported by my operator and operands, while still collecting necessary metrics
   for alerts, visualization of key indicators and Telemetry.
-- As a developer of a component (that does not yet implement a profile), I want to
-  extract metrics needed to implement said profile, based on the assets I
+- As a developer of a component (that does not yet implement a profile), I want
+  to extract metrics needed to implement said profile, based on the assets I
   provide, or the ones gathered from the cluster based on a group of target
   selectors, and a plug-in relabel configuration to apply within the monitor.
-- As a developer of a component (that does not, or only partially implements a profile),
-  I want to get information about any monitors that are not yet implemented for
-  any of the supported profiles that are offered.
-- As a developer of a component (that implements a profile), I want to verify if all the
-  profile metrics are present in the cluster, and which of the profile monitors
-  are affected if not. Also, I want additional information to narrow down where
-  these metrics are exactly being used.
+- As a developer of a component (that does not, or only partially implements a
+  profile), I want to get information about any monitors that are not yet
+  implemented for any of the supported profiles that are offered.
+- As a developer of a component (that implements a profile), I want to verify if
+  all the profile metrics are present in the cluster, and which of the profile
+  monitors are affected if not. Also, I want additional information to narrow
+  down where these metrics are exactly being used.
 
 ### Goals
 
@@ -325,7 +328,8 @@ not. To aid teams with this effort the monitoring team will provide:
     resource (Alerts/PrometheusRules/Dashboards) using a metric which is not
     present in the monitor in question;
 
-- What happens if a user provides an invalid value for a metrics collection profile?
+- What happens if a user provides an invalid value for a metrics collection
+  profile?
   - CMO will reconcile and validate that the value supplied is invalid and it
     will report Degraded=False and fail reconciliation.
 
@@ -352,8 +356,8 @@ not. To aid teams with this effort the monitoring team will provide:
 - Unit tests in CMO to validate that the correct monitors are being selected
 - E2E tests in CMO to validate that everything works correctly
 - For the `minimal` profile, origin/CI test to validate that every metric used
-  in a resource (Alerts/PrometheusRules/Dashboards) exists in the `keep` expression
-  of a minimal monitors.
+  in a resource (Alerts/PrometheusRules/Dashboards) exists in the `keep`
+  expression of a minimal monitors.
 
 ### Graduation Criteria
 
@@ -450,10 +454,11 @@ Initial proofs-of-concept:
   - [Azure
     Docs](https://learn.microsoft.com/en-us/azure/azure-monitor/essentials/prometheus-metrics-scrape-configuration-minimal)
   - https://github.com/Azure/prometheus-collector
-  - In their approach they also have [hardcoded](https://github.com/Azure/prometheus-collector/blob/66ed1a5a27781d7e7e3bb1771b11f1da25ffa79c/otelcollector/configmapparser/tomlparser-default-targets-metrics-keep-list.rb#L28)
+  - In their approach they also have
+    [hardcoded](https://github.com/Azure/prometheus-collector/blob/66ed1a5a27781d7e7e3bb1771b11f1da25ffa79c/otelcollector/configmapparser/tomlparser-default-targets-metrics-keep-list.rb#L28)
     set of metrics that are only consumed when the minimal profile is enabled.
-    However, customers are also able to extend this minimal profile with regexes to
-    include metrics which might be interesting to them.
+    However, customers are also able to extend this minimal profile with regexes
+    to include metrics which might be interesting to them.
 - Leverage [installer capabilities](https://docs.google.com/document/d/1I-YT7LKKDHSBLB6Hmg0tZ54DWjrAxlVdXxlViShMu-0/edit#heading=h.848jsje80fru)
   - After some consideration we decided to abandon this idea since it would only
     work for resources controlled by CVO which is not the case for the majority