Commit 559e546

MetricsCollectionProfiles: Reword and update KEP

Re-opening the KEP PR to backfill on the required proposal context.
Signed-off-by: Pranshu Srivastava <[email protected]>

1 parent 895a089

1 file changed (+45, −60 lines)

enhancements/monitoring/metrics-collection-profiles.md

@@ -2,37 +2,36 @@
 title: metrics-collection-profiles
 authors:
 - JoaoBraveCoding
+- rexagod
 reviewers:
 - openshift/openshift-team-monitoring
 approvers:
 - TBD
 api-approvers: "None"
 creation-date: 2022-12-06
-last-updated: 2023-07-24
+last-updated: 2025-05-06
 tracking-link:
 - https://issues.redhat.com/browse/MON-2483
 - https://issues.redhat.com/browse/MON-3043
+- https://issues.redhat.com/browse/MON-3808
 ---
 
 # Metrics collection profiles
 
 ## Terms
 
-monitors - refers to the CRDs ServiceMonitor, PodMonitor and Probe from
-Prometheus Operator;
-
-users - refers to end-users of OpenShift who manage an OpenShift installation
-i.e cluster-admins;
-
-developers - refers to OpenShift developers that build the platform i.e. RedHat
-associates and OpenSource contributors;
-
+- Monitors: The CRDs ServiceMonitor, PodMonitor and Probe from the
+  Prometheus Operator.
+- Users: End users of OpenShift who manage an OpenShift installation,
+  i.e. cluster admins.
+- Developers: OpenShift developers who build the platform, i.e. Red Hat
+  associates and open-source contributors.
 
 ## Summary
 
 The core OpenShift components ship a large number of metrics. A 4.12-nightly
 cluster on AWS (3 control plane nodes + 3 worker nodes) currently produces
-around 350,000 unique timeseries, and adding optional operators increases that
+around 350,000 unique time series, and adding optional operators increases that
 number. Users have repeatedly asked for a supported method of making Prometheus
 consume less memory and CPU, either by increasing the scraping interval or by
 scraping fewer targets.
@@ -51,14 +50,14 @@ Nevertheless, users have repeatedly asked for the ability to reduce the amount
 of memory consumed by Prometheus either by lowering the Prometheus scrape
 intervals or by modifying monitors.
 
-Users currently can not control the aforementioned monitors scraped by
-Prometheus since some of the metrics collected are essential for other parts of
-the system to function properly: recording rules, alerting rules, console
-dashboards, and Red Hat Telemetry. Users also are not allowed to tune the
-interval at which Prometheus scrapes targets as this again can have unforeseen
-results that can hinder the platform: a low scrape interval value may overwhelm
-the platform Prometheus instance while a high interval value may render some of
-the default alerts ineffective.
+Users currently cannot control the aforementioned monitors scraped by Prometheus
+since some of the metrics collected are essential for other parts of the system
+to function properly: recording rules, alerting rules, console dashboards, and
+Red Hat Telemetry. Users are also not allowed to tune the interval at which
+Prometheus scrapes targets, as this too can have unforeseen results that can
+hinder the platform: a low scrape interval may overwhelm the platform
+Prometheus instance, while a high interval may render some of the default
+alerts ineffective.
 
 The goal of this proposal is to allow users to pick their desired level of
 scraping while limiting the impact this might have on the platform, via
@@ -80,10 +79,9 @@ the `minimal` to the `full` profile. More details can be consulted in this
 
 Moreover, through Telemetry, we collect for each cluster the top 3 Prometheus
 jobs that generate the largest number of samples. With this data, we know that
-for OpenShift 4.11 the 5 components most often reported as the biggest producers
-are: the Kubernetes API servers, the Kubernetes schedulers, kube-state-metrics,
-kubelet and the network daemon.
-
+for recent OpenShift versions, the 5 components most often reported as the
+biggest producers are: the Kubernetes API servers, the Kubernetes schedulers,
+kube-state-metrics, kubelet and the network daemon.
 
 ### User Stories
 
@@ -179,7 +177,7 @@ The goal is to support 2 profiles:
 
 - `full` (same as today)
 - `minimal` (only collect metrics necessary for recording rules, alerts,
-  dashboards, HPA and VPA and telemetry)
+  dashboards, HPA, VPA and telemetry)
 
 When the cluster admin enables the `minimal` profile, the Prometheus
 resource would be configured accordingly:
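The concrete manifest is elided in this diff view. As a hedged illustration only of the selection idea (the label key `monitoring.openshift.io/collection-profile` and the values below are assumptions, not necessarily the proposal's exact scheme), the Prometheus resource could select minimal-profile monitors roughly like this:

```yaml
# Illustrative sketch only: a Prometheus resource that selects monitors
# belonging to one collection profile via a label selector.
# The label key is an assumption for illustration.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: openshift-monitoring
spec:
  serviceMonitorSelector:
    matchExpressions:
    - key: monitoring.openshift.io/collection-profile
      operator: In
      values:
      - minimal
```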
@@ -253,7 +251,7 @@ spec:
 ```
 
 Note:
-- the `metricRelabelings` section keeps only two metrics, while the rest is
+- the `metricRelabelings` section keeps only two metrics, while the rest are
   dropped.
 - the metrics in the `keep` section were obtained with the help of a script that
   parsed all Alerts, PrometheusRules and Console dashboards to determine what
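The full monitor example falls outside this hunk's context. As a rough sketch of the `keep` mechanism the note above describes (the resource name and metric names here are made up for illustration, not the proposal's actual monitor):

```yaml
# Hypothetical minimal-profile ServiceMonitor: only metrics matched by the
# `keep` regex survive; every other series is dropped at scrape time.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubelet-minimal          # hypothetical name
  namespace: openshift-monitoring
spec:
  endpoints:
  - port: https-metrics
    metricRelabelings:
    - sourceLabels: [__name__]
      action: keep
      regex: (kubelet_running_pods|kubelet_node_name)  # metrics assumed for illustration
  selector:
    matchLabels:
      app.kubernetes.io/name: kubelet
```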
@@ -316,17 +314,16 @@ not. To aid teams with this effort the monitoring team will provide:
   to utilize all aspects of this feature into their component's workflow.
 - an origin/CI test that validates for all Alerts and PrometheusRules that the
   metrics used by them are present in the `keep` expression of the
-  monitor for the `minimal` profile
-
+  monitor for the `minimal` profile.
 
 ### Risks and Mitigations
 
-- How are monitors supposed to be kept up to date? In 4.12 a metric that wasn't
-  being used in an alert is now required, how does the monitor responsible for
+- How are monitors supposed to be kept up to date? If a metric that was not
+  previously used in an alert becomes required, how does the monitor responsible for
   that metric get updated?
 - The origin/CI test mentioned in the previous section will fail if there is a
   resource (Alerts/PrometheusRules/Dashboards) using a metric which is not
-  present in the monitor in question;
+  present in the monitor in question.
 
 - What happens if a user provides an invalid value for a metrics collection
   profile?
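For context on where such a value would come from: the profile would presumably be selected through the cluster monitoring ConfigMap, along these lines (the `collectionProfile` field name and its placement are assumptions, not confirmed by this diff):

```yaml
# Hedged sketch of the user-facing knob; an invalid value here would be
# what the risk above refers to.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      collectionProfile: minimal
```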
@@ -337,24 +334,19 @@ not. To aid teams with this effort the monitoring team will provide:
 - Our current validation strategy with only two profiles is quite linear;
   however, things start becoming more complex and hard to maintain as we
   introduce new profiles to the mix.
-- Some of the things to consider if new profiles are introduced are:
-  - How would we validate such a profile?
-  - How would we ensure teams that adopted metrics collection profiles
-    implement the new profile?
-  - How would we aid developers implementing the new profile?
 
 ### Drawbacks
 
-- Extra CI cycles
+- Extra CI cycles.
 
 ## Design Details
 
 ### Open Questions
 
 ### Test Plan
 
-- Unit tests in CMO to validate that the correct monitors are being selected
-- E2E tests in CMO to validate that everything works correctly
+- Unit tests in CMO to validate that the correct monitors are being selected.
+- E2E tests in CMO to validate that everything works correctly.
 - For the `minimal` profile, an origin/CI test to validate that every metric used
   in a resource (Alerts/PrometheusRules/Dashboards) exists in the `keep`
   expression of the minimal monitors.
@@ -368,6 +360,8 @@ shouldn't impact operations.
 profile out-of-the-box and removes the earlier-imposed
 TechPreview gate. PTAL at the section below for more details.
 
+- GA'd in 4.19: https://github.com/openshift/api/pull/2286
+
 #### Tech Preview -> GA
 
 - [Automation to update metrics in collection profiles](https://issues.redhat.com/browse/MON-3106)
@@ -380,8 +374,13 @@ TechPreview gate. PTAL at the section below for more details.
 
 #### Removing a deprecated feature
 
-- Announce deprecation and support policy of the existing feature
-- Deprecate the feature
+Deprecation, in the scope of collection profiles, is unlikely, as it would
+entail either moving all existing ServiceMonitors in a profile into other
+profiles or removing them altogether, which every team would have to do. Either
+measure would force teams to roll back significantly on behaviours they built
+around in the first place. As it stands, we do not plan on deprecating the
+exposed `full` or `minimal` profiles at all.
 
 ### Upgrade / Downgrade Strategy
 
@@ -395,27 +394,12 @@ TechPreview gate. PTAL at the section below for more details.
395394
- Once we backport the new monitors selectors upgrades and
396395
downgrades are not expected to present a significant challenge.
397396

398-
### Version Skew Strategy
399-
400-
TBD but I don't think it applies here
401-
402-
### minimal Aspects of API Extensions
403-
404-
TBD but I don't think it applies here
405-
406-
#### Failure Modes
407-
408-
TBD but I don't think it applies here
409-
410-
#### Support Procedures
411-
412-
TBD but I don't think it applies here
413-
414397
## Implementation History
415398

416-
Initial proofs-of-concept:
417-
418399
- https://github.com/openshift/cluster-monitoring-operator/pull/1785
400+
- https://github.com/openshift/cluster-monitoring-operator/pull/2030
401+
- https://github.com/openshift/cluster-monitoring-operator/pull/2047
402+
- https://github.com/openshift/origin/pull/28889
419403

420404
## Alternatives
421405

@@ -463,8 +447,9 @@
 - After some consideration we decided to abandon this idea since it would only
   work for resources controlled by CVO, which is not the case for the majority
   of ServiceMonitors.
-
-## Infrastructure Needed [optional]
+- Additionally, this requires users to "commit" to one profile throughout the
+  cluster lifecycle, which is too static for our needs; for more details,
+  PTAL at [the reasoning here](https://github.com/openshift/enhancements/pull/1298#issuecomment-1895513051).
 
 ### Adopted metrics collection profiles
 