Commit 577a948

MetricsCollectionProfiles: Reword and update KEP
Re-opening the KEP PR to backfill on the required proposal context.

Signed-off-by: Pranshu Srivastava <[email protected]>
1 parent 8f763cb commit 577a948

enhancements/monitoring/metrics-collection-profiles.md

Lines changed: 100 additions & 74 deletions
@@ -2,37 +2,37 @@
22
title: metrics-collection-profiles
33
authors:
44
- JoaoBraveCoding
5+
- rexagod
56
reviewers:
67
- openshift/openshift-team-monitoring
78
approvers:
8-
- TBD
9-
api-approvers: "None"
9+
- "@openshift/openshift-team-monitoring"
10+
api-approvers:
11+
- "@dgrisonnet"
1012
creation-date: 2022-12-06
11-
last-updated: 2023-07-24
13+
last-updated: 2025-05-06
1214
tracking-link:
1315
- https://issues.redhat.com/browse/MON-2483
1416
- https://issues.redhat.com/browse/MON-3043
17+
- https://issues.redhat.com/browse/MON-3808
1518
---
1619

1720
# Metrics collection profiles
1821

1922
## Terms
2023

21-
monitors - refers to the CRDs ServiceMonitor, PodMonitor and Probe from
22-
Prometheus Operator;
23-
24-
users - refers to end-users of OpenShift who manage an OpenShift installation
25-
i.e cluster-admins;
26-
27-
developers - refers to OpenShift developers that build the platform i.e. RedHat
28-
associates and OpenSource contributors;
29-
24+
- Monitors: Refers to the CRDs ServiceMonitor, PodMonitor and Probe from
25+
Prometheus Operator.
26+
- Users: Refers to end-users of OpenShift who manage an OpenShift installation
27+
i.e. cluster-admins.
28+
- Developers: Refers to OpenShift developers that build the platform i.e. Red Hat
29+
associates and open source contributors.
3030

3131
## Summary
3232

3333
The core OpenShift components ship a large number of metrics. A 4.12-nightly
3434
cluster on AWS (3 control plane nodes + 3 worker nodes) currently produces
35-
around 350,000 unique timeseries, and adding optional operators increases that
35+
around 350,000 unique time-series, and adding optional operators increases that
3636
number. Users have repeatedly asked for a supported method of making Prometheus
3737
consume less memory and CPU, either by increasing the scraping interval or by
3838
scraping fewer targets.
@@ -51,14 +51,14 @@ Nevertheless, users have repeatedly asked for the ability to reduce the amount
5151
of memory consumed by Prometheus either by lowering the Prometheus scrape
5252
intervals or by modifying monitors.
5353

54-
Users currently can not control the aforementioned monitors scraped by
55-
Prometheus since some of the metrics collected are essential for other parts of
56-
the system to function properly: recording rules, alerting rules, console
57-
dashboards, and Red Hat Telemetry. Users also are not allowed to tune the
58-
interval at which Prometheus scrapes targets as this again can have unforeseen
59-
results that can hinder the platform: a low scrape interval value may overwhelm
60-
the platform Prometheus instance while a high interval value may render some of
61-
the default alerts ineffective.
54+
Users currently cannot control the aforementioned monitors scraped by Prometheus
55+
since some of the metrics collected are essential for other parts of the system
56+
to function properly: recording rules, alerting rules, console dashboards, and
57+
Red Hat Telemetry. Users also are not allowed to tune the interval at which
58+
Prometheus scrapes targets as this again can have unforeseen results that can
59+
hinder the platform: a low scrape interval value may overwhelm the platform
60+
Prometheus instance while a high interval value may render some of the default
61+
alerts ineffective.
6262

6363
The goal of this proposal is to allow users to pick their desired level of
6464
scraping while limiting the impact this might have on the platform, via
@@ -80,10 +80,9 @@ the `minimal` to the `full` profile. More details can be consulted in this
8080

8181
Moreover, through Telemetry, we collect for each cluster the top 3 Prometheus
8282
jobs that generate the biggest amount of samples. With this data, we know that
83-
for OpenShift 4.11 the 5 components most often reported as the biggest producers
84-
are: the Kubernetes API servers, the Kubernetes schedulers, kube-state-metrics,
85-
kubelet and the network daemon.
86-
83+
for recent OpenShift versions, the 5 components most often reported as the
84+
biggest producers are: the Kubernetes API servers, the Kubernetes schedulers,
85+
kube-state-metrics, kubelet and the network daemon.
8786

8887
### User Stories
8988

@@ -179,7 +178,7 @@ The goal is to support 2 profiles:
179178

180179
- `full` (same as today)
181180
- `minimal` (only collect metrics necessary for recording rules, alerts,
182-
dashboards, HPA and VPA and telemetry)
181+
dashboards, HPA, VPA and telemetry)
183182

184183
When the cluster admin enables the `minimal` profile, the Prometheus
185184
resource would be configured accordingly:
@@ -253,7 +252,7 @@ spec:
253252
```
254253

255254
Note:
256-
- the `metricRelabelings` section keeps only two metrics, while the rest is
255+
- the `metricRelabelings` section keeps only two metrics, while the rest are
257256
dropped.
258257
- the metrics in the `keep` section were obtained with the help of a script that
259258
parsed all Alerts, PrometheusRules and Console dashboards to determine what
@@ -316,17 +315,16 @@ not. To aid teams with this effort the monitoring team will provide:
316315
to utilize all aspects of this feature into their component's workflow.
317316
- an origin/CI test that validates for all Alerts and PrometheusRules that the
318317
metrics used by them are present in the `keep` expression of the
319-
monitor for the `minimal` profile
320-
318+
monitor for the `minimal` profile.
321319

322320
### Risks and Mitigations
323321

324-
- How are monitors supposed to be kept up to date? In 4.12 a metric that wasn't
325-
being used in an alert is now required, how does the monitor responsible for
322+
- How are monitors supposed to be kept up to date? If a metric that wasn't
323+
previously used in an alert is now required, how does the monitor responsible for
326324
that metric get updated?
327325
- The origin/CI test mentioned in the previous section will fail if there is a
328326
resource (Alerts/PrometheusRules/Dashboards) using a metric which is not
329-
present in the monitor in question;
327+
present in the monitor in question.
330328

331329
- What happens if a user provides an invalid value for a metrics collection
332330
profile?
@@ -337,29 +335,24 @@ not. To aid teams with this effort the monitoring team will provide:
337335
- Our current validation strategy with only two profiles is quite linear,
338336
however, things start becoming more complex and hard to maintain as we
339337
introduce new profiles to the mix.
340-
- Some of the things to consider if new profiles are introduce are:
341-
- How would we validate such profile?
342-
- How would we ensure teams that adopted metrics collection profiles
343-
implement the new profile?
344-
- How would we aid developers implementing the new profile?
345338

346339
### Drawbacks
347340

348-
- Extra CI cycles
341+
- Extra CI cycles.
349342

350343
## Design Details
351344

352345
### Open Questions
353346

354-
### Test Plan
347+
## Test Plan
355348

356-
- Unit tests in CMO to validate that the correct monitors are being selected
357-
- E2E tests in CMO to validate that everything works correctly
349+
- Unit tests in CMO to validate that the correct monitors are being selected.
350+
- E2E tests in CMO to validate that everything works correctly.
358351
- For the `minimal` profile, origin/CI test to validate that every metric used
359352
in a resource (Alerts/PrometheusRules/Dashboards) exists in the `keep`
360353
expression of the minimal monitors.
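
The `minimal`-profile check described in the Test Plan item above can be sketched roughly as follows. This is only an illustration, not the actual origin/CI test: the function name `missing_metrics` and the sample metric names are hypothetical, the metric names used by rules are assumed to have been extracted from the PromQL expressions already, and Python's `re` module stands in for Prometheus' RE2 syntax (the two agree for plain alternations such as the one below).

```python
import re
from typing import Set


def missing_metrics(keep_regex: str, used_metrics: Set[str]) -> Set[str]:
    """Return the metrics that a `keep` relabeling with this regex would drop."""
    # Prometheus anchors relabeling regexes at both ends, so fullmatch mirrors that.
    pattern = re.compile(keep_regex)
    return {name for name in used_metrics if pattern.fullmatch(name) is None}


# Hypothetical inputs: the regex of a `keep` action on `__name__`, and the
# metric names referenced by alerts, recording rules and dashboards.
keep_regex = "(kube_pod_info|kube_node_status_condition)"
used_by_rules = {"kube_pod_info", "kube_pod_status_phase"}

missing = missing_metrics(keep_regex, used_by_rules)
print(sorted(missing))  # a non-empty result would fail the check
```

A test built on this idea would fail whenever the returned set is non-empty, pointing at the metrics that need to be added to the monitor's `keep` expression.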
361354

362-
### Graduation Criteria
355+
## Graduation Criteria
363356

364357
- Released as TechPreview: the default being `full`, it
365358
shouldn't impact operations.
@@ -368,7 +361,13 @@ shouldn't impact operations.
368361
profile out-of-the-box and removes the earlier-imposed
369362
TechPreview gate. PTAL at the section below for more details.
370363

371-
#### Tech Preview -> GA
364+
- GA'd in 4.19: https://github.com/openshift/api/pull/2286
365+
366+
### Dev Preview -> Tech Preview
367+
368+
- [Design scrape profiles in CMO](https://issues.redhat.com/browse/MON-2483)
369+
370+
### Tech Preview -> GA
372371

373372
- [Automation to update metrics in collection profiles](https://issues.redhat.com/browse/MON-3106)
374373
- [Telemetry signal for collection profile usage](https://issues.redhat.com/browse/MON-3231)
@@ -378,12 +377,17 @@ TechPreview gate. PTAL at the section below for more details.
378377
- [origin/CI tool to validate collection profiles](https://issues.redhat.com/browse/MON-3105)
379378
- [User facing documentation created in OpenShift-docs](https://issues.redhat.com/browse/OBSDOCS-330)
380379

381-
#### Removing a deprecated feature
380+
### Removing a deprecated feature
382381

383-
- Announce deprecation and support policy of the existing feature
384-
- Deprecate the feature
382+
Deprecation, in the scope of collection profiles, is unlikely as that would
383+
entail moving all existing ServiceMonitors in that profile to either accommodate
384+
themselves in other profiles, or simply not exist anymore, which will need to be
385+
done for all teams. Either of these measures will require teams to roll back
386+
drastically on behaviours they built around in the first place. As it is right
387+
now, we do not plan on deprecating the exposed `full` or `minimal` profiles at
388+
all.
385389

386-
### Upgrade / Downgrade Strategy
390+
## Upgrade / Downgrade Strategy
387391

388392
- If metrics collection profiles is accepted and is released to 4.13 then we
389393
must backport the new monitors selectors to 4.12. The reason being that when
@@ -395,29 +399,14 @@ TechPreview gate. PTAL at the section below for more details.
395399
- Once we backport the new monitors selectors upgrades and
396400
downgrades are not expected to present a significant challenge.
397401

398-
### Version Skew Strategy
399-
400-
TBD but I don't think it applies here
401-
402-
### minimal Aspects of API Extensions
403-
404-
TBD but I don't think it applies here
405-
406-
#### Failure Modes
407-
408-
TBD but I don't think it applies here
409-
410-
#### Support Procedures
411-
412-
TBD but I don't think it applies here
413-
414402
## Implementation History
415403

416-
Initial proofs-of-concept:
417-
418404
- https://github.com/openshift/cluster-monitoring-operator/pull/1785
405+
- https://github.com/openshift/cluster-monitoring-operator/pull/2030
406+
- https://github.com/openshift/cluster-monitoring-operator/pull/2047
407+
- https://github.com/openshift/origin/pull/28889
419408

420-
## Alternatives
409+
## Alternatives (Not Implemented)
421410

422411
- Make CMO inject metric relabelling for all service monitors based on the
423412
rules being deployed, but this is not a good idea because:
@@ -453,18 +442,21 @@ Initial proofs-of-concept:
453442
- Recently Azure also added support for metrics collection profiles:
454443
- [Azure
455444
Docs](https://learn.microsoft.com/en-us/azure/azure-monitor/essentials/prometheus-metrics-scrape-configuration-minimal)
456-
- https://github.com/Azure/prometheus-collector
445+
- https://github.com/Azure/prometheus-collector
457446
- In their approach they also have
458-
[hardcoded](https://github.com/Azure/prometheus-collector/blob/66ed1a5a27781d7e7e3bb1771b11f1da25ffa79c/otelcollector/configmapparser/tomlparser-default-targets-metrics-keep-list.rb#L28)
459-
set of metrics that are only consumed when the minimal profile is enabled.
460-
However, customers are also able to extend this minimal profile with regexes
461-
to include metrics which might be interesting to them.
462-
- Leverage [installer capabilities](https://docs.google.com/document/d/1I-YT7LKKDHSBLB6Hmg0tZ54DWjrAxlVdXxlViShMu-0/edit#heading=h.848jsje80fru)
447+
[hardcoded](https://github.com/Azure/prometheus-collector/blob/66ed1a5a27781d7e7e3bb1771b11f1da25ffa79c/otelcollector/configmapparser/tomlparser-default-targets-metrics-keep-list.rb#L28)
448+
set of metrics that are only consumed when the minimal profile is enabled.
449+
However, customers are also able to extend this minimal profile with regexes
450+
to include metrics which might be interesting to them.
451+
- Leverage [installer
452+
capabilities](https://docs.google.com/document/d/1I-YT7LKKDHSBLB6Hmg0tZ54DWjrAxlVdXxlViShMu-0/edit#heading=h.848jsje80fru)
463453
- After some consideration we decided to abandon this idea since it would only
464454
work for resources controlled by CVO which is not the case for the majority
465455
of ServiceMonitors.
466-
467-
## Infrastructure Needed [optional]
456+
- Additionally, this requires users to "commit" to one profile throughout the
457+
cluster lifecycle, which is too static for our needs. For more details,
458+
see [the reasoning
459+
here](https://github.com/openshift/enhancements/pull/1298#issuecomment-1895513051).
468460

469461
### Adopted metrics collection profiles
470462

@@ -481,3 +473,37 @@ and implementation status. Possible implementation status:
481473
| Monitoring Team | kube-state-metrics | Implemented |
482474
| Monitoring Team | node-exporter | Implemented |
483475
| Monitoring Team | prometheus-adapter | Implemented |
476+
477+
### Topology Considerations
478+
479+
Supported on all topologies that deploy CMO.
480+
481+
#### Hypershift / Hosted Control Planes
482+
483+
N/A
484+
485+
#### Standalone Clusters
486+
487+
N/A
488+
489+
#### Single-node Deployments or MicroShift
490+
491+
N/A
492+
493+
## Support Procedures
494+
495+
- The `full` collection profile is meant to be synonymous with the behavior
496+
exhibited before the introduction of this patch; as such, users can switch to it
497+
if other profiles are not working as expected.
498+
- The aforementioned utilities (e.g., CPV) can be used to help diagnose the
499+
issue further.
500+
501+
## Version Skew Strategy
502+
503+
The feature-set depends on Prometheus-operator as the provider component for
504+
the resources it works on, which is shipped with CMO. No version skew is
505+
expected.
506+
507+
## Operational Aspects of API Extensions
508+
509+
N/A
