fix: aggregate all label samples for COUNTER/GAUGE metrics#1170
fix: aggregate all label samples for COUNTER/GAUGE metrics#1170WHOIM1205 wants to merge 1 commit into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Fixes a bug in the autoscaler's Prometheus metric processing where only the first sample of a MetricFamily was read, causing undercounting when metrics had multiple label combinations (issue #1063). Aggregation now sums all label samples for COUNTER and GAUGE types, while HISTOGRAM handling is intentionally unchanged.
Changes:
- Iterate over all
mf.Metricsamples for COUNTER and GAUGE types and sum their values viaaddMetric. - Preserve existing single-sample behavior for HISTOGRAM by explicitly using
mf.Metric[0]. - Add unit tests covering multi-label aggregation, single-sample behavior, unwatched/absent metrics, histogram path, and the
addMetrichelper.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pkg/autoscaler/autoscaler/metric_collector.go | Aggregate all label samples for COUNTER/GAUGE; keep HISTOGRAM using first sample. |
| pkg/autoscaler/autoscaler/metric_collector_test.go | New tests covering aggregation, single-sample, absent/unwatched metrics, histogram, and addMetric. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Code Review
This pull request updates the metric collector to aggregate all label samples within a MetricFamily for COUNTER and GAUGE metrics, preventing undercounting when multiple label variants exist. It also adds comprehensive unit tests to verify this aggregation behavior. The reviewer suggests adding a warning log when multiple samples are detected for a HISTOGRAM metric, as only the first sample is currently processed, which could lead to unexpected behavior.
| case io_prometheus_client.MetricType_HISTOGRAM: | ||
| hist := metric.GetHistogram() | ||
| // HISTOGRAM handling is intentionally unchanged: only the first | ||
| // sample is used (see issue #1063, scoped to COUNTER/GAUGE). | ||
| hist := mf.Metric[0].GetHistogram() | ||
| snapshot := histogram.NewSnapshotOfHistogram(hist) | ||
| currentHistograms[mf.GetName()] = snapshot |
There was a problem hiding this comment.
While aggregating histograms might be out of scope for this PR, silently ignoring other samples when a HISTOGRAM has multiple label variants (i.e., len(mf.Metric) > 1) can lead to unexpected behavior and is hard to debug. Consider logging a warning when multiple samples are detected for a histogram so that users are aware of this limitation.
case io_prometheus_client.MetricType_HISTOGRAM:
// HISTOGRAM handling is intentionally unchanged: only the first
// sample is used (see issue #1063, scoped to COUNTER/GAUGE).
if len(mf.Metric) > 1 {
klog.Warningf("Metric family %s has multiple histogram samples, but only the first one is used", mf.GetName())
}
hist := mf.Metric[0].GetHistogram()
snapshot := histogram.NewSnapshotOfHistogram(hist)
currentHistograms[mf.GetName()] = snapshotSigned-off-by: WHOIM1205 <rathourprateek8@gmail.com>
7991ab9 to
671da97
Compare
You're right, I took a closer look at #1064 after your comment. After comparing them, I agree both PRs are addressing the same underlying issue from #1063 by aggregating all samples within a MetricFamily for COUNTER and GAUGE metrics. The differences seem to be mostly around defensive checks and test coverage rather than the core fix itself. Since #1064 came first, I'm happy to defer to it. If useful, I can contribute any additional test coverage from #1170 there instead of maintaining a competing PR. |
Summary
Fixes #1063 by aggregating all samples within a Prometheus MetricFamily for COUNTER and GAUGE metrics.
Previously, the autoscaler only used the first sample from a metric family:
This caused incorrect metric values when a metric exposed multiple label variants. For example:
The autoscaler would only see
2instead of the expected aggregated value5.Changes
Testing
go test ./pkg/autoscaler/autoscaler/...The new tests fail against the previous implementation and pass with this change.