fix: aggregate all label samples for COUNTER/GAUGE metrics #1170

Open
WHOIM1205 wants to merge 1 commit into volcano-sh:main from WHOIM1205:fix/issue-1063-aggregate-prometheus-samples

Conversation

@WHOIM1205

Contributor

Summary

Fixes #1063 by aggregating all samples within a Prometheus MetricFamily for COUNTER and GAUGE metrics.

Previously, the autoscaler only used the first sample from a metric family:

metric := mf.Metric[0]

This caused incorrect metric values when a metric exposed multiple label variants. For example:

requests_total{status="success"} 2
requests_total{status="error"} 3

The autoscaler would only see 2 instead of the expected aggregated value 5.
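The difference between the two behaviors can be sketched in a few lines. This is a minimal illustration only: the `sample` type and the `firstOnly` and `sumAll` functions are hypothetical stand-ins written for this example, not the autoscaler's code or the `io_prometheus_client` types.

```go
package main

import "fmt"

// sample is a simplified, hypothetical stand-in for one labeled series
// inside a metric family. It only mirrors the shape needed here.
type sample struct {
	labels map[string]string
	value  float64
}

// firstOnly mimics the previous behavior: read index 0, ignore the rest.
func firstOnly(samples []sample) float64 {
	if len(samples) == 0 {
		return 0
	}
	return samples[0].value
}

// sumAll mimics the behavior described in this PR: add up every label variant.
func sumAll(samples []sample) float64 {
	var total float64
	for _, s := range samples {
		total += s.value
	}
	return total
}

func main() {
	requestsTotal := []sample{
		{labels: map[string]string{"status": "success"}, value: 2},
		{labels: map[string]string{"status": "error"}, value: 3},
	}
	fmt.Println(firstOnly(requestsTotal)) // prints 2
	fmt.Println(sumAll(requestsTotal))    // prints 5
}
```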

Changes

  • Aggregate all samples in a MetricFamily for COUNTER metrics.
  • Aggregate all samples in a MetricFamily for GAUGE metrics.
  • Keep HISTOGRAM handling unchanged.
  • Add unit tests covering:
    • multi-label counters
    • multi-label gauges
    • single-sample metrics
    • ignored/unwatched metrics
    • histogram regression coverage

Testing

go test ./pkg/autoscaler/autoscaler/...

The new tests fail against the previous implementation and pass with this change.

Copilot AI review requested due to automatic review settings May 29, 2026 16:18
@volcano-sh-bot

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hzxuzhonghu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes a bug in the autoscaler's Prometheus metric processing where only the first sample of a MetricFamily was read, causing undercounting when metrics had multiple label combinations (issue #1063). Aggregation now sums all label samples for COUNTER and GAUGE types, while HISTOGRAM handling is intentionally unchanged.

Changes:

  • Iterate over all mf.Metric samples for COUNTER and GAUGE types and sum their values via addMetric.
  • Preserve existing single-sample behavior for HISTOGRAM by explicitly using mf.Metric[0].
  • Add unit tests covering multi-label aggregation, single-sample behavior, unwatched/absent metrics, histogram path, and the addMetric helper.
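As a rough sketch of the dispatch summarized above, the following sums every sample for watched COUNTER and GAUGE families and skips everything else. All names here (`metricKind`, `family`, `collect`, the `watched` set) are assumptions made for illustration; the actual `addMetric` helper and how the collector decides which metrics are watched are not shown in this conversation.

```go
package main

import "fmt"

// metricKind is a hypothetical stand-in for the Prometheus metric type enum.
type metricKind int

const (
	kindCounter metricKind = iota
	kindGauge
	kindHistogram
)

// family is a simplified stand-in for a MetricFamily: one name, one kind,
// and one value per label combination.
type family struct {
	name   string
	kind   metricKind
	values []float64
}

// collect sums every sample of watched COUNTER and GAUGE families.
// Unwatched or empty families are skipped. Histograms are not summed here;
// in the PR they follow a separate path that is left unchanged.
func collect(families []family, watched map[string]bool) map[string]float64 {
	totals := make(map[string]float64)
	for _, f := range families {
		if !watched[f.name] || len(f.values) == 0 {
			continue
		}
		switch f.kind {
		case kindCounter, kindGauge:
			for _, v := range f.values {
				totals[f.name] += v
			}
		}
	}
	return totals
}

func main() {
	families := []family{
		{name: "requests_total", kind: kindCounter, values: []float64{2, 3}},
		{name: "queue_depth", kind: kindGauge, values: []float64{4}},
		{name: "latency_seconds", kind: kindHistogram, values: []float64{1, 1}},
		{name: "unwatched_total", kind: kindCounter, values: []float64{9}},
	}
	watched := map[string]bool{
		"requests_total":  true,
		"queue_depth":     true,
		"latency_seconds": true,
	}
	fmt.Println(collect(families, watched)) // prints map[queue_depth:4 requests_total:5]
}
```

The cases in `main` loosely mirror the test categories listed in the PR description: a multi-label counter, a single-sample gauge, a histogram that is left alone, and an unwatched metric.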

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| pkg/autoscaler/autoscaler/metric_collector.go | Aggregate all label samples for COUNTER/GAUGE; keep HISTOGRAM using first sample. |
| pkg/autoscaler/autoscaler/metric_collector_test.go | New tests covering aggregation, single-sample, absent/unwatched metrics, histogram, and addMetric. |

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the metric collector to aggregate all label samples within a MetricFamily for COUNTER and GAUGE metrics, preventing undercounting when multiple label variants exist. It also adds comprehensive unit tests to verify this aggregation behavior. The reviewer suggests adding a warning log when multiple samples are detected for a HISTOGRAM metric, as only the first sample is currently processed, which could lead to unexpected behavior.

Comment on lines 226 to 231
  case io_prometheus_client.MetricType_HISTOGRAM:
- 	hist := metric.GetHistogram()
+ 	// HISTOGRAM handling is intentionally unchanged: only the first
+ 	// sample is used (see issue #1063, scoped to COUNTER/GAUGE).
+ 	hist := mf.Metric[0].GetHistogram()
  	snapshot := histogram.NewSnapshotOfHistogram(hist)
  	currentHistograms[mf.GetName()] = snapshot

Contributor

medium

While aggregating histograms might be out of scope for this PR, silently ignoring other samples when a HISTOGRAM has multiple label variants (i.e., len(mf.Metric) > 1) can lead to unexpected behavior and is hard to debug. Consider logging a warning when multiple samples are detected for a histogram so that users are aware of this limitation.

		case io_prometheus_client.MetricType_HISTOGRAM:
			// HISTOGRAM handling is intentionally unchanged: only the first
			// sample is used (see issue #1063, scoped to COUNTER/GAUGE).
			if len(mf.Metric) > 1 {
				klog.Warningf("Metric family %s has multiple histogram samples, but only the first one is used", mf.GetName())
			}
			hist := mf.Metric[0].GetHistogram()
			snapshot := histogram.NewSnapshotOfHistogram(hist)
			currentHistograms[mf.GetName()] = snapshot
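For context on what the warning would flag, here is a hedged sketch of what merging several histogram samples could look like. It is outside the scope of this PR and is not how the autoscaler or its `histogram` package works; the `bucket` and `hist` types and the `merge` function are hypothetical. The only fact relied on is that cumulative bucket counts with identical upper bounds can be added element by element.

```go
package main

import (
	"errors"
	"fmt"
)

// bucket is a simplified cumulative histogram bucket.
type bucket struct {
	upperBound      float64
	cumulativeCount uint64
}

// hist is a hypothetical stand-in for one labeled histogram sample.
type hist struct {
	buckets     []bucket
	sampleCount uint64
	sampleSum   float64
}

// merge adds histograms bucket by bucket. It returns an error when the
// bucket boundaries differ, since the counts are only comparable when the
// layouts match.
func merge(hists []hist) (hist, error) {
	if len(hists) == 0 {
		return hist{}, errors.New("no histogram samples")
	}
	out := hist{
		buckets:     make([]bucket, len(hists[0].buckets)),
		sampleCount: hists[0].sampleCount,
		sampleSum:   hists[0].sampleSum,
	}
	copy(out.buckets, hists[0].buckets)
	for _, h := range hists[1:] {
		if len(h.buckets) != len(out.buckets) {
			return hist{}, errors.New("bucket layouts differ")
		}
		for i, b := range h.buckets {
			if b.upperBound != out.buckets[i].upperBound {
				return hist{}, errors.New("bucket layouts differ")
			}
			out.buckets[i].cumulativeCount += b.cumulativeCount
		}
		out.sampleCount += h.sampleCount
		out.sampleSum += h.sampleSum
	}
	return out, nil
}

func main() {
	success := hist{buckets: []bucket{{0.1, 1}, {1, 2}}, sampleCount: 2, sampleSum: 0.6}
	failure := hist{buckets: []bucket{{0.1, 0}, {1, 3}}, sampleCount: 3, sampleSum: 1.5}
	merged, err := merge([]hist{success, failure})
	if err != nil {
		fmt.Println("merge failed:", err)
		return
	}
	fmt.Println(merged.sampleCount, merged.buckets)
}
```

With only the first sample used, the three observations recorded under the second label set would be invisible, which is the situation the suggested warning is meant to surface.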

@hzxuzhonghu

Member

Since you have read #1063, you definitely know #1064, can you explain what is the diff with that

Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
@WHOIM1205 WHOIM1205 force-pushed the fix/issue-1063-aggregate-prometheus-samples branch from 7991ab9 to 671da97 Compare May 30, 2026 21:37
@WHOIM1205

Contributor Author

> Since you have read #1063, you definitely know #1064, can you explain what is the diff with that

You're right, I took a closer look at #1064 after your comment.

After comparing them, I agree both PRs are addressing the same underlying issue from #1063 by aggregating all samples within a MetricFamily for COUNTER and GAUGE metrics.

The differences seem to be mostly around defensive checks and test coverage rather than the core fix itself.

Since #1064 came first, I'm happy to defer to it. If useful, I can contribute any additional test coverage from #1170 there instead of maintaining a competing PR.

Development

Successfully merging this pull request may close these issues.

Autoscaler only reads the first Prometheus sample for labeled metrics

4 participants