fix: aggregate all label samples for COUNTER/GAUGE metrics by WHOIM1205 · Pull Request #1170 · volcano-sh/kthena

WHOIM1205 · 2026-05-29T16:18:05Z

Summary

Fixes #1063 by aggregating all samples within a Prometheus MetricFamily for COUNTER and GAUGE metrics.

Previously, the autoscaler only used the first sample from a metric family:

metric := mf.Metric[0]

This caused incorrect metric values when a metric exposed multiple label variants. For example:

requests_total{status="success"} 2
requests_total{status="error"} 3

The autoscaler would only see 2 instead of the expected aggregated value 5.
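The difference between reading the first sample and summing all samples can be sketched as follows. This is a minimal, self-contained illustration, not the code from this PR: the `sample` type and the `firstOnly` and `sumAll` functions are hypothetical stand-ins for the Prometheus client types used in `metric_collector.go`.

```go
package main

import "fmt"

// sample is a hypothetical stand-in for one labelled series in a metric family.
type sample struct {
	labels map[string]string
	value  float64
}

// firstOnly mirrors the previous behaviour: only the first sample is read.
func firstOnly(samples []sample) float64 {
	if len(samples) == 0 {
		return 0
	}
	return samples[0].value
}

// sumAll mirrors the intended behaviour: every label variant contributes.
func sumAll(samples []sample) float64 {
	total := 0.0
	for _, s := range samples {
		total += s.value
	}
	return total
}

func main() {
	requestsTotal := []sample{
		{labels: map[string]string{"status": "success"}, value: 2},
		{labels: map[string]string{"status": "error"}, value: 3},
	}
	fmt.Println(firstOnly(requestsTotal)) // prints 2
	fmt.Println(sumAll(requestsTotal))    // prints 5
}
```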

Changes

- Aggregate all samples in a MetricFamily for COUNTER metrics.
- Aggregate all samples in a MetricFamily for GAUGE metrics.
- Keep HISTOGRAM handling unchanged.
- Add unit tests covering:
  - multi-label counters
  - multi-label gauges
  - single-sample metrics
  - ignored/unwatched metrics
  - histogram regression coverage

Testing

go test ./pkg/autoscaler/autoscaler/...

The new tests fail against the previous implementation and pass with this change.

volcano-sh-bot · 2026-05-29T16:18:12Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hzxuzhonghu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

pkg/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes a bug in the autoscaler's Prometheus metric processing where only the first sample of a MetricFamily was read, causing undercounting when metrics had multiple label combinations (issue #1063). Aggregation now sums all label samples for COUNTER and GAUGE types, while HISTOGRAM handling is intentionally unchanged.

Changes:

- Iterate over all mf.Metric samples for COUNTER and GAUGE types and sum their values via addMetric.
- Preserve existing single-sample behavior for HISTOGRAM by explicitly using mf.Metric[0].
- Add unit tests covering multi-label aggregation, single-sample behavior, unwatched/absent metrics, histogram path, and the addMetric helper.
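The per-type handling described above might look roughly like the sketch below. It uses only the standard library, so `metricFamily`, `metric`, `collect`, and the `addMetric` signature shown here are assumptions made for illustration; the real code works on `io_prometheus_client` types and builds histogram snapshots rather than storing a plain number.

```go
package main

import "fmt"

// metricType is a hypothetical stand-in for the Prometheus metric type enum.
type metricType int

const (
	counter metricType = iota
	gauge
	histogram
)

// metric is a hypothetical single labelled sample.
type metric struct{ value float64 }

// metricFamily is a hypothetical stand-in for a Prometheus MetricFamily.
type metricFamily struct {
	name    string
	kind    metricType
	metrics []metric
}

// addMetric accumulates value under name (assumed signature).
func addMetric(totals map[string]float64, name string, value float64) {
	totals[name] += value
}

// collect sums every sample of watched COUNTER and GAUGE families.
// For HISTOGRAM families it keeps only the first sample, as before.
func collect(families []metricFamily, watched map[string]bool) (totals, hists map[string]float64) {
	totals = map[string]float64{}
	hists = map[string]float64{}
	for _, mf := range families {
		if !watched[mf.name] || len(mf.metrics) == 0 {
			continue // unwatched or empty families are ignored
		}
		switch mf.kind {
		case counter, gauge:
			for _, m := range mf.metrics {
				addMetric(totals, mf.name, m.value)
			}
		case histogram:
			hists[mf.name] = mf.metrics[0].value
		}
	}
	return totals, hists
}

func main() {
	families := []metricFamily{
		{name: "requests_total", kind: counter, metrics: []metric{{2}, {3}}},
		{name: "queue_depth", kind: gauge, metrics: []metric{{4}, {1}}},
		{name: "latency", kind: histogram, metrics: []metric{{7}, {9}}},
		{name: "ignored_total", kind: counter, metrics: []metric{{100}}},
	}
	watched := map[string]bool{"requests_total": true, "queue_depth": true, "latency": true}
	totals, hists := collect(families, watched)
	fmt.Println(totals["requests_total"], totals["queue_depth"], hists["latency"]) // prints 5 5 7
}
```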

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
pkg/autoscaler/autoscaler/metric_collector.go	Aggregate all label samples for COUNTER/GAUGE; keep HISTOGRAM using first sample.
pkg/autoscaler/autoscaler/metric_collector_test.go	New tests covering aggregation, single-sample, absent/unwatched metrics, histogram, and `addMetric`.

gemini-code-assist

Code Review

This pull request updates the metric collector to aggregate all label samples within a MetricFamily for COUNTER and GAUGE metrics, preventing undercounting when multiple label variants exist. It also adds comprehensive unit tests to verify this aggregation behavior. The reviewer suggests adding a warning log when multiple samples are detected for a HISTOGRAM metric, as only the first sample is currently processed, which could lead to unexpected behavior.

gemini-code-assist · 2026-05-29T16:20:04Z

 		case io_prometheus_client.MetricType_HISTOGRAM:
-			hist := metric.GetHistogram()
+			// HISTOGRAM handling is intentionally unchanged: only the first
+			// sample is used (see issue #1063, scoped to COUNTER/GAUGE).
+			hist := mf.Metric[0].GetHistogram()
 			snapshot := histogram.NewSnapshotOfHistogram(hist)
 			currentHistograms[mf.GetName()] = snapshot


While aggregating histograms might be out of scope for this PR, silently ignoring other samples when a HISTOGRAM has multiple label variants (i.e., len(mf.Metric) > 1) can lead to unexpected behavior and is hard to debug. Consider logging a warning when multiple samples are detected for a histogram so that users are aware of this limitation.

		case io_prometheus_client.MetricType_HISTOGRAM:
			// HISTOGRAM handling is intentionally unchanged: only the first
			// sample is used (see issue #1063, scoped to COUNTER/GAUGE).
			if len(mf.Metric) > 1 {
				klog.Warningf("Metric family %s has multiple histogram samples, but only the first one is used", mf.GetName())
			}
			hist := mf.Metric[0].GetHistogram()
			snapshot := histogram.NewSnapshotOfHistogram(hist)
			currentHistograms[mf.GetName()] = snapshot

hzxuzhonghu · 2026-05-30T02:42:14Z

Since you have read #1063, you must also know about #1064. Can you explain how this PR differs from it?

WHOIM1205 · 2026-05-30T21:42:13Z

Since you have read #1063, you must also know about #1064. Can you explain how this PR differs from it?

You're right, I took a closer look at #1064 after your comment.

After comparing them, I agree both PRs are addressing the same underlying issue from #1063 by aggregating all samples within a MetricFamily for COUNTER and GAUGE metrics.

The differences seem to be mostly around defensive checks and test coverage rather than the core fix itself.

Since #1064 came first, I'm happy to defer to it. If useful, I can contribute any additional test coverage from #1170 there instead of maintaining a competing PR.

Copilot AI review requested due to automatic review settings May 29, 2026 16:18

volcano-sh-bot requested review from LiZhenCheng9527 and git-malu May 29, 2026 16:18

volcano-sh-bot added the size/L label May 29, 2026

Copilot AI reviewed May 29, 2026

gemini-code-assist Bot reviewed May 29, 2026

fix: aggregate all label samples for COUNTER/GAUGE metrics

671da97

Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>

WHOIM1205 force-pushed the fix/issue-1063-aggregate-prometheus-samples branch from 7991ab9 to 671da97 May 30, 2026 21:37

WHOIM1205 wants to merge 1 commit into volcano-sh:main from WHOIM1205:fix/issue-1063-aggregate-prometheus-samples

4 participants