fix: aggregate labeled autoscaler metrics by pm-ju · Pull Request #1064 · volcano-sh/kthena

pm-ju · 2026-05-15T07:24:16Z

/kind bug

The autoscaler Prometheus parser only used the first sample in each MetricFamily (mf.Metric[0]).

For labeled counter/gauge metrics, Prometheus returns multiple samples under the same metric family. Ignoring all samples after the first can undercount watched autoscaling metrics and make scaling decisions use incomplete data.

Fixes #1063

This change:

- aggregates all counter samples in a watched MetricFamily
- aggregates all gauge samples in a watched MetricFamily
- skips malformed counter/gauge/histogram samples instead of dereferencing missing values
- adds regression coverage for labeled counter and gauge samples
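
The difference between reading only the first sample and aggregating every sample can be sketched as follows. This is a minimal illustration, not the code in this PR: `Sample`, `Family`, `firstSampleOnly`, `sumSamples`, and `ptr` are hypothetical stand-ins that use only the standard library in place of the real Prometheus client types.

```go
package main

// Sample is a hypothetical stand-in for one labeled sample in a metric family.
// Value is a pointer so that a malformed sample (missing value) is a nil.
type Sample struct {
	Labels map[string]string
	Value  *float64
}

// Family is a hypothetical stand-in for a Prometheus MetricFamily.
type Family struct {
	Name    string
	Samples []Sample
}

// firstSampleOnly mirrors the behavior described in the bug report:
// it reads Samples[0] and ignores every other labeled sample.
func firstSampleOnly(f Family) (float64, bool) {
	if len(f.Samples) == 0 || f.Samples[0].Value == nil {
		return 0, false
	}
	return *f.Samples[0].Value, true
}

// sumSamples adds every well-formed sample and skips malformed ones.
// The boolean reports whether at least one valid sample was found.
func sumSamples(f Family) (float64, bool) {
	total, found := 0.0, false
	for _, s := range f.Samples {
		if s.Value == nil {
			continue // malformed sample: skip instead of dereferencing nil
		}
		total += *s.Value
		found = true
	}
	return total, found
}

// ptr is a small helper for building samples in examples and tests.
func ptr(v float64) *float64 { return &v }
```

With three samples holding 3, a missing value, and 4, `firstSampleOnly` reports 3 while `sumSamples` reports 7, which is the undercount the description refers to.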

Signed-off-by: pm-ju <pmdevops29@gmail.com>

volcano-sh-bot · 2026-05-15T07:24:23Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

pkg/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gemini-code-assist

Code Review

This pull request improves the metric collector by iterating through and aggregating all labeled samples within a Prometheus MetricFamily for COUNTER and GAUGE types, ensuring a more complete data set. It also introduces nil checks for metric values and adds a unit test to verify the aggregation of labeled samples. Feedback highlights an inconsistency in the handling of HISTOGRAM types, which still only process the first sample and may lead to incomplete data if multiple labeled histograms are present.

gemini-code-assist · 2026-05-15T07:26:02Z

 		case io_prometheus_client.MetricType_HISTOGRAM:
+			metric := mf.Metric[0]
+			if metric.GetHistogram() == nil {
+				klog.Errorf("histogram metric %s is missing histogram value", mf.GetName())
+				continue
+			}
 			hist := metric.GetHistogram()
 			snapshot := histogram.NewSnapshotOfHistogram(hist)
 			currentHistograms[mf.GetName()] = snapshot


The handling of MetricType_HISTOGRAM is inconsistent with the improvements made for COUNTER and GAUGE. While the latter now aggregate all labeled samples in a MetricFamily, histograms still only process the first sample (mf.Metric[0]).

If a metric family contains multiple labeled histograms, all samples after the first are ignored, which leads to the "incomplete data" issue described in the PR for other types. Furthermore, if the first sample happens to be malformed, the entire family is skipped even if subsequent samples are valid.

While merging histograms is more complex than summing scalars, consider at least iterating to find the first valid sample, or ideally, implementing a merge strategy to ensure the autoscaler has a complete view of the metric.
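
One possible merge strategy, sketched under the assumption that every labeled histogram in the family uses the same bucket boundaries: add the cumulative counts per upper bound, and add the sample counts and sums. `Bucket`, `Hist`, and `mergeHists` are hypothetical stand-ins written against the standard library only, not the types used in the autoscaler.

```go
package main

import "sort"

// Bucket is a hypothetical stand-in for one cumulative histogram bucket.
type Bucket struct {
	UpperBound      float64
	CumulativeCount uint64
}

// Hist is a hypothetical stand-in for one labeled histogram sample.
type Hist struct {
	SampleCount uint64
	SampleSum   float64
	Buckets     []Bucket
}

// mergeHists combines all non-nil histograms into one.
// Assumption: all inputs share identical bucket upper bounds, so cumulative
// counts for the same bound can simply be added.
// The boolean is false when no valid histogram was found.
func mergeHists(hists []*Hist) (*Hist, bool) {
	counts := map[float64]uint64{}
	merged := &Hist{}
	found := false
	for _, h := range hists {
		if h == nil {
			continue // malformed sample: skip it, keep looking at the rest
		}
		found = true
		merged.SampleCount += h.SampleCount
		merged.SampleSum += h.SampleSum
		for _, b := range h.Buckets {
			counts[b.UpperBound] += b.CumulativeCount
		}
	}
	if !found {
		return nil, false
	}
	for ub, c := range counts {
		merged.Buckets = append(merged.Buckets, Bucket{ub, c})
	}
	// Map iteration order is random, so restore ascending bucket order.
	sort.Slice(merged.Buckets, func(i, j int) bool {
		return merged.Buckets[i].UpperBound < merged.Buckets[j].UpperBound
	})
	return merged, true
}
```

Because nil entries are skipped rather than aborting, a malformed first sample no longer causes the whole family to be dropped. If bucket layouts differ between labels, this simple addition is not valid and a different strategy would be needed.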

FAUST-BENCHOU

I'm not sure we can simply add all the metrics together like this.

pm-ju · 2026-05-16T11:20:06Z

not sure can we use this kind of way to directly add all metrics

You’re right. Blindly summing all labeled samples is not always safe, especially for gauges/ratios.

My intent was additive metrics like request counters, but this should be narrowed or made explicit. Would it be okay if I limit this PR to counters only and leave gauges/histograms unchanged?

@FAUST-BENCHOU
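
To illustrate why summing is safe for additive metrics but not for ratio-like gauges, here is a small sketch of one hypothetical way to make the aggregation explicit per metric. `AggMode`, `AggSum`, `AggMean`, and `aggregate` are invented for illustration and are not part of this PR or of kthena.

```go
package main

// AggMode is a hypothetical knob naming how labeled samples are combined.
type AggMode int

const (
	AggSum  AggMode = iota // additive metrics, e.g. request counters
	AggMean                // ratio-like gauges, e.g. utilization
)

// aggregate combines values according to mode.
// The boolean is false when there are no values to combine.
func aggregate(values []float64, mode AggMode) (float64, bool) {
	if len(values) == 0 {
		return 0, false
	}
	total := 0.0
	for _, v := range values {
		total += v
	}
	if mode == AggMean {
		return total / float64(len(values)), true
	}
	return total, true
}
```

For two utilization gauges of 0.5 and 0.75, summing gives 1.25, which is not a meaningful ratio, while the mean gives 0.625. Whether a mean (or a max, or no aggregation at all) is the right choice depends on the metric, which is why the mode would need to be stated explicitly rather than assumed.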

FAUST-BENCHOU · 2026-05-16T11:42:25Z

I'm not familiar with this part; maybe wait for the community's feedback.

hzxuzhonghu

For consistency, can you use the same parsing approach as pkg/kthena-router/backend/metrics/metrics.go?

It uses expfmt.NewTextParser.

volcano-sh-bot · 2026-05-18T11:59:16Z

Adding label do-not-merge/contains-merge-commits because PR contains merge commits, which are not allowed in this repository.
Use git rebase to reapply your commits on top of the target branch. Detailed instructions for doing so can be found here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

hzxuzhonghu · 2026-05-19T09:23:47Z

+	parser := expfmt.NewTextParser(model.UTF8Validation)
+	metricFamilies, err := parser.TextToMetricFamilies(strings.NewReader(metricStr))
+	if err != nil {
+		klog.Errorf("error parsing metric families: %v", err)


+		case dto.MetricType_HISTOGRAM:
+			metric := mf.Metric[0]
+			if metric.GetHistogram() == nil {
+				klog.Errorf("histogram metric %s is missing histogram value", mf.GetName())
+				continue
+			}
 			hist := metric.GetHistogram()
 			snapshot := histogram.NewSnapshotOfHistogram(hist)
 			currentHistograms[mf.GetName()] = snapshot


pm-ju · 2026-05-18T12:06:06Z

For consistency, can you use the same parsing approach as pkg/kthena-router/backend/metrics/metrics.go?

It uses expfmt.NewTextParser.

Updated, thanks. The autoscaler parser now uses expfmt.NewTextParser(model.UTF8Validation) like pkg/kthena-router/backend/metrics/metrics.go, while keeping the sample aggregation behavior.

pm-ju · 2026-05-18T13:12:56Z

If anyone knows about this failing test, could you tell me? I don't think my code touches any part where this particular gateway-api test might be failing.

hzxuzhonghu · 2026-05-19T09:24:49Z

@pm-ju Looks like a flake

hzxuzhonghu

Good. Can you add a test case for the HISTOGRAM metric?

fix: aggregate labeled autoscaler metrics

b75fa87

Signed-off-by: pm-ju <pmdevops29@gmail.com>

Copilot AI review requested due to automatic review settings May 15, 2026 07:24

volcano-sh-bot added the kind/bug label May 15, 2026

volcano-sh-bot requested review from YaoZengzeng and hzxuzhonghu May 15, 2026 07:24

volcano-sh-bot added the size/M label May 15, 2026

Copilot started reviewing on behalf of pm-ju May 15, 2026 07:24

gemini-code-assist Bot reviewed May 15, 2026

Copilot AI reviewed May 15, 2026

FAUST-BENCHOU reviewed May 16, 2026

hzxuzhonghu reviewed May 18, 2026

pm-ju added 2 commits May 18, 2026 17:23

fix: align autoscaler prometheus parsing

18822d1

Signed-off-by: pm-ju <pmdevops29@gmail.com>

Merge remote-tracking branch 'upstream/main' into fix-autoscaler-prom…

f771fb4

…etheus-sample-aggregation Signed-off-by: pm-ju <pmdevops29@gmail.com> # Conflicts: # pkg/autoscaler/autoscaler/metric_collector.go

Copilot AI review requested due to automatic review settings May 18, 2026 11:59

volcano-sh-bot added the do-not-merge/contains-merge-commits label May 18, 2026

Copilot started reviewing on behalf of pm-ju May 18, 2026 11:59

Copilot AI reviewed May 18, 2026

test: strengthen autoscaler metric aggregation coverage

acf3d1e

Signed-off-by: pm-ju <pmdevops29@gmail.com>

volcano-sh-bot added size/L and removed size/M labels May 18, 2026

hzxuzhonghu reviewed May 19, 2026

5 participants
