fix: aggregate labeled autoscaler metrics #1064

Open
pm-ju wants to merge 4 commits into volcano-sh:main from pm-ju:fix-autoscaler-prometheus-sample-aggregation
Conversation

@pm-ju
Contributor

@pm-ju pm-ju commented May 15, 2026

/kind bug

The autoscaler Prometheus parser only used the first sample in each MetricFamily (mf.Metric[0]).

For labeled counter/gauge metrics, Prometheus returns multiple samples under the same metric family. Ignoring all samples after the first can undercount watched autoscaling metrics and make scaling decisions use incomplete data.

Fixes #1063

This change:

  • aggregates all counter samples in a watched MetricFamily
  • aggregates all gauge samples in a watched MetricFamily
  • skips malformed counter/gauge/histogram samples instead of dereferencing missing values
  • adds regression coverage for labeled counter and gauge samples
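
The difference between reading only the first sample and aggregating every sample can be sketched as below. This is a minimal illustration under stated assumptions, not the PR's actual code: `Sample`, `Family`, `FirstSample`, `SumSamples`, and `ptr` are hypothetical stand-ins for the Prometheus client model types and the collector logic, and a nil `Value` models a sample whose counter or gauge value is missing.

```go
package main

// Sample is a hypothetical stand-in for one labeled sample in a metric family.
// A nil Value models a malformed sample with no counter/gauge payload.
type Sample struct {
	Labels map[string]string
	Value  *float64
}

// Family is a hypothetical stand-in for a Prometheus MetricFamily.
type Family struct {
	Name    string
	Samples []Sample
}

// FirstSample mimics the old behavior: only Samples[0] is read.
func FirstSample(f Family) (float64, bool) {
	if len(f.Samples) == 0 || f.Samples[0].Value == nil {
		return 0, false
	}
	return *f.Samples[0].Value, true
}

// SumSamples adds up every well-formed sample and skips malformed ones.
// It reports false when no sample carried a value.
func SumSamples(f Family) (float64, bool) {
	var total float64
	found := false
	for _, s := range f.Samples {
		if s.Value == nil {
			continue // skip instead of dereferencing a missing value
		}
		total += *s.Value
		found = true
	}
	return total, found
}

// ptr returns a pointer to v, for building samples concisely.
func ptr(v float64) *float64 { return &v }
```

For a family whose samples carry 3, a missing value, and 4, `FirstSample` sees only 3 while `SumSamples` returns 7.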

Signed-off-by: pm-ju <pmdevops29@gmail.com>
Copilot AI review requested due to automatic review settings May 15, 2026 07:24
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request improves the metric collector by iterating through and aggregating all labeled samples within a Prometheus MetricFamily for COUNTER and GAUGE types, ensuring a more complete data set. It also introduces nil checks for metric values and adds a unit test to verify the aggregation of labeled samples. Feedback highlights an inconsistency in the handling of HISTOGRAM types, which still only process the first sample and may lead to incomplete data if multiple labeled histograms are present.

Comment on lines 227 to 235
case io_prometheus_client.MetricType_HISTOGRAM:
    metric := mf.Metric[0]
    if metric.GetHistogram() == nil {
        klog.Errorf("histogram metric %s is missing histogram value", mf.GetName())
        continue
    }
    hist := metric.GetHistogram()
    snapshot := histogram.NewSnapshotOfHistogram(hist)
    currentHistograms[mf.GetName()] = snapshot
Contributor

medium

The handling of MetricType_HISTOGRAM is inconsistent with the improvements made for COUNTER and GAUGE. While the latter now aggregate all labeled samples in a MetricFamily, histograms still only process the first sample (mf.Metric[0]).

If a metric family contains multiple labeled histograms, all samples after the first are ignored, which leads to the "incomplete data" issue described in the PR for other types. Furthermore, if the first sample happens to be malformed, the entire family is skipped even if subsequent samples are valid.

While merging histograms is more complex than summing scalars, consider at least iterating to find the first valid sample, or ideally, implementing a merge strategy to ensure the autoscaler has a complete view of the metric.
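
One possible merge strategy is sketched below. It is an illustration only, and it assumes every labeled histogram in the family shares the same bucket bounds. `Bucket`, `Hist`, and `MergeHistograms` are hypothetical stand-ins, not the client model types or the project's `histogram` package. Nil entries model malformed samples and are skipped, so a bad first sample no longer discards the whole family.

```go
package main

import "sort"

// Bucket is a hypothetical stand-in for one cumulative histogram bucket.
type Bucket struct {
	UpperBound      float64
	CumulativeCount uint64
}

// Hist is a hypothetical stand-in for one labeled histogram sample.
type Hist struct {
	Count   uint64
	Sum     float64
	Buckets []Bucket
}

// MergeHistograms combines all non-nil histograms into one.
// Assumption: all inputs use identical bucket upper bounds, so cumulative
// counts can be added bound by bound. It reports false if no input was valid.
func MergeHistograms(hists []*Hist) (*Hist, bool) {
	counts := map[float64]uint64{}
	merged := &Hist{}
	found := false
	for _, h := range hists {
		if h == nil {
			continue // malformed sample: skip it, keep scanning the rest
		}
		found = true
		merged.Count += h.Count
		merged.Sum += h.Sum
		for _, b := range h.Buckets {
			counts[b.UpperBound] += b.CumulativeCount
		}
	}
	if !found {
		return nil, false
	}
	for ub, c := range counts {
		merged.Buckets = append(merged.Buckets, Bucket{UpperBound: ub, CumulativeCount: c})
	}
	// Map iteration order is random, so restore ascending bucket order.
	sort.Slice(merged.Buckets, func(i, j int) bool {
		return merged.Buckets[i].UpperBound < merged.Buckets[j].UpperBound
	})
	return merged, true
}
```

If the bucket layouts can differ between label sets, adding cumulative counts this way is no longer correct, and falling back to the first valid sample may be the safer choice.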

Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Contributor

@FAUST-BENCHOU FAUST-BENCHOU left a comment

I'm not sure we can directly add up all the metrics this way.

@pm-ju
Contributor Author

pm-ju commented May 16, 2026

I'm not sure we can directly add up all the metrics this way.

You're right. Blindly summing all labeled samples is not always safe, especially for gauges/ratios.

My intent was additive metrics like request counters, but this should be narrowed or made explicit. Would it be okay if I limit this PR to counters only and leave gauges/histograms unchanged?
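
A small sketch of why the distinction matters, assuming per-label values have already been extracted into a slice. `SumValues` and `MeanValues` are hypothetical helpers, and averaging is only one possible treatment for ratio-like gauges, not something this PR implements.

```go
package main

// SumValues adds per-label sample values. This suits additive metrics such as
// request counters, where the total across labels is meaningful.
func SumValues(vals []float64) float64 {
	var total float64
	for _, v := range vals {
		total += v
	}
	return total
}

// MeanValues averages per-label sample values. This is one possible choice for
// ratio-like gauges (an assumption for illustration). It reports false for an
// empty input.
func MeanValues(vals []float64) (float64, bool) {
	if len(vals) == 0 {
		return 0, false
	}
	return SumValues(vals) / float64(len(vals)), true
}
```

Two label sets reporting request counters of 120 and 80 sum to a meaningful 200. Two label sets each reporting a utilization ratio of 0.5 sum to 1.0, which would read as full saturation, whereas the mean stays at 0.5.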

@FAUST-BENCHOU

@FAUST-BENCHOU
Contributor

I'm not familiar with this part; maybe wait for the community's feedback.

Member

@hzxuzhonghu hzxuzhonghu left a comment

Can you consistently use the same parsing as pkg/kthena-router/backend/metrics/metrics.go?

It uses expfmt.NewTextParser.

pm-ju added 2 commits May 18, 2026 17:23
Signed-off-by: pm-ju <pmdevops29@gmail.com>
…etheus-sample-aggregation

Signed-off-by: pm-ju <pmdevops29@gmail.com>

# Conflicts:
#	pkg/autoscaler/autoscaler/metric_collector.go
Copilot AI review requested due to automatic review settings May 18, 2026 11:59
@volcano-sh-bot
Contributor

Adding label do-not-merge/contains-merge-commits because PR contains merge commits, which are not allowed in this repository.
Use git rebase to reapply your commits on top of the target branch. Detailed instructions for doing so can be found here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

parser := expfmt.NewTextParser(model.UTF8Validation)
metricFamilies, err := parser.TextToMetricFamilies(strings.NewReader(metricStr))
if err != nil {
    klog.Errorf("error parsing metric families: %v", err)
Member

please fix

Comment on lines +226 to 234
case dto.MetricType_HISTOGRAM:
    metric := mf.Metric[0]
    if metric.GetHistogram() == nil {
        klog.Errorf("histogram metric %s is missing histogram value", mf.GetName())
        continue
    }
    hist := metric.GetHistogram()
    snapshot := histogram.NewSnapshotOfHistogram(hist)
    currentHistograms[mf.GetName()] = snapshot
@pm-ju
Contributor Author

pm-ju commented May 18, 2026

Can you consistently use the same parsing as pkg/kthena-router/backend/metrics/metrics.go?

It uses expfmt.NewTextParser.

Updated, thanks. The autoscaler parser now uses expfmt.NewTextParser(model.UTF8Validation) like pkg/kthena-router/backend/metrics/metrics.go, while keeping the sample aggregation behavior.

@pm-ju
Contributor Author

pm-ju commented May 18, 2026

image

If anyone knows why this test is failing, could you tell me? I don't think my code touches anything that would cause this particular gateway-api failure.

Signed-off-by: pm-ju <pmdevops29@gmail.com>
@hzxuzhonghu
Member

@pm-ju Looks like a flake

Member

@hzxuzhonghu hzxuzhonghu left a comment

Good. Can you add a test case for the HISTOGRAM metric?

Development

Successfully merging this pull request may close these issues.

Autoscaler only reads the first Prometheus sample for labeled metrics

5 participants