fix: aggregate labeled autoscaler metrics#1064
Conversation
Signed-off-by: pm-ju <pmdevops29@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Code Review
This pull request improves the metric collector by iterating through and aggregating all labeled samples within a Prometheus MetricFamily for COUNTER and GAUGE types, ensuring a more complete data set. It also introduces nil checks for metric values and adds a unit test to verify the aggregation of labeled samples. Feedback highlights an inconsistency in the handling of HISTOGRAM types, which still only process the first sample and may lead to incomplete data if multiple labeled histograms are present.
| case io_prometheus_client.MetricType_HISTOGRAM: | ||
| metric := mf.Metric[0] | ||
| if metric.GetHistogram() == nil { | ||
| klog.Errorf("histogram metric %s is missing histogram value", mf.GetName()) | ||
| continue | ||
| } | ||
| hist := metric.GetHistogram() | ||
| snapshot := histogram.NewSnapshotOfHistogram(hist) | ||
| currentHistograms[mf.GetName()] = snapshot |
There was a problem hiding this comment.
The handling of MetricType_HISTOGRAM is inconsistent with the improvements made for COUNTER and GAUGE. While the latter now aggregate all labeled samples in a MetricFamily, histograms still only process the first sample (mf.Metric[0]).
If a metric family contains multiple labeled histograms, all samples after the first are ignored, which leads to the "incomplete data" issue described in the PR for other types. Furthermore, if the first sample happens to be malformed, the entire family is skipped even if subsequent samples are valid.
While merging histograms is more complex than summing scalars, consider at least iterating to find the first valid sample, or ideally, implementing a merge strategy to ensure the autoscaler has a complete view of the metric.
you’re right. Blindly summing all labeled samples is not always safe, especially for gauges/ratios. My intent was additive metrics like request counters, but this should be narrowed or made explicit. Would it be okay if I limit this PR to counters only and leave gauges/histograms unchanged? |
i'm not familiar with this part, maybe wait for community's feedback |
hzxuzhonghu
left a comment
There was a problem hiding this comment.
can you consistently use same parsing like pkg/kthena-router/backend/metrics/metrics.go
It use expfmt.NewTextParser
Signed-off-by: pm-ju <pmdevops29@gmail.com>
…etheus-sample-aggregation Signed-off-by: pm-ju <pmdevops29@gmail.com> # Conflicts: # pkg/autoscaler/autoscaler/metric_collector.go
|
Adding label DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
| parser := expfmt.NewTextParser(model.UTF8Validation) | ||
| metricFamilies, err := parser.TextToMetricFamilies(strings.NewReader(metricStr)) | ||
| if err != nil { | ||
| klog.Errorf("error parsing metric families: %v", err) |
| case dto.MetricType_HISTOGRAM: | ||
| metric := mf.Metric[0] | ||
| if metric.GetHistogram() == nil { | ||
| klog.Errorf("histogram metric %s is missing histogram value", mf.GetName()) | ||
| continue | ||
| } | ||
| hist := metric.GetHistogram() | ||
| snapshot := histogram.NewSnapshotOfHistogram(hist) | ||
| currentHistograms[mf.GetName()] = snapshot |
Updated, thanks. The autoscaler parser now uses |
Signed-off-by: pm-ju <pmdevops29@gmail.com>
|
@pm-ju Looks like a flake |
hzxuzhonghu
left a comment
There was a problem hiding this comment.
good can you add test case fro HISTOGRAM metric

/kind bug
The autoscaler Prometheus parser only used the first sample in each MetricFamily (
mf.Metric[0]).For labeled counter/gauge metrics, Prometheus returns multiple samples under the same metric family. Ignoring all samples after the first can undercount watched autoscaling metrics and make scaling decisions use incomplete data.
Fixes #1063
This change: