autoscaler: Fix external metric aggregation#1207
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to aggregate external metrics across backends by distinguishing between additive metrics (which are summed) and ratio-like metrics (which are averaged using replica-weighted values). The code reviewer identified a critical bug where ratio-like metrics must be aggregated as a weighted sum rather than a weighted average to prevent incorrect downscaling in the downstream autoscaling algorithm. Additionally, the reviewer pointed out that throughput metrics ending in 'rate' are additive and should not be treated as ratio-like, and suggested a more robust suffix-matching heuristic to avoid false positives. Finally, the reviewer recommended updating the unit tests to align with these corrections.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
f173e50 to
6007e5a
Compare
6007e5a to
76b6935
Compare
|
cc @chenhuiluo please help review |
Signed-off-by: Aamod007 <aamodkumar2006@gmail.com>
2491315 to
3a379b6
Compare
1161334 to
a2ae5b0
Compare
Signed-off-by: Aamod007 <aamodkumar2006@gmail.com>
a2ae5b0 to
82aeb34
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
This PR updates the heterogeneous autoscaler optimizer to support per-metric aggregation semantics for external metrics so that additive metrics (e.g., queue length) can continue to be summed while ratio-like metrics (e.g., utilization/usage) are aggregated differently to avoid invalid “>100%” style totals.
Changes:
- Adds
AggregationType(Sum/Avg) and anaggregationfield toAutoscalingPolicyMetric, and propagates it through generated artifacts (CRDs/docs/applyconfig). - Updates the heterogeneous optimizer to aggregate external metrics using a per-metric strategy (new in-memory aggregation helpers).
- Adds unit tests covering the external-metric aggregation behavior.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/autoscaler/autoscaler/optimizer.go | Replaces unconditional external-metric summation with aggregation-aware collection/finalization. |
| pkg/autoscaler/autoscaler/optimizer_test.go | Adds tests for new external-metric aggregation behavior. |
| pkg/autoscaler/autoscaler/metric_collector.go | Introduces GetMetricAggregations to derive aggregation strategy per metric. |
| pkg/apis/workload/v1alpha1/autoscalingpolicy_types.go | Adds aggregation field + AggregationType enum to the AutoscalingPolicyMetric API. |
| docs/kthena/docs/reference/crd/workload.serving.volcano.sh.md | Documents the new CRD field/type. |
| client-go/applyconfiguration/workload/v1alpha1/autoscalingpolicymetric.go | Adds applyconfig support for aggregation. |
| charts/kthena/charts/workload/crds/workload.serving.volcano.sh_modelboosters.yaml | Updates Helm CRD schema to include aggregation. |
| charts/kthena/charts/workload/crds/workload.serving.volcano.sh_autoscalingpolicies.yaml | Updates Helm CRD schema to include aggregation. |
Files not reviewed (1)
- client-go/applyconfiguration/workload/v1alpha1/autoscalingpolicymetric.go: Generated file
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| func finalizeExternalMetrics(aggregates map[string]*externalMetricAggregate) algorithm.Metrics { | ||
| metrics := make(algorithm.Metrics, len(aggregates)) | ||
| for metricName, aggregate := range aggregates { | ||
| if aggregate.aggregationType == workload.AggregationTypeAvg { | ||
| if aggregate.weight > 0 { | ||
| metrics[metricName] = aggregate.weightedSum | ||
| continue | ||
| } | ||
| metrics[metricName] = aggregate.sum / float64(aggregate.count) | ||
| continue | ||
| } | ||
| metrics[metricName] = aggregate.sum | ||
| } | ||
| return metrics | ||
| } |
| { | ||
| name: "ratio metrics use weighted sum by backend replica count", | ||
| samples: []struct { | ||
| metricName string | ||
| aggType workload.AggregationType | ||
| value float64 | ||
| replicas int32 | ||
| }{ | ||
| {metricName: "gpu_utilization", aggType: workload.AggregationTypeAvg, value: 60, replicas: 2}, | ||
| {metricName: "gpu_utilization", aggType: workload.AggregationTypeAvg, value: 90, replicas: 1}, | ||
| }, | ||
| want: map[string]float64{ | ||
| "gpu_utilization": 210, | ||
| }, | ||
| }, |
| func GetMetricAggregations(autoscalePolicy *v1alpha1.AutoscalingPolicy) map[string]v1alpha1.AggregationType { | ||
| metricAggregations := make(map[string]v1alpha1.AggregationType) | ||
| if autoscalePolicy == nil { | ||
| klog.Warning("autoscalePolicy is nil, can't get metricAggregations") | ||
| return metricAggregations | ||
| } | ||
|
|
||
| for _, metric := range autoscalePolicy.Spec.Metrics { | ||
| if metric.Aggregation == "" { | ||
| metricAggregations[metric.Name] = v1alpha1.AggregationTypeSum | ||
| } else { | ||
| metricAggregations[metric.Name] = metric.Aggregation | ||
| } | ||
| } | ||
| return metricAggregations | ||
| } |
| // Aggregation defines how the metric should be aggregated across multiple instances. | ||
| // 'Sum' is used for additive metrics like queue length or request counts. | ||
| // 'Avg' is used for ratio/percentage metrics like GPU utilization or cache usage. | ||
| // +kubebuilder:validation:Enum=Sum;Avg | ||
| // +kubebuilder:default="Sum" | ||
| // +optional | ||
| Aggregation AggregationType `json:"aggregation,omitempty"` |
| func addExternalMetric(aggregates map[string]*externalMetricAggregate, metricName string, metricValue float64, replicas int32, aggType workload.AggregationType) { | ||
| aggregate, ok := aggregates[metricName] | ||
| if !ok { | ||
| aggregate = &externalMetricAggregate{aggregationType: aggType} | ||
| aggregates[metricName] = aggregate | ||
| } | ||
| aggregate.count++ | ||
| if aggType == workload.AggregationTypeAvg { | ||
| if replicas > 0 { | ||
| aggregate.weightedSum += metricValue * float64(replicas) | ||
| aggregate.weight += replicas | ||
| return | ||
| } | ||
| aggregate.sum += metricValue | ||
| return | ||
| } | ||
| aggregate.sum += metricValue | ||
| } |
| // +kubebuilder:validation:Enum=Sum;Avg | ||
| // +kubebuilder:default="Sum" | ||
| // +optional | ||
| Aggregation AggregationType `json:"aggregation,omitempty"` |
There was a problem hiding this comment.
Could we not call aggregationType, prefer MetricTargetType or MetricType in HPA
type MetricTargetType string
const (
// UtilizationMetricType declares a MetricTarget is an AverageUtilization value
UtilizationMetricType MetricTargetType = "Utilization"
// ValueMetricType declares a MetricTarget is a raw value
ValueMetricType MetricTargetType = "Value"
// AverageValueMetricType declares a MetricTarget is an
AverageValueMetricType MetricTargetType = "AverageValue"
)There was a problem hiding this comment.
Because we already introduced support from prometheus as well bny @LiZhenCheng9527
kube-gopher
left a comment
There was a problem hiding this comment.
It is suggested to clarify that Avg is intentionally a weighted sum, and that we track desired = metric/target; simply adding comments directly will suffice.
What type of PR is this?
/kind bug
What this PR does / why we need it:
The heterogeneous optimizer was summing all external metrics across backends. That is correct for additive metrics such as queue length, but it is wrong for ratio-like metrics such as GPU utilization or cache usage. For example, two backends reporting 60% and 70% utilization should not become 130% utilization after aggregation.
This PR keeps additive external metrics as sums and averages ratio-like external metrics by backend replica count. It uses the short-term heuristic from the issue: metric names ending in utilization, usage, percent, percentage, ratio, rate, or perc are treated as ratio-like metrics. Metrics without those suffixes keep the existing sum behavior.
Which issue(s) this PR fixes:
Fixes #1202
Special notes for your reviewer:
N/A
Tested with:
go test ./pkg/autoscaler/...go test $(go list ./... | grep -v /test/e2e | grep -v /client-go)go build ./cmd/kthena-controller-manager ./cmd/kthena-router ./cli/kthenaDoes this PR introduce a user-facing change?: