autoscaler: Fix external metric aggregation#1207

Open
Aamod007 wants to merge 2 commits into
volcano-sh:mainfrom
Aamod007:codex/fix-external-metric-aggregation
Conversation

@Aamod007

@Aamod007 Aamod007 commented Jun 15, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

The heterogeneous optimizer was summing all external metrics across backends. That is correct for additive metrics such as queue length, but it is wrong for ratio-like metrics such as GPU utilization or cache usage. For example, two backends reporting 60% and 70% utilization should not become 130% utilization after aggregation.

This PR keeps additive external metrics as sums and averages ratio-like external metrics by backend replica count. It uses the short-term heuristic from the issue: metric names ending in utilization, usage, percent, percentage, ratio, rate, or perc are treated as ratio-like metrics. Metrics without those suffixes keep the existing sum behavior.
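
The suffix heuristic described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the helper name `isRatioLikeMetric` and the case-insensitive comparison are assumptions; only the suffix list comes from the description.

```go
package main

import (
	"fmt"
	"strings"
)

// ratioSuffixes is the suffix list quoted in the PR description.
var ratioSuffixes = []string{
	"utilization", "usage", "percent", "percentage", "ratio", "rate", "perc",
}

// isRatioLikeMetric is a hypothetical helper. It reports whether a metric
// name ends in one of the ratio-like suffixes. Lowercasing the name first
// is an assumption made for this sketch.
func isRatioLikeMetric(name string) bool {
	lower := strings.ToLower(name)
	for _, suffix := range ratioSuffixes {
		if strings.HasSuffix(lower, suffix) {
			return true
		}
	}
	return false
}

func main() {
	for _, name := range []string{"gpu_utilization", "kv_cache_usage", "queue_length", "request_rate"} {
		fmt.Printf("%s ratio-like=%v\n", name, isRatioLikeMetric(name))
	}
}
```

Note that a name such as `request_rate` matches the `rate` suffix even though a throughput rate is additive, which is the kind of case a name-based heuristic can get wrong.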

Which issue(s) this PR fixes:

Fixes #1202

Special notes for your reviewer:

N/A

Tested with:

  • go test ./pkg/autoscaler/...
  • go test $(go list ./... | grep -v /test/e2e | grep -v /client-go)
  • go build ./cmd/kthena-controller-manager ./cmd/kthena-router ./cli/kthena

Does this PR introduce a user-facing change?:

NONE

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a mechanism to aggregate external metrics across backends by distinguishing between additive metrics (which are summed) and ratio-like metrics (which are averaged using replica-weighted values). The code reviewer identified a critical bug where ratio-like metrics must be aggregated as a weighted sum rather than a weighted average to prevent incorrect downscaling in the downstream autoscaling algorithm. Additionally, the reviewer pointed out that throughput metrics ending in 'rate' are additive and should not be treated as ratio-like, and suggested a more robust suffix-matching heuristic to avoid false positives. Finally, the reviewer recommended updating the unit tests to align with these corrections.
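
One possible reading of the "more robust suffix-matching" suggestion is to compare only the last delimiter-separated token of the metric name and to leave `rate` out of the ratio-like set. The sketch below is an assumption about what such a heuristic might look like, not the reviewer's actual proposal; `ratioTokens`, `lastToken`, `isRatioLikeToken`, and the delimiter set are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// ratioTokens holds the final name tokens treated as ratio-like in this sketch.
// "rate" is left out on purpose because throughput rates are additive.
var ratioTokens = map[string]bool{
	"utilization": true, "usage": true, "percent": true,
	"percentage": true, "ratio": true, "perc": true,
}

// lastToken returns the lowercased final token of a metric name, splitting on
// '_', '-', '.', and ':' (an assumed delimiter set). It returns "" if the name
// contains no tokens.
func lastToken(name string) string {
	tokens := strings.FieldsFunc(strings.ToLower(name), func(r rune) bool {
		return r == '_' || r == '-' || r == '.' || r == ':'
	})
	if len(tokens) == 0 {
		return ""
	}
	return tokens[len(tokens)-1]
}

// isRatioLikeToken reports whether the final token is in the ratio-like set.
func isRatioLikeToken(name string) bool {
	return ratioTokens[lastToken(name)]
}

func main() {
	for _, name := range []string{"gpu_utilization", "request_rate", "migrate"} {
		fmt.Printf("%s ratio-like=%v\n", name, isRatioLikeToken(name))
	}
}
```

Matching whole tokens avoids accidental hits on names that merely end in the same letters, such as `migrate`.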

Comment thread pkg/autoscaler/autoscaler/optimizer.go
Comment thread pkg/autoscaler/autoscaler/optimizer.go Outdated
Comment thread pkg/autoscaler/autoscaler/optimizer_test.go
Comment thread pkg/autoscaler/autoscaler/optimizer_test.go Outdated
Comment thread pkg/autoscaler/autoscaler/optimizer.go
@hzxuzhonghu

Member

cc @chenhuiluo please help review

Signed-off-by: Aamod007 <aamodkumar2006@gmail.com>
@Aamod007 Aamod007 force-pushed the codex/fix-external-metric-aggregation branch from 2491315 to 3a379b6 Compare June 16, 2026 04:35
@Aamod007 Aamod007 force-pushed the codex/fix-external-metric-aggregation branch 2 times, most recently from 1161334 to a2ae5b0 Compare June 16, 2026 04:45
Signed-off-by: Aamod007 <aamodkumar2006@gmail.com>
@Aamod007 Aamod007 force-pushed the codex/fix-external-metric-aggregation branch from a2ae5b0 to 82aeb34 Compare June 16, 2026 05:12
@volcano-sh-bot

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI left a comment

Contributor

Pull request overview

This PR updates the heterogeneous autoscaler optimizer to support per-metric aggregation semantics for external metrics so that additive metrics (e.g., queue length) can continue to be summed while ratio-like metrics (e.g., utilization/usage) are aggregated differently to avoid invalid “>100%” style totals.

Changes:

  • Adds AggregationType (Sum/Avg) and an aggregation field to AutoscalingPolicyMetric, and propagates it through generated artifacts (CRDs/docs/applyconfig).
  • Updates the heterogeneous optimizer to aggregate external metrics using a per-metric strategy (new in-memory aggregation helpers).
  • Adds unit tests covering the external-metric aggregation behavior.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
| File | Description |
| --- | --- |
| pkg/autoscaler/autoscaler/optimizer.go | Replaces unconditional external-metric summation with aggregation-aware collection/finalization. |
| pkg/autoscaler/autoscaler/optimizer_test.go | Adds tests for new external-metric aggregation behavior. |
| pkg/autoscaler/autoscaler/metric_collector.go | Introduces GetMetricAggregations to derive aggregation strategy per metric. |
| pkg/apis/workload/v1alpha1/autoscalingpolicy_types.go | Adds aggregation field + AggregationType enum to the AutoscalingPolicyMetric API. |
| docs/kthena/docs/reference/crd/workload.serving.volcano.sh.md | Documents the new CRD field/type. |
| client-go/applyconfiguration/workload/v1alpha1/autoscalingpolicymetric.go | Adds applyconfig support for aggregation. |
| charts/kthena/charts/workload/crds/workload.serving.volcano.sh_modelboosters.yaml | Updates Helm CRD schema to include aggregation. |
| charts/kthena/charts/workload/crds/workload.serving.volcano.sh_autoscalingpolicies.yaml | Updates Helm CRD schema to include aggregation. |
Files not reviewed (1)
  • client-go/applyconfiguration/workload/v1alpha1/autoscalingpolicymetric.go: Generated file

Comment on lines +247 to +261
func finalizeExternalMetrics(aggregates map[string]*externalMetricAggregate) algorithm.Metrics {
	metrics := make(algorithm.Metrics, len(aggregates))
	for metricName, aggregate := range aggregates {
		if aggregate.aggregationType == workload.AggregationTypeAvg {
			if aggregate.weight > 0 {
				metrics[metricName] = aggregate.weightedSum
				continue
			}
			metrics[metricName] = aggregate.sum / float64(aggregate.count)
			continue
		}
		metrics[metricName] = aggregate.sum
	}
	return metrics
}
Comment on lines +222 to +236
{
	name: "ratio metrics use weighted sum by backend replica count",
	samples: []struct {
		metricName string
		aggType    workload.AggregationType
		value      float64
		replicas   int32
	}{
		{metricName: "gpu_utilization", aggType: workload.AggregationTypeAvg, value: 60, replicas: 2},
		{metricName: "gpu_utilization", aggType: workload.AggregationTypeAvg, value: 90, replicas: 1},
	},
	want: map[string]float64{
		"gpu_utilization": 210,
	},
},
Comment on lines +102 to +117
func GetMetricAggregations(autoscalePolicy *v1alpha1.AutoscalingPolicy) map[string]v1alpha1.AggregationType {
	metricAggregations := make(map[string]v1alpha1.AggregationType)
	if autoscalePolicy == nil {
		klog.Warning("autoscalePolicy is nil, can't get metricAggregations")
		return metricAggregations
	}

	for _, metric := range autoscalePolicy.Spec.Metrics {
		if metric.Aggregation == "" {
			metricAggregations[metric.Name] = v1alpha1.AggregationTypeSum
		} else {
			metricAggregations[metric.Name] = metric.Aggregation
		}
	}
	return metricAggregations
}
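
The defaulting behavior in the hunk above (an empty aggregation falls back to Sum) can be tried in isolation with stand-in types. In this sketch, `policyMetric` and `metricAggregations` are hypothetical stand-ins for the real `v1alpha1` types and `GetMetricAggregations`; the string values "Sum" and "Avg" follow the enum markers in the API change.

```go
package main

import "fmt"

// AggregationType is a local stand-in for v1alpha1.AggregationType.
type AggregationType string

const (
	AggregationTypeSum AggregationType = "Sum"
	AggregationTypeAvg AggregationType = "Avg"
)

// policyMetric is a hypothetical, trimmed-down stand-in for AutoscalingPolicyMetric.
type policyMetric struct {
	Name        string
	Aggregation AggregationType
}

// metricAggregations maps each metric name to its aggregation type,
// defaulting to Sum when the field is left empty.
func metricAggregations(metrics []policyMetric) map[string]AggregationType {
	out := make(map[string]AggregationType, len(metrics))
	for _, m := range metrics {
		if m.Aggregation == "" {
			out[m.Name] = AggregationTypeSum
			continue
		}
		out[m.Name] = m.Aggregation
	}
	return out
}

func main() {
	got := metricAggregations([]policyMetric{
		{Name: "queue_length"}, // no aggregation set, falls back to Sum
		{Name: "gpu_utilization", Aggregation: AggregationTypeAvg},
	})
	fmt.Println(got["queue_length"], got["gpu_utilization"])
}
```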
Comment on lines +49 to +55
// Aggregation defines how the metric should be aggregated across multiple instances.
// 'Sum' is used for additive metrics like queue length or request counts.
// 'Avg' is used for ratio/percentage metrics like GPU utilization or cache usage.
// +kubebuilder:validation:Enum=Sum;Avg
// +kubebuilder:default="Sum"
// +optional
Aggregation AggregationType `json:"aggregation,omitempty"`
Comment on lines +228 to +245
func addExternalMetric(aggregates map[string]*externalMetricAggregate, metricName string, metricValue float64, replicas int32, aggType workload.AggregationType) {
	aggregate, ok := aggregates[metricName]
	if !ok {
		aggregate = &externalMetricAggregate{aggregationType: aggType}
		aggregates[metricName] = aggregate
	}
	aggregate.count++
	if aggType == workload.AggregationTypeAvg {
		if replicas > 0 {
			aggregate.weightedSum += metricValue * float64(replicas)
			aggregate.weight += replicas
			return
		}
		aggregate.sum += metricValue
		return
	}
	aggregate.sum += metricValue
}
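
To see how the add and finalize steps combine, here is a condensed, self-contained sketch that mirrors the logic of the two hunks above with local stand-in types. The field types of `externalMetricAggregate` are inferred from how the diff uses them, and the shortened names `add` and `finalize` are hypothetical; treat it as a way to check the arithmetic rather than as the PR's code.

```go
package main

import "fmt"

// AggregationType is a local stand-in for workload.AggregationType.
type AggregationType string

const (
	AggregationTypeSum AggregationType = "Sum"
	AggregationTypeAvg AggregationType = "Avg"
)

// externalMetricAggregate mirrors the fields used in the diff; the field
// types are assumed.
type externalMetricAggregate struct {
	aggregationType AggregationType
	sum             float64
	weightedSum     float64
	weight          int32
	count           int
}

// add records one backend's sample. Avg samples with a positive replica count
// go into the replica-weighted sum; everything else goes into the plain sum.
func add(aggs map[string]*externalMetricAggregate, name string, value float64, replicas int32, t AggregationType) {
	agg, ok := aggs[name]
	if !ok {
		agg = &externalMetricAggregate{aggregationType: t}
		aggs[name] = agg
	}
	agg.count++
	if t == AggregationTypeAvg && replicas > 0 {
		agg.weightedSum += value * float64(replicas)
		agg.weight += replicas
		return
	}
	agg.sum += value
}

// finalize turns the aggregates into one value per metric name.
func finalize(aggs map[string]*externalMetricAggregate) map[string]float64 {
	out := make(map[string]float64, len(aggs))
	for name, agg := range aggs {
		switch {
		case agg.aggregationType == AggregationTypeAvg && agg.weight > 0:
			out[name] = agg.weightedSum // intentionally a weighted sum, not an average
		case agg.aggregationType == AggregationTypeAvg:
			out[name] = agg.sum / float64(agg.count) // no replica info: plain mean
		default:
			out[name] = agg.sum
		}
	}
	return out
}

func main() {
	aggs := map[string]*externalMetricAggregate{}
	add(aggs, "gpu_utilization", 60, 2, AggregationTypeAvg)
	add(aggs, "gpu_utilization", 90, 1, AggregationTypeAvg)
	add(aggs, "queue_length", 4, 2, AggregationTypeSum)
	add(aggs, "queue_length", 6, 1, AggregationTypeSum)
	fmt.Println(finalize(aggs)) // map[gpu_utilization:210 queue_length:10]
}
```

The `gpu_utilization` result of 210 (60 × 2 + 90 × 1) matches the expectation in the test case quoted earlier in this review.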
// +kubebuilder:validation:Enum=Sum;Avg
// +kubebuilder:default="Sum"
// +optional
Aggregation AggregationType `json:"aggregation,omitempty"`

Member

Could we avoid calling this aggregationType? I would prefer MetricTargetType or MetricType, as in HPA:

type MetricTargetType string

const (
	// UtilizationMetricType declares a MetricTarget is an AverageUtilization value
	UtilizationMetricType MetricTargetType = "Utilization"
	// ValueMetricType declares a MetricTarget is a raw value
	ValueMetricType MetricTargetType = "Value"
	// AverageValueMetricType declares a MetricTarget is an
	AverageValueMetricType MetricTargetType = "AverageValue"
)

Member

Because we already introduced support from Prometheus as well, by @LiZhenCheng9527

@kube-gopher kube-gopher left a comment

Contributor

I suggest clarifying that Avg is intentionally a weighted sum, and that we track desired = metric/target; adding code comments to that effect would suffice.
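
A worked example may help show why the weighted sum matters under desired = metric/target. It uses the values from the unit test in this PR (60 on 2 replicas, 90 on 1 replica); the target $T = 70$ is an assumed value chosen only for illustration.

```latex
\begin{aligned}
\text{weighted sum:} \quad & m = 60 \cdot 2 + 90 \cdot 1 = 210, & \text{desired} &= \frac{m}{T} = \frac{210}{70} = 3 \\
\text{weighted average:} \quad & \bar{m} = \frac{210}{2 + 1} = 70, & \text{desired} &= \frac{\bar{m}}{T} = \frac{70}{70} = 1
\end{aligned}
```

With the weighted sum, the result stays at the current total of 3 replicas when average utilization equals the target; with the weighted average, the same formula would yield 1 and trigger a scale-down.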

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Heterogeneous optimizer aggregates ratio metrics with plain sum, causing incorrect scaling decisions

5 participants