autoscaler: Fix external metric aggregation by Aamod007 · Pull Request #1207 · volcano-sh/kthena

Aamod007 · 2026-06-15T14:55:31Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

The heterogeneous optimizer was summing all external metrics across backends. That is correct for additive metrics such as queue length, but it is wrong for ratio-like metrics such as GPU utilization or cache usage. For example, two backends reporting 60% and 70% utilization should not become 130% utilization after aggregation.
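The difference can be sketched with the two readings from the example above. This is a minimal illustration, not code from the PR: `sumValues` and `meanValues` are hypothetical helper names, and the mean here is a plain unweighted mean.

```go
package main

import "fmt"

// sumValues adds every backend's reading; suitable for additive metrics
// such as queue length.
func sumValues(values []float64) float64 {
	total := 0.0
	for _, v := range values {
		total += v
	}
	return total
}

// meanValues averages the readings; suitable for ratio-like metrics.
// It returns 0 for an empty slice to avoid dividing by zero.
func meanValues(values []float64) float64 {
	if len(values) == 0 {
		return 0
	}
	return sumValues(values) / float64(len(values))
}

func main() {
	utilization := []float64{60, 70} // percent, one reading per backend

	fmt.Println(sumValues(utilization))  // 130, not a meaningful utilization
	fmt.Println(meanValues(utilization)) // 65
}
```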

This PR keeps additive external metrics as sums and averages ratio-like external metrics by backend replica count. It uses the short-term heuristic from the issue: metric names ending in utilization, usage, percent, percentage, ratio, rate, or perc are treated as ratio-like metrics. Metrics without those suffixes keep the existing sum behavior.
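A suffix check along these lines would implement the heuristic. It is only a sketch: `isRatioLike` and `ratioSuffixes` are hypothetical names, and lowercasing the metric name before matching is an assumption rather than something stated in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// ratioSuffixes lists the suffixes named in the description above.
var ratioSuffixes = []string{
	"utilization", "usage", "percent", "percentage", "ratio", "rate", "perc",
}

// isRatioLike reports whether a metric name ends in one of the suffixes.
// Hypothetical helper; case-insensitive matching is an assumption.
func isRatioLike(metricName string) bool {
	name := strings.ToLower(metricName)
	for _, suffix := range ratioSuffixes {
		if strings.HasSuffix(name, suffix) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isRatioLike("gpu_utilization")) // true
	fmt.Println(isRatioLike("queue_length"))    // false
	fmt.Println(isRatioLike("request_rate"))    // true
}
```

Note that a name such as `request_rate` also matches, even though a throughput metric is additive, so a name-based heuristic can misclassify metrics.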

Which issue(s) this PR fixes:

Fixes #1202

Special notes for your reviewer:

N/A

Tested with:

go test ./pkg/autoscaler/...
go test $(go list ./... | grep -v /test/e2e | grep -v /client-go)
go build ./cmd/kthena-controller-manager ./cmd/kthena-router ./cli/kthena

Does this PR introduce a user-facing change?:

NONE

gemini-code-assist

Code Review

This pull request introduces a mechanism to aggregate external metrics across backends by distinguishing between additive metrics (which are summed) and ratio-like metrics (which are averaged using replica-weighted values). The code reviewer identified a critical bug where ratio-like metrics must be aggregated as a weighted sum rather than a weighted average to prevent incorrect downscaling in the downstream autoscaling algorithm. Additionally, the reviewer pointed out that throughput metrics ending in 'rate' are additive and should not be treated as ratio-like, and suggested a more robust suffix-matching heuristic to avoid false positives. Finally, the reviewer recommended updating the unit tests to align with these corrections.
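The downscaling concern can be illustrated with a small sketch. It assumes the downstream algorithm computes the desired replica count as ceil(metric / target), where target is a per-replica value; that formula, the target of 70, and all names below are assumptions for illustration, not the actual kthena algorithm.

```go
package main

import (
	"fmt"
	"math"
)

// backendSample is a hypothetical per-backend reading.
type backendSample struct {
	value    float64 // per-replica reading, e.g. utilization in percent
	replicas int32   // replicas currently running in this backend
}

// weightedSum multiplies each reading by its replica count and adds them up.
func weightedSum(samples []backendSample) float64 {
	total := 0.0
	for _, s := range samples {
		total += s.value * float64(s.replicas)
	}
	return total
}

// weightedAverage divides the weighted sum by the total replica count.
func weightedAverage(samples []backendSample) float64 {
	var replicas int32
	for _, s := range samples {
		replicas += s.replicas
	}
	if replicas == 0 {
		return 0
	}
	return weightedSum(samples) / float64(replicas)
}

// desiredReplicas assumes desired = ceil(metric / target).
func desiredReplicas(metric, target float64) int32 {
	return int32(math.Ceil(metric / target))
}

func main() {
	samples := []backendSample{{value: 60, replicas: 2}, {value: 90, replicas: 1}}
	target := 70.0 // assumed per-replica target

	fmt.Println(desiredReplicas(weightedSum(samples), target))     // 3
	fmt.Println(desiredReplicas(weightedAverage(samples), target)) // 1
}
```

Under that assumption, feeding the weighted average into the formula asks for 1 replica while 3 are running at roughly the target load, whereas the weighted sum keeps the count at 3.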


hzxuzhonghu · 2026-06-16T02:16:50Z

cc @chenhuiluo please help review


volcano-sh-bot · 2026-06-16T05:12:17Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

This PR updates the heterogeneous autoscaler optimizer to support per-metric aggregation semantics for external metrics so that additive metrics (e.g., queue length) can continue to be summed while ratio-like metrics (e.g., utilization/usage) are aggregated differently to avoid invalid “>100%” style totals.

Changes:

Adds AggregationType (Sum/Avg) and an aggregation field to AutoscalingPolicyMetric, and propagates it through generated artifacts (CRDs/docs/applyconfig).
Updates the heterogeneous optimizer to aggregate external metrics using a per-metric strategy (new in-memory aggregation helpers).
Adds unit tests covering the external-metric aggregation behavior.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| pkg/autoscaler/autoscaler/optimizer.go | Replaces unconditional external-metric summation with aggregation-aware collection/finalization. |
| pkg/autoscaler/autoscaler/optimizer_test.go | Adds tests for new external-metric aggregation behavior. |
| pkg/autoscaler/autoscaler/metric_collector.go | Introduces `GetMetricAggregations` to derive aggregation strategy per metric. |
| pkg/apis/workload/v1alpha1/autoscalingpolicy_types.go | Adds `aggregation` field + `AggregationType` enum to the AutoscalingPolicyMetric API. |
| docs/kthena/docs/reference/crd/workload.serving.volcano.sh.md | Documents the new CRD field/type. |
| client-go/applyconfiguration/workload/v1alpha1/autoscalingpolicymetric.go | Adds applyconfig support for `aggregation`. |
| charts/kthena/charts/workload/crds/workload.serving.volcano.sh_modelboosters.yaml | Updates Helm CRD schema to include `aggregation`. |
| charts/kthena/charts/workload/crds/workload.serving.volcano.sh_autoscalingpolicies.yaml | Updates Helm CRD schema to include `aggregation`. |

Files not reviewed (1)

client-go/applyconfiguration/workload/v1alpha1/autoscalingpolicymetric.go: Generated file


+func finalizeExternalMetrics(aggregates map[string]*externalMetricAggregate) algorithm.Metrics {
+	metrics := make(algorithm.Metrics, len(aggregates))
+	for metricName, aggregate := range aggregates {
+		if aggregate.aggregationType == workload.AggregationTypeAvg {
+			if aggregate.weight > 0 {
+				metrics[metricName] = aggregate.weightedSum
+				continue
+			}
+			metrics[metricName] = aggregate.sum / float64(aggregate.count)
+			continue
+		}
+		metrics[metricName] = aggregate.sum
+	}
+	return metrics
+}


+		{
+			name: "ratio metrics use weighted sum by backend replica count",
+			samples: []struct {
+				metricName string
+				aggType    workload.AggregationType
+				value      float64
+				replicas   int32
+			}{
+				{metricName: "gpu_utilization", aggType: workload.AggregationTypeAvg, value: 60, replicas: 2},
+				{metricName: "gpu_utilization", aggType: workload.AggregationTypeAvg, value: 90, replicas: 1},
+			},
+			want: map[string]float64{
+				"gpu_utilization": 210,
+			},
+		},


+func GetMetricAggregations(autoscalePolicy *v1alpha1.AutoscalingPolicy) map[string]v1alpha1.AggregationType {
+	metricAggregations := make(map[string]v1alpha1.AggregationType)
+	if autoscalePolicy == nil {
+		klog.Warning("autoscalePolicy is nil, can't get metricAggregations")
+		return metricAggregations
+	}
+
+	for _, metric := range autoscalePolicy.Spec.Metrics {
+		if metric.Aggregation == "" {
+			metricAggregations[metric.Name] = v1alpha1.AggregationTypeSum
+		} else {
+			metricAggregations[metric.Name] = metric.Aggregation
+		}
+	}
+	return metricAggregations
+}
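The defaulting behavior of `GetMetricAggregations` can be exercised in isolation. The sketch below uses local stand-in types instead of the real `v1alpha1` package: `policyMetric` and `aggregationsFor` are hypothetical names, and the enum string values are taken from the `Enum=Sum;Avg` marker.

```go
package main

import "fmt"

// AggregationType mirrors the enum values from the CRD marker (Sum;Avg).
type AggregationType string

const (
	AggregationTypeSum AggregationType = "Sum"
	AggregationTypeAvg AggregationType = "Avg"
)

// policyMetric is a hypothetical stand-in for v1alpha1.AutoscalingPolicyMetric,
// reduced to the two fields the function reads.
type policyMetric struct {
	Name        string
	Aggregation AggregationType
}

// aggregationsFor maps each metric name to its aggregation type,
// falling back to Sum when the field is left empty.
func aggregationsFor(metrics []policyMetric) map[string]AggregationType {
	result := make(map[string]AggregationType, len(metrics))
	for _, metric := range metrics {
		if metric.Aggregation == "" {
			result[metric.Name] = AggregationTypeSum
			continue
		}
		result[metric.Name] = metric.Aggregation
	}
	return result
}

func main() {
	got := aggregationsFor([]policyMetric{
		{Name: "queue_length"}, // no aggregation set, defaults to Sum
		{Name: "gpu_utilization", Aggregation: AggregationTypeAvg},
	})
	fmt.Println(got["queue_length"], got["gpu_utilization"]) // Sum Avg
}
```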


+	// Aggregation defines how the metric should be aggregated across multiple instances.
+	// 'Sum' is used for additive metrics like queue length or request counts.
+	// 'Avg' is used for ratio/percentage metrics like GPU utilization or cache usage.
+	// +kubebuilder:validation:Enum=Sum;Avg
+	// +kubebuilder:default="Sum"
+	// +optional
+	Aggregation AggregationType `json:"aggregation,omitempty"`


+func addExternalMetric(aggregates map[string]*externalMetricAggregate, metricName string, metricValue float64, replicas int32, aggType workload.AggregationType) {
+	aggregate, ok := aggregates[metricName]
+	if !ok {
+		aggregate = &externalMetricAggregate{aggregationType: aggType}
+		aggregates[metricName] = aggregate
+	}
+	aggregate.count++
+	if aggType == workload.AggregationTypeAvg {
+		if replicas > 0 {
+			aggregate.weightedSum += metricValue * float64(replicas)
+			aggregate.weight += replicas
+			return
+		}
+		aggregate.sum += metricValue
+		return
+	}
+	aggregate.sum += metricValue
+}
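Taken together, the collection and finalization steps from the diff can be reproduced in a self-contained sketch. The struct layout of `externalMetricAggregate` is assumed from the field names used in the diff, the types are local stand-ins for the `workload` and `algorithm` packages, and `add` and `finalize` are shortened hypothetical names.

```go
package main

import "fmt"

// AggregationType mirrors the enum values from the CRD marker (Sum;Avg).
type AggregationType string

const (
	AggregationTypeSum AggregationType = "Sum"
	AggregationTypeAvg AggregationType = "Avg"
)

// externalMetricAggregate uses an assumed layout; the diff only shows field names.
type externalMetricAggregate struct {
	aggregationType AggregationType
	count           int
	sum             float64
	weightedSum     float64
	weight          int32
}

// add records one backend's reading, weighting Avg metrics by replica count.
func add(aggregates map[string]*externalMetricAggregate, name string, value float64, replicas int32, aggType AggregationType) {
	aggregate, ok := aggregates[name]
	if !ok {
		aggregate = &externalMetricAggregate{aggregationType: aggType}
		aggregates[name] = aggregate
	}
	aggregate.count++
	if aggType == AggregationTypeAvg && replicas > 0 {
		aggregate.weightedSum += value * float64(replicas)
		aggregate.weight += replicas
		return
	}
	aggregate.sum += value
}

// finalize turns the aggregates into one value per metric.
func finalize(aggregates map[string]*externalMetricAggregate) map[string]float64 {
	metrics := make(map[string]float64, len(aggregates))
	for name, aggregate := range aggregates {
		switch {
		case aggregate.aggregationType != AggregationTypeAvg:
			metrics[name] = aggregate.sum // additive metric: plain sum
		case aggregate.weight > 0:
			metrics[name] = aggregate.weightedSum // Avg: replica-weighted sum
		default:
			metrics[name] = aggregate.sum / float64(aggregate.count) // no replica info
		}
	}
	return metrics
}

func main() {
	aggregates := map[string]*externalMetricAggregate{}
	add(aggregates, "gpu_utilization", 60, 2, AggregationTypeAvg)
	add(aggregates, "gpu_utilization", 90, 1, AggregationTypeAvg)
	add(aggregates, "queue_length", 4, 2, AggregationTypeSum)
	add(aggregates, "queue_length", 6, 1, AggregationTypeSum)

	metrics := finalize(aggregates)
	fmt.Println(metrics["gpu_utilization"]) // 210 (60*2 + 90*1)
	fmt.Println(metrics["queue_length"])    // 10
}
```

The `gpu_utilization` result matches the 210 expected by the test case in the diff.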


hzxuzhonghu · 2026-06-17T06:24:41Z

+	// +kubebuilder:validation:Enum=Sum;Avg
+	// +kubebuilder:default="Sum"
+	// +optional
+	Aggregation AggregationType `json:"aggregation,omitempty"`


Could we avoid the name AggregationType and instead follow the HPA naming, such as MetricTargetType or MetricType?

type MetricTargetType string

const (
	// UtilizationMetricType declares a MetricTarget is an AverageUtilization value
	UtilizationMetricType MetricTargetType = "Utilization"
	// ValueMetricType declares a MetricTarget is a raw value
	ValueMetricType MetricTargetType = "Value"
	// AverageValueMetricType declares a MetricTarget is an
	AverageValueMetricType MetricTargetType = "AverageValue"
)

Because we have already introduced support for metrics from Prometheus as well, by @LiZhenCheng9527.

kube-gopher

Please clarify that Avg is intentionally a weighted sum and that we track desired = metric/target; adding code comments is sufficient.

volcano-sh-bot added the kind/bug label Jun 15, 2026

volcano-sh-bot requested review from LiZhenCheng9527 and hzxuzhonghu June 15, 2026 14:55

volcano-sh-bot added the size/L label Jun 15, 2026

gemini-code-assist Bot reviewed Jun 15, 2026


Comment thread pkg/autoscaler/autoscaler/optimizer.go

Comment thread pkg/autoscaler/autoscaler/optimizer.go Outdated

Comment thread pkg/autoscaler/autoscaler/optimizer_test.go

Comment thread pkg/autoscaler/autoscaler/optimizer_test.go Outdated

volcano-sh-bot added the do-not-merge/invalid-commit-message label Jun 15, 2026

Aamod007 force-pushed the codex/fix-external-metric-aggregation branch from f173e50 to 6007e5a Compare June 15, 2026 15:02

volcano-sh-bot removed the do-not-merge/invalid-commit-message label Jun 15, 2026

Aamod007 force-pushed the codex/fix-external-metric-aggregation branch from 6007e5a to 76b6935 Compare June 15, 2026 15:55

kube-gopher reviewed Jun 15, 2026


Comment thread pkg/autoscaler/autoscaler/optimizer.go

volcano-sh-bot added the do-not-merge/contains-merge-commits label Jun 16, 2026

autoscaler: Fix external metric aggregation

5936ce1

Signed-off-by: Aamod007 <aamodkumar2006@gmail.com>

Aamod007 force-pushed the codex/fix-external-metric-aggregation branch from 2491315 to 3a379b6 Compare June 16, 2026 04:35

volcano-sh-bot removed the do-not-merge/contains-merge-commits label Jun 16, 2026

Aamod007 force-pushed the codex/fix-external-metric-aggregation branch 2 times, most recently from 1161334 to a2ae5b0 Compare June 16, 2026 04:45

feat: add explicit aggregation type to AutoscalingPolicyMetric CRD

82aeb34

Signed-off-by: Aamod007 <aamodkumar2006@gmail.com>

Aamod007 force-pushed the codex/fix-external-metric-aggregation branch from a2ae5b0 to 82aeb34 Compare June 16, 2026 05:12

hzxuzhonghu requested a review from Copilot June 16, 2026 06:58

Copilot started reviewing on behalf of hzxuzhonghu June 16, 2026 06:59

Copilot AI reviewed Jun 16, 2026


hzxuzhonghu reviewed Jun 17, 2026


kube-gopher reviewed Jun 18, 2026


Aamod007 wants to merge 2 commits into volcano-sh:main from Aamod007:codex/fix-external-metric-aggregation

5 participants
