fix(scheduler): treat zero TTFT/TPOT as uninitialized in least-latency plugin by Sanchit2662 · Pull Request #1040 · volcano-sh/kthena

Sanchit2662 · 2026-05-13T16:09:31Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

While going through the router code to understand how scheduling works (for the benchmark project), I found a bug in the least-latency scoring plugin that has been there since the plugin was added.

When a new pod starts up, its TTFT and TPOT fields are 0.0 by default because nothing has been measured yet. The metrics scraper already accounts for this and skips zero values when updating PodInfo. The scoring plugin does not: it treats 0 as a real latency value, so the new pod appears to have the lowest latency of any pod and gets a perfect score of 100.

As a result, during a rolling update or scale-out, the fresh cold pod gets all the traffic while the warm pods sit idle, which is exactly what the plugin is supposed to prevent. I noticed this while looking at how the min/max normalization works: including zero in the min skews every other pod's score as well.
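To make the failure mode concrete, here is a minimal sketch of min/max normalization with the old `< 0` guard. It is not the plugin's code: `podLatency`, `normalize`, `preFixScores`, the equal 0.5/0.5 weighting, and the sample latencies are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// MaxScore is the 100-point ceiling mentioned in the PR description.
const MaxScore = 100.0

// podLatency is a hypothetical stand-in for the latency fields on PodInfo.
type podLatency struct {
	name       string
	ttft, tpot float64 // sample values are made up, in milliseconds
}

// normalize maps a latency onto [0, MaxScore]; lower latency scores higher.
// The formula and the equal-range fallback are assumptions.
func normalize(v, lo, hi float64) float64 {
	if hi == lo {
		return MaxScore
	}
	return MaxScore * (hi - v) / (hi - lo)
}

// preFixScores mimics the old guard (< 0): a zero latency is kept and
// becomes the minimum of the normalization range.
func preFixScores(pods []podLatency) map[string]int {
	var minTTFT, maxTTFT, minTPOT, maxTPOT float64
	seen := false
	for _, p := range pods {
		if p.ttft < 0 || p.tpot < 0 {
			continue
		}
		if !seen {
			minTTFT, maxTTFT, minTPOT, maxTPOT = p.ttft, p.ttft, p.tpot, p.tpot
			seen = true
			continue
		}
		minTTFT, maxTTFT = math.Min(minTTFT, p.ttft), math.Max(maxTTFT, p.ttft)
		minTPOT, maxTPOT = math.Min(minTPOT, p.tpot), math.Max(maxTPOT, p.tpot)
	}
	scores := make(map[string]int, len(pods))
	for _, p := range pods {
		// Equal weighting of TTFT and TPOT is an assumption.
		s := 0.5*normalize(p.ttft, minTTFT, maxTTFT) + 0.5*normalize(p.tpot, minTPOT, maxTPOT)
		scores[p.name] = int(s)
	}
	return scores
}

func main() {
	pods := []podLatency{
		{"cold", 0, 0},      // never measured
		{"warm-a", 200, 20}, // fastest measured pod
		{"warm-b", 400, 40},
	}
	fmt.Println(preFixScores(pods))
}
```

With these sample values the cold pod scores 100, while the fastest warm pod is pushed down to 50 because zero is the range minimum.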

The fix is small. In calculateMinMaxMetrics, change < 0 to <= 0 so zero values are excluded from the min/max calculation. In the scoring loop, give uninitialized pods a neutral score of 50 instead of accidentally giving them 100. That way new pods still get some traffic through the other plugins but don't monopolize dispatch.
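The two changes can be sketched together as follows. This is a simplified illustration under stated assumptions, not the plugin source: only `calculateMinMaxMetrics`, `MaxScore`, and the `<= 0` guards come from the PR, while `podLatency`, `normalize`, `score`, the equal weighting, and the sample data are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

const MaxScore = 100.0

// podLatency is a hypothetical stand-in for the latency fields on PodInfo.
type podLatency struct {
	name       string
	ttft, tpot float64 // sample values are made up, in milliseconds
}

// calculateMinMaxMetrics skips pods whose TTFT or TPOT is <= 0, matching the
// guard in this PR. It returns all zeros when no pod has measured data.
func calculateMinMaxMetrics(pods []podLatency) (minTTFT, maxTTFT, minTPOT, maxTPOT float64) {
	seen := false
	for _, p := range pods {
		if p.ttft <= 0 || p.tpot <= 0 {
			continue
		}
		if !seen {
			minTTFT, maxTTFT, minTPOT, maxTPOT = p.ttft, p.ttft, p.tpot, p.tpot
			seen = true
			continue
		}
		minTTFT, maxTTFT = math.Min(minTTFT, p.ttft), math.Max(maxTTFT, p.ttft)
		minTPOT, maxTPOT = math.Min(minTPOT, p.tpot), math.Max(maxTPOT, p.tpot)
	}
	return
}

// normalize is an assumed formula: lower latency maps to a higher score.
func normalize(v, lo, hi float64) float64 {
	if hi == lo {
		return MaxScore
	}
	return MaxScore * (hi - v) / (hi - lo)
}

// score gives uninitialized pods the neutral MaxScore/2 and normalizes the rest.
func score(pods []podLatency) map[string]int {
	minTTFT, maxTTFT, minTPOT, maxTPOT := calculateMinMaxMetrics(pods)
	scores := make(map[string]int, len(pods))
	for _, p := range pods {
		if p.ttft <= 0 || p.tpot <= 0 {
			scores[p.name] = int(MaxScore / 2)
			continue
		}
		// Equal weighting of TTFT and TPOT is an assumption.
		s := 0.5*normalize(p.ttft, minTTFT, maxTTFT) + 0.5*normalize(p.tpot, minTPOT, maxTPOT)
		scores[p.name] = int(s)
	}
	return scores
}

func main() {
	pods := []podLatency{{"cold", 0, 0}, {"warm-a", 200, 20}, {"warm-b", 400, 40}}
	fmt.Println(score(pods))
}
```

On the same sample data the cold pod now lands at 50, the fastest warm pod at 100, and the slowest at 0. If every pod is cold, all of them receive 50.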

Does this PR introduce a user-facing change?:

Fixed least-latency scheduling plugin incorrectly routing all traffic to cold/new pods during rolling updates. New pods with no latency data yet now get a neutral score instead of the highest score.


volcano-sh-bot · 2026-05-13T16:09:38Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.


Needs approval from an approver in each of these files:

pkg/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

Fixes least-latency scheduler scoring to avoid treating default zero TTFT/TPOT values as “best latency,” which previously caused cold/new pods to receive disproportionately high scores during rollouts/scale-outs.

Changes:

Treat TTFT/TPOT values <= 0 as uninitialized/invalid and exclude them from min/max normalization.
Assign pods with uninitialized latency a neutral score (50) instead of allowing them to normalize to an outsized score.


+		// Pods with no observed latency yet (zero default) are uninitialized — assign a
+		// neutral mid-range score so they receive some traffic via other plugins without
+		// monopolizing dispatch ahead of warm, measured replicas.
+		if ttft <= 0 || tpot <= 0 {
+			scoreResults[info] = int(MaxScore / 2)
+			continue


gemini-code-assist

Code Review

This pull request updates the LeastLatency scheduler plugin to assign a neutral score to uninitialized pods (where TTFT or TPOT is zero) and adjusts the min/max metric calculation to exclude these values. The reviewer suggests implementing a more granular scoring approach that utilizes available metrics individually when only one is present, rather than treating pods as uninitialized if either metric is missing, which would improve ranking accuracy.

gemini-code-assist · 2026-05-13T16:13:53Z

+		if ttft <= 0 || tpot <= 0 {
+			scoreResults[info] = int(MaxScore / 2)
+			continue
+		}


This logic treats a pod as fully uninitialized if either TTFT or TPOT is zero, assigning it a neutral score. This approach discards potentially useful information. For instance, a pod with a valid TTFT but no TPOT yet will be scored the same as a completely cold pod.

Consider a more granular scoring approach:

If both metrics are missing, assign the neutral score.

If only one metric is available, calculate the score based on that single metric.

If both are available, use the weighted average.

This would provide a more accurate ranking by utilizing all available latency data. This would also require a corresponding change in calculateMinMaxMetrics to compute min/max for each metric independently.
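The three cases in this suggestion could look roughly like the sketch below. It is one possible reading of the proposal, not code from the PR: `metricRange`, `scoreOne`, `granularScore`, the equal weighting, and the sample ranges are all hypothetical.

```go
package main

import "fmt"

const MaxScore = 100.0

// metricRange is a hypothetical per-metric min/max; ok is false when no pod
// has reported a positive value for that metric.
type metricRange struct {
	min, max float64
	ok       bool
}

// scoreOne scores a single metric, reporting false when it is unavailable.
func scoreOne(v float64, r metricRange) (float64, bool) {
	if v <= 0 || !r.ok {
		return 0, false
	}
	if r.max == r.min {
		return MaxScore, true
	}
	return MaxScore * (r.max - v) / (r.max - r.min), true
}

// granularScore uses whichever metrics are available for the pod.
func granularScore(ttft, tpot float64, ttftR, tpotR metricRange) int {
	ts, tOK := scoreOne(ttft, ttftR)
	ps, pOK := scoreOne(tpot, tpotR)
	switch {
	case tOK && pOK:
		return int(0.5*ts + 0.5*ps) // equal weights are an assumption
	case tOK:
		return int(ts)
	case pOK:
		return int(ps)
	default:
		return int(MaxScore / 2) // both missing: neutral score
	}
}

func main() {
	ttftR := metricRange{min: 200, max: 400, ok: true}
	tpotR := metricRange{min: 20, max: 40, ok: true}
	fmt.Println(granularScore(0, 0, ttftR, tpotR))    // both missing
	fmt.Println(granularScore(200, 0, ttftR, tpotR))  // TTFT only
	fmt.Println(granularScore(200, 40, ttftR, tpotR)) // both available
}
```

One trade-off to keep in mind: a pod scored on a single metric is compared against pods scored on a blend of two, so the resulting numbers are not strictly on the same footing.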

gemini-code-assist · 2026-05-13T16:13:53Z

+		if ttft <= 0 || tpot <= 0 {
 			continue
 		}


This condition skips the pod if either metric is uninitialized, which prevents the use of valid data from the other metric when calculating min/max ranges.

To support more granular scoring as suggested for the Score function, this function should be updated to calculate min/max for TTFT and TPOT independently, only considering values greater than zero for each metric's calculation.
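A sketch of what independent per-metric ranges might look like is below. The names `metricRange`, `extend`, and `calculateRanges` are hypothetical, and the sample values are made up; the point is only that a pod with a valid TTFT but no TPOT still contributes to the TTFT range.

```go
package main

import "fmt"

// podLatency is a hypothetical stand-in for the latency fields on PodInfo.
type podLatency struct {
	ttft, tpot float64
}

// metricRange holds the min/max of one metric; ok is false if no pod has
// reported a positive value for it.
type metricRange struct {
	min, max float64
	ok       bool
}

// extend folds one observation into the range, ignoring values <= 0.
func (r metricRange) extend(v float64) metricRange {
	if v <= 0 {
		return r
	}
	if !r.ok {
		return metricRange{min: v, max: v, ok: true}
	}
	if v < r.min {
		r.min = v
	}
	if v > r.max {
		r.max = v
	}
	return r
}

// calculateRanges computes the TTFT and TPOT ranges independently.
func calculateRanges(pods []podLatency) (ttftR, tpotR metricRange) {
	for _, p := range pods {
		ttftR = ttftR.extend(p.ttft)
		tpotR = tpotR.extend(p.tpot)
	}
	return
}

func main() {
	pods := []podLatency{
		{ttft: 100, tpot: 0}, // TTFT observed, TPOT not yet
		{ttft: 200, tpot: 20},
		{ttft: 400, tpot: 40},
		{ttft: 0, tpot: 0}, // fully cold
	}
	ttftR, tpotR := calculateRanges(pods)
	fmt.Printf("%+v %+v\n", ttftR, tpotR)
}
```

With the combined `||` guard, the first pod would be skipped entirely and the TTFT minimum would be 200 instead of 100.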

hzxuzhonghu · 2026-05-14T07:09:20Z

-		// Skip pods with invalid values
-		if ttft < 0 || tpot < 0 {
+		// Skip pods with no observed data (zero is the uninitialized default, not a valid measurement)
+		if ttft <= 0 || tpot <= 0 {


edge case sanity check: if all pods are uninitialized, this function returns (0,0,0,0) for min/max — but that's fine because every pod hits the <= 0 guard in the scoring loop above and gets MaxScore/2. Just confirming the edge case was thought through.

Sanchit2662 · 2026-05-14

yeah the edge case is fine - all uninitialized pods get 50 and distribute evenly, least-request takes over from there.

hzxuzhonghu · 2026-05-14T07:09:20Z

+		// Pods with no observed latency yet (zero default) are uninitialized — assign a
+		// neutral mid-range score so they receive some traffic via other plugins without
+		// monopolizing dispatch ahead of warm, measured replicas.
+		if ttft <= 0 || tpot <= 0 {


using || here means if TTFT has been observed but TPOT hasn't (or vice versa), we still treat the pod as fully uninitialized — is that intentional? Could see an argument for && if you want to score a partially-warmed pod on whatever metric is available. Not a blocker, just want to make sure the choice was deliberate.

Sanchit2662 · 2026-05-14

yeah it's intentional. if either metric is missing, the normalized score would be skewed anyway since the min/max calculation also skips pods where either value is zero. mixing a real TTFT score with a fake TPOT score felt worse than just giving it 50 and moving on.

FAUST-BENCHOU

Makes sense to me.

hzxuzhonghu

Should we declare a const for the score used when a pod is uninitialized, instead of the skip in L115?
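One reading of this suggestion is to replace the inline `int(MaxScore / 2)` with a named constant. A minimal sketch follows, where `UninitializedScore` and `isUninitialized` are hypothetical names and `MaxScore` is assumed to be 100 as described in the PR:

```go
package main

import "fmt"

const MaxScore = 100.0

// UninitializedScore is a hypothetical name for the neutral score given to
// pods that have no observed TTFT or TPOT yet.
const UninitializedScore = int(MaxScore / 2)

// isUninitialized mirrors the guard used in this PR.
func isUninitialized(ttft, tpot float64) bool {
	return ttft <= 0 || tpot <= 0
}

func main() {
	ttft, tpot := 0.0, 0.0
	if isUninitialized(ttft, tpot) {
		fmt.Println("score:", UninitializedScore)
	}
}
```

A named constant keeps the scoring loop and any tests in agreement if the neutral value is tuned later.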

fix(scheduler): treat zero TTFT/TPOT as uninitialized in least-latenc…

690ab62

…y plugin Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>

Copilot AI review requested due to automatic review settings May 13, 2026 16:09

volcano-sh-bot added the kind/bug label May 13, 2026

volcano-sh-bot requested review from git-malu and hzxuzhonghu May 13, 2026 16:09

volcano-sh-bot added the size/S label May 13, 2026

Copilot started reviewing on behalf of Sanchit2662 May 13, 2026 16:10

Copilot AI reviewed May 13, 2026

gemini-code-assist Bot reviewed May 13, 2026

hzxuzhonghu reviewed May 14, 2026

Sanchit2662 requested a review from hzxuzhonghu May 14, 2026 11:31

FAUST-BENCHOU reviewed May 16, 2026

hzxuzhonghu reviewed May 18, 2026


fix(scheduler): treat zero TTFT/TPOT as uninitialized in least-latency plugin #1040
Sanchit2662 wants to merge 1 commit into volcano-sh:main from Sanchit2662:fix/least-latency-cold-pod-score
