Skip to content

fix(scheduler): treat zero TTFT/TPOT as uninitialized in least-latency plugin#1040

Open
Sanchit2662 wants to merge 1 commit into
volcano-sh:mainfrom
Sanchit2662:fix/least-latency-cold-pod-score
Open

fix(scheduler): treat zero TTFT/TPOT as uninitialized in least-latency plugin#1040
Sanchit2662 wants to merge 1 commit into
volcano-sh:mainfrom
Sanchit2662:fix/least-latency-cold-pod-score

Conversation

@Sanchit2662
Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

So while I was going through the router code to understand how scheduling works (for the benchmark project), I found a bug in the least-latency scoring plugin that's been there since the plugin was added.

Basically when a new pod starts up, its TTFT and TPOT fields are just 0.0 by default since nothing has been measured yet. The metrics scraper already knows this and skips zero values when updating PodInfo. But the scoring plugin doesn't know that - it treats 0 as a real latency value, which means the new pod looks like it has the lowest latency of anyone and gets a perfect score of 100.

So what ends up happening is during a rolling update or scale-out, the fresh cold pod gets all the traffic while the warm pods just sit there doing nothing. Which is exactly what the plugin is supposed to prevent. I noticed this specifically because I was looking at how the min/max normalization works and realized that zero being included in the min would skew every other score.

The fix is small - just change < 0 to <= 0 in calculateMinMaxMetrics so zero values are excluded from the min/max calculation, and in the scoring loop give uninitialized pods a neutral score of 50 instead of accidentally giving them 100. That way new pods still get some traffic through the other plugins but don't monopolize dispatch.

Does this PR introduce a user-facing change?:

Fixed least-latency scheduling plugin incorrectly routing all traffic to cold/new pods during rolling updates. New pods with no latency data yet now get a neutral score instead of the highest score.

…y plugin

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
Copilot AI review requested due to automatic review settings May 13, 2026 16:09
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes least-latency scheduler scoring to avoid treating default zero TTFT/TPOT values as “best latency,” which previously caused cold/new pods to receive disproportionately high scores during rollouts/scale-outs.

Changes:

  • Treat TTFT/TPOT values <= 0 as uninitialized/invalid and exclude them from min/max normalization.
  • Assign pods with uninitialized latency a neutral score (50) instead of allowing them to normalize to an outsized score.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +83 to +88
// Pods with no observed latency yet (zero default) are uninitialized — assign a
// neutral mid-range score so they receive some traffic via other plugins without
// monopolizing dispatch ahead of warm, measured replicas.
if ttft <= 0 || tpot <= 0 {
scoreResults[info] = int(MaxScore / 2)
continue
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the LeastLatency scheduler plugin to assign a neutral score to uninitialized pods (where TTFT or TPOT is zero) and adjusts the min/max metric calculation to exclude these values. The reviewer suggests implementing a more granular scoring approach that utilizes available metrics individually when only one is present, rather than treating pods as uninitialized if either metric is missing, which would improve ranking accuracy.

Comment on lines +86 to +89
if ttft <= 0 || tpot <= 0 {
scoreResults[info] = int(MaxScore / 2)
continue
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This logic treats a pod as fully uninitialized if either TTFT or TPOT is zero, assigning it a neutral score. This approach discards potentially useful information. For instance, a pod with a valid TTFT but no TPOT yet will be scored the same as a completely cold pod.

Consider a more granular scoring approach:

  1. If both metrics are missing, assign the neutral score.
  2. If only one metric is available, calculate the score based on that single metric.
  3. If both are available, use the weighted average.

This would provide a more accurate ranking by utilizing all available latency data. This would also require a corresponding change in calculateMinMaxMetrics to compute min/max for each metric independently.

Comment on lines +115 to 117
if ttft <= 0 || tpot <= 0 {
continue
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This condition skips the pod if either metric is uninitialized, which prevents the use of valid data from the other metric when calculating min/max ranges.

To support more granular scoring as suggested for the Score function, this function should be updated to calculate min/max for TTFT and TPOT independently, only considering values greater than zero for each metric's calculation.

// Skip pods with invalid values
if ttft < 0 || tpot < 0 {
// Skip pods with no observed data (zero is the uninitialized default, not a valid measurement)
if ttft <= 0 || tpot <= 0 {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edge case sanity check: if all pods are uninitialized, this function returns (0,0,0,0) for min/max — but that's fine because every pod hits the <= 0 guard in the scoring loop above and gets MaxScore/2. Just confirming the edge case was thought through.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah the edge case is fine - all uninitialized pods get 50 and distribute evenly, least-request takes over from there.

// Pods with no observed latency yet (zero default) are uninitialized — assign a
// neutral mid-range score so they receive some traffic via other plugins without
// monopolizing dispatch ahead of warm, measured replicas.
if ttft <= 0 || tpot <= 0 {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using || here means if TTFT has been observed but TPOT hasn't (or vice versa), we still treat the pod as fully uninitialized — is that intentional? Could see an argument for && if you want to score a partially-warmed pod on whatever metric is available. Not a blocker, just want to make sure the choice was deliberate.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah it's intentional. if either metric is missing, the normalized score would be skewed anyway since the min/max calculation also skips pods where either value is zero. mixing a real TTFT with a fake TPOT score felt worse than just giving it 50 and moving on.

@Sanchit2662 Sanchit2662 requested a review from hzxuzhonghu May 14, 2026 11:31
Copy link
Copy Markdown
Contributor

@FAUST-BENCHOU FAUST-BENCHOU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense to me.

Copy link
Copy Markdown
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we declare a const score when it is uninitiated, instead of skip in L115

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants