[CI][RayService] deflaky the TestAutoscalingRayService #3119
Why are these changes needed?
After digging into the log of the autoscaler container, I found the issue is a race between the autoscaler and the scheduler.
The log is too large, so I will use screenshots to explain the event sequence:
- L738 (`2025-02-26 00:12:01,077`): The autoscaler found a new resource demand `{'CPU': 0.5}`.
- L787 (`2025-02-26 00:12:01,079`): The autoscaler created a new node for the demand by patching `replicas` to 1. Note that the demand was still kept around.
- L1699 (`2025-02-26 00:12:22,067`): The new node was ready, but the demand hadn't been resolved by the scheduler yet.
- L1749 (`2025-02-26 00:12:22,070`): Just before the scheduler resolved the demand with the new node, the autoscaler treated the demand as a new one (it was not) and created another node by patching `replicas` to 2.
- L1788 (`2025-02-26 00:12:23,14`): The scheduler finally resolved the demand, but an extra node had already been created.
- L5264 (`2025-02-26 00:13:34,775`): More than a minute later, the extra node was drained by the autoscaler.

The full log: rayautoscalerlog.txt
I think the key point is that the autoscaler currently can't tell whether a resource demand is old or new. Fixing that may not be easy, and the fix would only apply to future Ray versions. For KubeRay now, I think we should just make the test tolerant of the transient extra worker pod by extending the timeout from TestTimeoutShort to TestTimeoutLong, which gives the autoscaler enough time to drain the extra node before the assertion gives up.
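For reference, here is a minimal, self-contained sketch of the kind of change this implies, assuming the Gomega-based style used by the KubeRay e2e tests. The timeout values, the `countWorkerPods` helper, and the test body are hypothetical stand-ins, not the literal diff:

```go
package e2e

import (
	"testing"
	"time"

	. "github.com/onsi/gomega"
)

// Illustrative timeouts; the real constants live in the KubeRay test support
// package and their values may differ.
const (
	TestTimeoutShort = 1 * time.Minute
	TestTimeoutLong  = 5 * time.Minute
)

// countWorkerPods is a stand-in for however the test actually lists worker
// pods; it is hypothetical and only here to make the sketch compile.
func countWorkerPods() (int, error) { return 1, nil }

func TestAutoscalingRayServiceSketch(t *testing.T) {
	g := NewWithT(t)

	// Before: TestTimeoutShort could expire while the transient extra worker
	// pod was still being drained. After: TestTimeoutLong gives the autoscaler
	// time to remove it, so the worker pod count eventually settles at 1.
	g.Eventually(func(gm Gomega) {
		n, err := countWorkerPods()
		gm.Expect(err).NotTo(HaveOccurred())
		gm.Expect(n).To(Equal(1))
	}, TestTimeoutLong).Should(Succeed())
}
```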
Related issue number
#2981
Checks