[CI][RayService] deflaky the TestAutoscalingRayService #3119
Why are these changes needed?
After digging into the log of the autoscaler container, I found the issue is a race between the autoscaler and the scheduler.
The log is too large, so I will use screenshots to explain the event sequence:
- L738 (`2025-02-26 00:12:01,077`): The autoscaler found a new resource demand `{'CPU': 0.5}`.
- L787 (`2025-02-26 00:12:01,079`): The autoscaler created a new node for the demand by patching `replicas` to 1. Note that the demand was still kept around.
- L1699 (`2025-02-26 00:12:22,067`): The new node was ready, but the demand hadn't been resolved by the scheduler yet.
- L1749 (`2025-02-26 00:12:22,070`): Just before the scheduler resolved the demand with the new node, the autoscaler treated the demand as a new one (it was not) and created another node by patching `replicas` to 2.
- L1788 (`2025-02-26 00:12:23,14`): The scheduler finally resolved the demand, but an extra node had already been created.
- L5264 (`2025-02-26 00:13:34,775`): More than a minute later, the extra node was drained by the autoscaler.

The full log: rayautoscalerlog.txt
I think the key point is that the autoscaler currently can't tell whether a resource demand is old or new. Fixing that may not be easy, and the fix would only apply to future Ray versions. For KubeRay now, I think we should just make the test tolerant of the transient extra worker pod by extending the timeout from TestTimeoutShort to TestTimeoutLong, which gives the autoscaler enough time to drain the extra node before the assertion gives up.
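For reference, here is a minimal, self-contained sketch of the kind of change this implies, assuming the Gomega-based style used by the KubeRay e2e tests. The timeout values, the `countWorkerPods` helper, and the test body are hypothetical stand-ins, not the literal diff:

```go
package e2e

import (
	"testing"
	"time"

	. "github.com/onsi/gomega"
)

// Illustrative timeouts; the real constants live in the KubeRay test support
// package and their values may differ.
const (
	TestTimeoutShort = 1 * time.Minute
	TestTimeoutLong  = 5 * time.Minute
)

// countWorkerPods is a stand-in for however the test actually lists worker
// pods; it is hypothetical and only here to make the sketch compile.
func countWorkerPods() (int, error) { return 1, nil }

func TestAutoscalingRayServiceSketch(t *testing.T) {
	g := NewWithT(t)

	// Before: TestTimeoutShort could expire while the transient extra worker
	// pod was still being drained. After: TestTimeoutLong gives the autoscaler
	// time to remove it, so the worker pod count eventually settles at 1.
	g.Eventually(func(gm Gomega) {
		n, err := countWorkerPods()
		gm.Expect(err).NotTo(HaveOccurred())
		gm.Expect(n).To(Equal(1))
	}, TestTimeoutLong).Should(Succeed())
}
```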
Related issue number
#2981
Checks