Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI][RayService] deflaky the TestAutoscalingRayService #3119

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

rueian
Copy link
Contributor

@rueian rueian commented Feb 26, 2025

Why are these changes needed?

After digging into the log of the autoscaler container, I found the issue was a race between the autoscaler and scheduler:

The log is too big, I will use screenshots to explain the event sequence:

  1. L738 2025-02-26 00:12:01,077: The autoscaler found a new resource demand {'CPU': 0.5}.
Image
  1. L787 2025-02-26 00:12:01,079: The autoscaler created a new node for the demand by pathing the replicas to 1. Note that the demand was still kept around.
Image
  1. L1699 2025-02-26 00:12:22.067: The new node was ready, but the demand hadn't been resolved by the scheduler.
Image
  1. L1749 2025-02-26 00:12:22,070: Just before the scheduler resolved the demand with the new node, the autoscaler thought the demand was a new demand (it was not), therefore it created another new node by pathing the replicas to 2.
Image
  1. L1788 2025-02-26 00:12:23.14: The demand had been resolved by the scheduler finally but an extra new node had been created.
Image
  1. L5264 2025-02-26 00:13:34,775: 1 and more minute later, the extra node got drained by autoscaler:
Image

The full log: rayautoscalerlog.txt

I think the key point is that currently, the autoscaler can't tell whether a resource demand is old or new. Fixing that may not be easy and the fix can only work for future Ray versions. For KubeRay now, I think we just make the test tolerant to multiple worker pods by extending to timeout from TestTimeoutShort to TestTimeoutLong.

Related issue number

#2981

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@rueian rueian force-pushed the deflaky-rayservice-autoscaler-test branch from edadaa8 to 9ab4725 Compare February 27, 2025 02:20
@rueian rueian marked this pull request as ready for review February 27, 2025 03:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant