Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update to actor list limit to fix issue 803 #814

Open
wants to merge 12 commits into
base: dev
Choose a base branch
from
Original file line number Diff line number Diff line change
Expand Up @@ -113,7 +113,7 @@ def operator() -> ActorHandle:
actors = [operator() for _ in range(n_actors)]
for i in range(120):
time.sleep(1)
alive = list_actors(filters=[("class_name", "=", cls_name), ("state", "=", "ALIVE")])
alive = list_actors(filters=[("class_name", "=", cls_name), ("state", "=", "ALIVE")], limit=n_actors + 10)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is 10 the right number. what if a transform's ray runtime starts its own actors? Maybe this should be n_actors*2 or the above with the ability to override with a config value (env var?)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. The idea to use the number of actors to compute the limit is great, but l think it would be better to use

num_actors * 10

instead of

num_actors + 10

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the downside of setting the limit to 1E6

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gents, If we create 5 actors whats the reason asking for 100000? We are asking only for what we need and padding it with 10

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blublinsky 1) what is the downside of using 1E6 and 2) what if the transform creates its own actors? Please advise.

Copy link
Collaborator

@cmadam cmadam Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but I don't understand why we need to add 10 to the number of actors in the limit. Why not just set the limit to be equal to the number of actors, since this is what is tested in line 117.

For example, if there are 5 actors, why would we ever need a limit greater than 5? Can it happen that list_actors() will return 6 actors, if we only intended to create 5? And, in that case, do we need the run to fail because there are more actors available than what we requested to create?

And again, I am not clear why are we testing for exact equality in line 117. Would we be ok to proceed even if only 80% of the actors are alive? Or is there a reason why we need all the actors to be up when we start? I am thinking of a scenario when we request 10 worker groups, but only 9 are allocated, and one is pending due to lack of resources. Would it be ok to proceed with the execution in this case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in case, I am a bit concerned to use the same number

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing that I found today, max limit supported is 10000

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmadam, @daw3rd any progress on this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the number of created actors is greater than the limit, the function create_actors() should still return the actor handles

if len(actors) == len(alive):
return actors
# failed - raise an exception
Expand Down