[Bug] Only up to 100 actors can be used #858

Open
2 tasks done
dtsuzuku-ibm opened this issue Dec 5, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@dtsuzuku-ibm
Collaborator

Search before asking

  • I searched the issues and found no similar issues.

Component

Library/core

What happened + What you expected to happen

When num_workers is set to a value greater than 100, the Ray job fails with the following exception (found by @shivdeep-singh-ibm):

data_processing.utils.unrecoverable.UnrecoverableException: out of 200 created actors only 100 alive

It seems that listing actors returns at most 100 entries by default, so the alive count can never exceed 100:
https://github.com/ray-project/ray/blob/1b13782bc7702fbd7af2c89aff293acc4ff49727/python/ray/util/state/api.py#L784
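As a rough illustration of a possible workaround, the sketch below passes an explicit `limit` to `ray.util.state.list_actors` so the listing covers every actor that was requested. The helper names `listing_limit` and `count_alive_actors` are hypothetical and not part of the library, and this is not necessarily how the fix in the linked PR works.

```python
def listing_limit(expected_actors: int, default_limit: int = 100) -> int:
    """Return a limit large enough to cover every actor we expect to see."""
    # Never go below the default page size; grow it when more actors are requested.
    return max(expected_actors, default_limit)


def count_alive_actors(class_name: str, expected_actors: int) -> int:
    """Count ALIVE actors of one class, asking for more than the default page of results."""
    # Imported lazily so this module can be loaded where Ray is not installed.
    from ray.util.state import list_actors

    alive = list_actors(
        filters=[("class_name", "=", class_name), ("state", "=", "ALIVE")],
        limit=listing_limit(expected_actors),  # assumption: 200 requested actors -> limit=200
    )
    return len(alive)
```

With `n_actors=200`, `listing_limit(200)` yields 200, so the listing would no longer be truncated at 100 entries.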

Reproduction script

import ray
from data_processing_ray.runtime.ray import RayUtils


class Dummy:
    def __init__(self, message):
        print("Created Actor: ", message)


@ray.remote(scheduling_strategy="SPREAD")
class DummyActor(Dummy):
    def __init__(self, params: dict):
        super().__init__(params["message"])


params = {"message": "hey ya!!"}

# Request more actors than the default listing limit of 100.
processors = RayUtils.create_actors(
    clazz=DummyActor,
    params=params,
    actor_options={"num_cpus": 1},
    n_actors=200,
)

Anything else

No response

OS

Red Hat Enterprise Linux (RHEL)

Python

3.11.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@dtsuzuku-ibm
Collaborator Author

Fixing in: #839

@shivdeep-singh-ibm
Collaborator

I have also observed a crash like this:

    processors = RayUtils.create_actors(
  File "/home/ray/data-processing-lib-ray/src/data_processing_ray/runtime/ray/ray_utils.py", line 121, in create_actors
    raise UnrecoverableException(f"out of {len(actors)} created actors only {len(alive)} alive")


data_processing.utils.unrecoverable.UnrecoverableException: out of 2 created actors only 7 alive

Here we requested 2 actors but 7 are reported alive, so our job crashes.

The check

if len(actors) == len(alive):
    return actors

expects them to be equal. Is this a problem?
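One way to make such a check tolerant of unrelated actors is to compare actor IDs instead of counts. The sketch below is only an illustration of that idea: `ActorRecord` is a hypothetical stand-in for whatever the actor listing returns (assumed to expose `actor_id` and `state`), and `all_created_alive` is not a function from the library.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ActorRecord:
    """Hypothetical stand-in for one entry returned by an actor listing."""
    actor_id: str
    state: str


def all_created_alive(created_ids: Iterable[str], listed: Iterable[ActorRecord]) -> bool:
    """Return True when every actor we created appears as ALIVE in the listing."""
    alive_ids = {record.actor_id for record in listed if record.state == "ALIVE"}
    # Extra alive actors (for example, left over from another job) are ignored.
    return set(created_ids) <= alive_ids


# Usage: 2 actors were created, but the listing reports 7 alive ones.
listing = [ActorRecord(actor_id=f"id-{i}", state="ALIVE") for i in range(7)]
ready = all_created_alive(["id-0", "id-1"], listing)
```

Under this scheme the "2 created, 7 alive" situation would pass as long as both created actors are among the alive ones, whereas a strict equality test on the counts rejects it.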
