-
Notifications
You must be signed in to change notification settings - Fork 655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
read_csv
fails with multi-node Ray Client
#3179
Comments
@devin-petersohn for some context, this is mostly a problem when using Modin with distributed Ray, where we have decent support for uploading files to the cluster, but their absolute path will be different (locally, the file would be under |
Thanks @amogkam and @wuisawesome for the context! You are correct that the current readers assume that the path provided is accessible by all workers with the provided path. Modin doesn't use the relative path, which would work. It is happening here:
Originally this was added because users can do an Edit: #882 was the original issue for the |
Oh interesting... One possibility here could be to make paths relative to some sort of ray base directory instead of relative to CWD. I'm not 100% sure how difficult this would be for either Modin or Ray though... Let me ask around on the Ray side. |
@devin-petersohn Will it work on Modin's side if we provide an API |
nit: maybe it should be a |
Another idea suggested by @amogkam that we're thinking about is to provide an API of the form |
I think this should be fine. But could we rename the function to be |
From the Modin side, we try to keep the parsers engine agnostic. Right now there is no Ray specific code in parsing the file. I would prefer not to convert the path on the worker side. One option is to do it right before task submission in the modin/modin/engines/ray/task_wrapper.py Lines 46 to 65 in 89f6cdd
This is not only going to be a problem for CSV, so with a small refactor it might be possible to do here for every format. I would prefer to just pass the path to a ray utility and have ray tell me the correct path for the workers, whether it's running in Client mode or not. |
that would look something like
and Ray sets that env var correctly on all modes? In the driver it would be I don't know how you would implement this at the task_wrapper level, since it's such a general function, but I'm wondering if we could do it behind the engine abstraction? For example |
There are two issues here: Solving both is only possible if: I don't think 2 or 3 will be good from a maintainability standpoint. |
This issue is closely related to #4479 - except it seems that maybe we need to add a feature to support reading from a file when Ray has already distributed it? @devin-petersohn do we plan to add support for this feature? |
System information
modin.__version__
): 0.10I create a sample
test.csv
file in the current working directory. Then I start a remote Ray cluster.I then run this script from my laptop
And it fails with
Describe the problem
It seems like Modin is expecting the csv file to exist both on the client and the server with the same path, which isn't the case, and is thus failing.
Possible solutions to this:
The text was updated successfully, but these errors were encountered: