-
Notifications
You must be signed in to change notification settings - Fork 655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#4479: Prevent users from using a local filepath when performing a distributed write #4484
base: master
Are you sure you want to change the base?
Conversation
…n performing a distributed write Signed-off-by: Rehan Durrani <[email protected]>
Signed-off-by: Rehan Durrani <[email protected]>
Signed-off-by: Rehan Durrani <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #4484 +/- ##
==========================================
- Coverage 86.82% 78.21% -8.62%
==========================================
Files 222 230 +8
Lines 18038 23692 +5654
==========================================
+ Hits 15662 18530 +2868
- Misses 2376 5162 +2786
📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RehanSD thanks for the change. I have some style suggestions.
modin/core/execution/ray/implementations/pandas_on_ray/io/io.py
Outdated
Show resolved
Hide resolved
modin/core/execution/ray/implementations/pandas_on_ray/io/io.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this have to be added for to_sql
as well?
@@ -165,6 +166,13 @@ def to_csv(cls, qc, **kwargs): | |||
if not cls._to_csv_check_support(kwargs): | |||
return RayIO.to_csv(qc, **kwargs) | |||
|
|||
if len(ray.nodes()) > 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be moved to _to_csv_check_support
, ditto for parquet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, but _to_csv_check_support
currently just checks if we can do a parallel write, so it would be a bit of a paradigm shift to raise an error here, unless you're suggesting we return false and default to a serial write on the head node if a local path is passed?
if S3_ADDRESS_REGEX.match(path_or_buf) is not None or "://" in path_or_buf: | ||
return False # S3 or network path. | ||
if isinstance(path_or_buf, str) or isinstance(path_or_buf, pathlib.PurePath): | ||
return os.path.exists(path_or_buf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code as is will not warn users for the common case where they try to write to a non-existent path, i.e. when they want to create and write to a new file. We should warn even if the path doesn't exist locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to check device_id
instead! I originally did this because apparently os.path.exists
returns false for networked file paths. I believe it should now work for both non-existing paths and networked paths.
Co-authored-by: Yaroslav Igoshev <[email protected]>
Co-authored-by: Mahesh Vashishtha <[email protected]>
Co-authored-by: Mahesh Vashishtha <[email protected]>
Co-authored-by: Mahesh Vashishtha <[email protected]>
We don't need to do this for |
Signed-off-by: Rehan Durrani <[email protected]>
…tent local paths Signed-off-by: Rehan Durrani <[email protected]>
This pull request introduces 1 alert when merging dcadcf4 into 60518ca - view on LGTM.com new alerts:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RehanSD thanks for the changes. I have two minor suggestions. Do we have ray cluster unit tests to which we can add a case for this? Otherwise, have you tested this manually?
Co-authored-by: Mahesh Vashishtha <[email protected]>
Co-authored-by: Mahesh Vashishtha <[email protected]>
Signed-off-by: Rehan Durrani <[email protected]>
This pull request introduces 1 alert when merging 0608f9e into 958c26a - view on LGTM.com new alerts:
|
modin/core/execution/ray/implementations/pandas_on_ray/io/io.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Rehan Durrani <[email protected]>
Signed-off-by: Rehan Durrani <[email protected]>
Signed-off-by: Rehan Durrani <[email protected]>
Signed-off-by: Rehan Durrani <[email protected]>
Signed-off-by: Rehan Durrani [email protected]
What do these changes do?
Prevent users from specifying a local file path on a distributed write.
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst
is up-to-date