-
Notifications
You must be signed in to change notification settings - Fork 3.1k
fix: Use shutil.copy fallback to handle file metadata permission errors #14639
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: vipnydav <[email protected]>
Signed-off-by: vipnydav <[email protected]>
Signed-off-by: vipnydav <[email protected]>
Signed-off-by: vipnydav <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thank you!
Hi @vipnydav. Could you please fix style with pylint & flake for the file you changed? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dimapihtar id like to ask you to move the new function outside nemo.lightning in a separate pr unless the PR author is willing to do it themselves.
I'm not blocking the PR, but this is a breaking change as it is currently setup.
Fyi @ericharper
Signed-off-by: vipnydav <[email protected]>
Signed-off-by: vipnydav <[email protected]>
Hi @dimapihtar, to run the lint test and ci cd tests I need the approval and it has been waiting for more than 2 days as of now. Also, do I need to add docstring to each and every methods and classes in the file that I modified even if someone else added them? |
Signed-off-by: vipnydav <[email protected]>
Hi @dimapihtar! I have fixed the style with pylint and flake. I had to add docstring to all the classes and functions in Could you please review the PR? |
What does this PR do ?
This PR addresses an
Operation not permitted
error that occurs when saving NeMo models or checkpoints to certain filesystems where file metadata permissions are restricted. This issue has been observed in environments where the user running the application does not have ownership of the destination file, such as a GCSFuse mount created without user-specific ownership flags.The problem is caused by
shutil.copy2
, which fails if it cannot set the destination file's access and modification times due to a kernel-level permission check.This PR introduces a new helper function,
robust_copy
, which attempts to copy files with metadata usingshutil.copy2
but gracefully falls back toshutil.copy
upon aPermissionError
. This change makes file operations more resilient and prevents crashes during the checkpointing process on such filesystems.Collection: This change affects
nemo_core
and model saving operations across various collections (e.g., ASR).Changelog
nemo/lightning/io/artifact/file.py
: Introduced arobust_copy
function that wrapsshutil.copy2
in atry...except
block, falling back toshutil.copy
onPermissionError
.nemo/lightning/io/artifact/file.py
: Updated the existingcopy_file
function to use the newrobust_copy
helper.nemo/core/connectors/save_restore_connector.py
: Replaced directshutil.copy2
calls withrobust_copy
to handle artifact copying during model saving and restoration.nemo/collections/asr/parts/mixins/mixins.py
: Replaced ashutil.copy2
call withrobust_copy
for saving tokenizer files.tests/lightning/_io/artifacts/test_file.py
: Added new unit tests to verify the correct behavior ofrobust_copy
, ensuring both the success and fallback code paths are covered.Usage
This is an internal fix to improve filesystem compatibility. There are no changes to user-facing APIs. Model saving will now work seamlessly on filesystems with restrictive metadata permissions.
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information
This change was made to address an issue identified by Google Cloud customers using GCSFuse, where the
shutil.copy
function would fail due to file ownership and metadata permission restrictions.Following a discussion between Google and NVIDIA, where these findings were shared, the NVIDIA team (specifically
Yu Ding
) suggested implementing a fix that attemptsshutil.copy2
and falls back toshutil.copy
using atry/except
block. This pull request implements that recommendation by introducing a centralizedrobust_copy
helper function and applying it to the relevant code paths.