Address pytorch versioning issues. #820
base: main
Many new features of physicsnemo's distributed utilities, targeting domain parallelism, require pytorch's DTensor package, which was introduced in pytorch 2.6.0. But we don't want to limit physicsnemo usage unnecessarily. This commit introduces version checking utilities, which are then applied to ShardTensor. If torch is below 2.6.0, the distributed utilities will not import ShardTensor but will still work. If a user attempts to import ShardTensor directly, bypassing the __init__.py file, the version checking utilities will raise an exception. Tests on ShardTensor are likewise skipped if torch 2.6.0 or later is not installed. Finally, an additional test file is included to validate the version checking tools.
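For readers following along, here is a minimal sketch of what a version check of this kind could look like. It is not the code in this PR: the names `release_tuple`, `check_min_version`, and `require_version`, and the `installed` override used for testing, are hypothetical. It also assumes that pre-release and local suffixes (such as `a0` or `+cu121`) can simply be ignored when comparing.

```python
import re
from importlib import metadata
from typing import Optional


def release_tuple(version: str) -> tuple:
    """Leading numeric release of a version string, padded to three parts.

    '2.6.0a0+git1' -> (2, 6, 0); '2.6' -> (2, 6, 0).
    Pre-release and local suffixes are dropped (an assumption of this sketch).
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version string: {version!r}")
    parts = [int(part) for part in match.group(0).split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_min_version(package: str, min_version: str,
                      installed: Optional[str] = None) -> bool:
    """Return True if `package` is available at `min_version` or newer.

    `installed` is a hypothetical override so the check can be exercised
    without the package being present.
    """
    if installed is None:
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            return False
    return release_tuple(installed) >= release_tuple(min_version)


def require_version(package: str, min_version: str,
                    installed: Optional[str] = None) -> None:
    """Raise ImportError if the requirement is not met."""
    if not check_min_version(package, min_version, installed):
        raise ImportError(f"{package} >= {min_version} is required")
```

Under these assumptions, a package `__init__.py` could guard the optional import with `if check_min_version("torch", "2.6.0"):`, while the module that defines the version-dependent class could call `require_version("torch", "2.6.0")` at the top so that a direct import fails with a clear message.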
@NickGeneva can you check if this solves your earth_2 issues? One item that may be outstanding is that DeviceMesh, which is now used in DistributedManager, was introduced in pytorch 2.2.0. I suspect that is ~OK, and if not we could bump the minimum pytorch version of physicsnemo to 2.2 (not all the way to 2.6, as needed for ShardTensor). In my local testing nearly all tests passed, but there was a crash in the one test where the torch.distributed.init is called twice. I believe that's a pytorch bug, but I want to see what the CI does with it.
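To make the two thresholds in this comment concrete, here is a small illustrative sketch. The table `FEATURE_MIN_TORCH` and the function `available_features` are hypothetical names, and the version numbers are only the ones quoted in this comment (2.2 for DeviceMesh, 2.6 for ShardTensor). It assumes a torch-style version string whose first two dot-separated fields are plain integers.

```python
# Hypothetical per-feature minimum torch versions, as (major, minor),
# using the numbers mentioned in the comment above.
FEATURE_MIN_TORCH = {
    "DeviceMesh": (2, 2),
    "ShardTensor": (2, 6),
}


def available_features(torch_version: str) -> list:
    """Return the features whose minimum (major, minor) is satisfied.

    Only the first two fields of the version string are compared, so
    suffixes such as '+cu121' in the third field are ignored.
    """
    major_minor = tuple(int(part) for part in torch_version.split(".")[:2])
    return [
        name
        for name, minimum in FEATURE_MIN_TORCH.items()
        if major_minor >= minimum
    ]


print(available_features("2.4.1+cu121"))
```

The point of keeping the thresholds separate is that an older torch can still use the parts of the distributed utilities that only need DeviceMesh, without being forced up to the ShardTensor requirement.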
/blossom-ci
- change shard tensor minimum version to 2.5.9 to accommodate alpha release of 2.6.0a
- set minimum pytorch version for DeviceMesh to 2.4.0
- introduce function decorator that raises an exception when unavailable functions are called.
- adds a little more protection in the tests to differentiate,
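As a rough illustration of the decorator idea in this commit message, the sketch below wraps a function so that calling it raises an exception when the version requirement is not met. This is not the decorator added in the PR: `require_min_version`, its `installed` override, and `create_device_mesh` are hypothetical names, and the sketch assumes only the leading numeric part of the version matters.

```python
import functools
import re
from importlib import metadata
from typing import Optional


def _release(version: str) -> tuple:
    """Leading numeric release, padded to three parts: '2.4' -> (2, 4, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version string: {version!r}")
    parts = [int(part) for part in match.group(0).split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def require_min_version(package: str, min_version: str,
                        installed: Optional[str] = None):
    """Decorator factory: the check runs when the wrapped function is called.

    `installed` is a hypothetical override for testing without the package.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            found = installed
            if found is None:
                try:
                    found = metadata.version(package)
                except metadata.PackageNotFoundError:
                    found = None
            if found is None or _release(found) < _release(min_version):
                raise ImportError(
                    f"{func.__name__} requires {package} >= {min_version}, "
                    f"found {found}"
                )
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical usage: the function can be defined on any torch version,
# but calling it fails unless the requirement is met.
@require_min_version("torch", "2.4.0", installed="2.1.0")
def create_device_mesh():
    return "mesh"
```

Deferring the check to call time means the module still imports on older versions, and only code paths that actually need the newer feature fail.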
I've updated to include multiple levels of checking:
The DistributedManager API is unchanged, but several functions are now wrapped in a decorator that raises an exception when an unavailable function is called.
Testing on ORD, I have the following results for the following containers from NGC:
I'll let the CI test 2.6.0a.
/blossom-ci
PhysicsNeMo Pull Request
Closes #815