Implicit distributed backend selection #516
base: main
Conversation
booxter
commented
Apr 30, 2025
- chore: bump pytorch to 2.6.0+
- feat: Rely on implicit detection of distributed backend
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
e2e workflow succeeded on this PR: View run, congrats!
@tiran any particular concerns with this bump of minimal pytorch to 2.6.0+ for the training library? (It's already 2.6.0+ in ilab, so I'd not expect any, but better double-check...)
As confirmed by Doug H, this won't change versions used downstream. ilab already pulls 2.6.0+ for all flavors.
This pull request has merge conflicts that must be resolved before it can be merged.
This is in line with the ilab repo. There are some features in later pytorch releases that we may want to have access to.

Signed-off-by: Ihar Hrachyshka <[email protected]>
From the official docs:

```
Since 2.6, if backend is not provided, c10d will use a backend registered for the device type indicated by the device_id kwarg (if provided).
```

and:

```
If neither backend nor device_id is provided, c10d will detect the accelerator on the run-time machine and use a backend registered for that detected accelerator (or cpu).
```

While the library is still CUDA-centric, this is one tiny step towards a more agnostic implementation.

Signed-off-by: Ihar Hrachyshka <[email protected]>
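To illustrate the behavior this commit relies on, here is a minimal sketch of initializing the process group without an explicit `backend` argument. This is not the PR's actual diff; the helper name `init_distributed` and the environment-variable handling are assumptions for the example, not code from the training library.

```python
import os

import torch
import torch.distributed as dist


def init_distributed() -> None:
    """Initialize torch.distributed, letting c10d pick the backend.

    With PyTorch >= 2.6, omitting `backend` makes c10d detect the
    accelerator on the run-time machine and use a backend registered
    for it (e.g. NCCL on CUDA hosts, Gloo on CPU-only hosts).
    """
    # torchrun exports LOCAL_RANK/RANK/WORLD_SIZE, which the default
    # "env://" init method reads.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
    # Note: no backend= argument here.
    dist.init_process_group()


if __name__ == "__main__":
    init_distributed()
    print(f"rank {dist.get_rank()} of {dist.get_world_size()}")
    dist.destroy_process_group()
```

Run under torchrun (for example, `torchrun --nproc-per-node=4 script.py`); the same call should also work on a CPU-only machine, where c10d falls back to a CPU backend such as Gloo.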