enable device index check for all device types #126767


Closed

Conversation

garfield1997
Contributor

enable device index check for all device types for grad setter.
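To make the change concrete, here is a minimal pure-Python sketch of the logic this PR generalizes. The names `Device` and `check_grad_device` are illustrative only, not the actual C++ implementation: previously the grad setter compared device indices only for CUDA tensors, and with this change the comparison applies to every device type.

```python
# Illustrative sketch only -- not the actual PyTorch C++ implementation.
# Old behavior: the grad setter checked the device index only for CUDA.
# This PR: the index check applies to all device types.

from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    type: str        # e.g. "cuda", "cpu", or a custom backend name
    index: int = 0   # backends with a single device report index 0


def check_grad_device(param: Device, grad: Device, cuda_only: bool = False) -> None:
    """Raise RuntimeError if grad's device does not match param's device.

    cuda_only=True models the old behavior (index compared only for CUDA);
    cuda_only=False models this PR (index compared for every device type).
    """
    if param.type != grad.type:
        raise RuntimeError(f"grad device type {grad.type} != {param.type}")
    if cuda_only and param.type != "cuda":
        return  # old behavior: index check skipped for non-CUDA devices
    if param.index != grad.index:
        raise RuntimeError(f"grad device index {grad.index} != {param.index}")
```

Under the old behavior a custom backend could silently accept a grad on the wrong device index; the generalized check makes that raise, matching the existing CUDA behavior.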


pytorch-bot bot commented May 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126767

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit 7e8d1d8 with merge base 00f675b:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Contributor

@janeyx99 janeyx99 left a comment


Looks fine to me, needs test though

@janeyx99 janeyx99 added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) May 21, 2024
@albanD
Collaborator

albanD commented May 21, 2024

The blast radius is pretty large, but that sounds OK to me, since any device with a single piece of hardware should just return index 0 all the time. I'll let @soulitzer give the final stamp if he's happy with it.

Also agreed that we need a test to make sure it works fine.
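The "single hardware device returns 0" point can be made concrete with a small sketch (pure Python, hypothetical helper names): a device constructed without an explicit index, such as `torch.device("cpu")`, carries index `None`, and normalizing an unset index to 0 means a single-device backend can never spuriously trip the index comparison.

```python
# Hypothetical sketch, not PyTorch internals: why a broader index check is
# safe for single-device backends. An unset device index is treated as 0,
# so a backend with one piece of hardware always compares equal to itself.

from typing import Optional


def normalize_index(index: Optional[int]) -> int:
    # Treat an unset index (None) as device 0 for comparison purposes.
    return 0 if index is None else index


def indices_match(a: Optional[int], b: Optional[int]) -> bool:
    return normalize_index(a) == normalize_index(b)
```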

@garfield1997
Contributor Author

I believe there are already some tests in place. https://github.com/pytorch/pytorch/blob/main/test/test_autograd.py#L11299

```python
# Tests cross-device assignment raises
if len(devices) > 1:
    x = torch.randn(5, 5, device=devices[0])
    with self.assertRaises(RuntimeError):
        x.grad = torch.randn(5, 5, device=devices[1])
```

Do we need additional testing? @albanD @janeyx99

@albanD
Collaborator

albanD commented May 22, 2024

Do we run this test on non-CUDA devices? I would expect that it was failing before and is fixed by this PR for non-CUDA devices.

@garfield1997
Contributor Author

I am adding a custom backend for PyTorch. This test case used to fail and will be fixed by this PR. @albanD

@garfield1997
Contributor Author

If you have the time, could you give me some feedback on this PR? @albanD @janeyx99

@garfield1997 garfield1997 force-pushed the remove_grad_cuda_check branch from 2fb3012 to 7d3bfe4 Compare May 30, 2024 01:18
Contributor

@soulitzer soulitzer left a comment


Sounds reasonable to me

Though also on testing: @albanD, do you know what our general testing strategy is for non-CUDA devices, for things like this?

@albanD
Collaborator

albanD commented May 30, 2024

The best Plan On Record for testing these is what is described in this RFC: pytorch/rfcs#64, but it is not done yet, I'm afraid.

Collaborator

@albanD albanD left a comment


Sounds good then!

@albanD
Collaborator

albanD commented May 30, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 30, 2024
@pytorchmergebot
Collaborator

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team (raised by workflow job)

@albanD albanD added the topic: not user facing topic category label May 30, 2024
@albanD
Collaborator

albanD commented May 30, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-focal-cuda12.1-py3.10-gcc9 / test (nogpu_NO_AVX2, 1, 1, linux.2xlarge)

Details for Dev Infra team (raised by workflow job)

@garfield1997
Contributor Author

The CI failure seems unrelated to the changes in the PR, and I am unable to reproduce it locally.

@garfield1997 garfield1997 force-pushed the remove_grad_cuda_check branch from 7d3bfe4 to ba8175f Compare June 3, 2024 01:36
@FFFrog
Collaborator

FFFrog commented Jun 3, 2024

> The best Plan On Record for testing these is what is described in this RFC: pytorch/rfcs#64, but it is not done yet, I'm afraid.

cc @albanD @soulitzer @garfield1997

pytorch/rfcs#64 is currently in the design and development stage. We will do our best to speed up development and open-source it as soon as possible.

As stated in the RFC, we plan to abstract and standardize the third-party device integration mechanism, which in theory can provide the following benefits:

  1. Reusable code abstractions to accelerate the integration of new backends
  2. Standardized integration to improve the quality of third-party device support
  3. A streamlined backend and unified test coverage to ensure the availability of third-party device integration mechanisms
  4. End-to-end documentation of the full process, plus an official demo

@garfield1997
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@pytorchmergebot
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@garfield1997
Contributor Author

Could you please help trigger the merge again? @albanD

@garfield1997
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased remove_grad_cuda_check onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout remove_grad_cuda_check && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the remove_grad_cuda_check branch from 4ce3403 to 7e8d1d8 Compare June 26, 2024 07:49
@garfield1997
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


Labels: ciflow/trunk (Trigger trunk jobs on your pull request), Merged, open source, topic: not user facing (topic category), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

7 participants