Device mismatch during evaluation when training on mps #2385
Comments
I observed the same problem. My hack was to move the outputs back onto the MPS device (where the batch still lives) before updating the metrics.
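A minimal sketch of that kind of workaround (hypothetical helper, not the commenter's exact code), assuming a `torchmetrics` metric whose update should happen on the device the batch already lives on:

```python
import torch
import torchmetrics


def update_metric_on_batch_device(metric: torchmetrics.Metric,
                                  outputs: torch.Tensor,
                                  targets: torch.Tensor) -> None:
    """Hypothetical workaround: move the outputs (and the metric's state) back to
    the device the batch targets live on, e.g. 'mps', so the update sees one device."""
    metric.to(targets.device)
    metric.update(outputs.to(targets.device), targets)
```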
Thanks for the bug report @antoinebrl and @erosenthal-square. We'll take a look and try a fix.
Thanks @hanlint! @antoinebrl Yup, I did the same thing! Relatedly, I also had to convert …
@hanlint, any update on this issue? The transfer back to CPU is a major slowdown.
I believe this community PR addresses the issue -- sorry for the delay! With respect to skipping it, I think this is an issue with MPS reliability and unfortunately outside our control :( If it's avoidable, we can remove it. Unfortunately I don't have a great test bed to debug these numerical issues on Macs.
@hanlint I saw that in #1405 you reported an issue on M1 devices with torchmetrics. I would be interested to know if you have any snippets to share to reproduce the issue. While I made a fix for this issue, I feel like the code transporting metrics back to CPU in the case of an MPS device could be dropped. What @antoinebrl ended up doing is actually transporting the outputs back to the MPS device (since the batch stayed there), and it was working, so I want to confirm we can drop this change.
@hyenal I believe it was based on Lightning-AI/torchmetrics#1727. I just reran the examples provided and encountered the same error :( Given this, I assume most users prefer having correct results even if it is slower.
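Lacking the exact repro from that thread, one generic way to look for such discrepancies is to compare a metric computed on MPS against the same data on CPU (a hypothetical check, not the linked issue's code):

```python
import torch
import torchmetrics


def check_metric_parity(num_classes: int = 10, n: int = 256) -> None:
    """Compare a torchmetrics result computed on MPS against the same data on CPU."""
    preds = torch.randn(n, num_classes)
    target = torch.randint(0, num_classes, (n,))

    cpu_acc = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes)
    cpu_val = cpu_acc(preds, target)

    if torch.backends.mps.is_available():
        mps_acc = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes).to("mps")
        mps_val = mps_acc(preds.to("mps"), target.to("mps"))
        print("cpu:", cpu_val.item(), "mps:", mps_val.item())
```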
I ran into an issue trying to train flan-t5 on an M1 using torchmetrics. Training metrics worked fine, but I got a device-mismatch stacktrace when calculating evaluation metrics.
I believe the issue is related to the following snippet of code:
`composer/composer/trainer/trainer.py`, lines 2846 to 2858 at `ff59e86`
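The linked snippet isn't quoted in this thread, so the sketch below only illustrates the pattern described in the next paragraph (hypothetical names, not the verbatim Composer source):

```python
import torch


def _eval_metric_update_as_described(metric, outputs: torch.Tensor, batch_targets: torch.Tensor) -> None:
    # Outputs living on MPS are copied to CPU (reportedly to work around
    # torchmetrics numerical issues on that backend)...
    if outputs.device.type == "mps":
        outputs = outputs.cpu()
    # ...but the batch is left on its original device, so when training on MPS
    # this call mixes CPU outputs with MPS targets and fails with a device mismatch.
    metric.update(outputs, batch_targets)
```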
The `outputs` tensor is explicitly moved to `cpu` if it's on `mps`, but the `batch` tensor is not. Hence, you inevitably have a device mismatch when updating metrics. AFAICT, `outputs` are not explicitly moved to `cpu` when they're on `mps` when updating training metrics, which is why I only saw this bug during evaluation.

If there really are numerical errors with `torchmetrics` on `mps`, then training metrics probably ought to be calculated on `cpu` in order to bring parity to the training and eval calculations. Additionally, the `batch` tensor will need to be moved to `cpu`.
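A minimal sketch of that suggestion (assuming the torchmetrics-on-MPS problems are real; hypothetical helper, not Composer's actual API), moving both sides of the metric update to CPU so training and eval metrics are computed identically:

```python
import torch
import torchmetrics


def update_metric_on_cpu(metric: torchmetrics.Metric,
                         outputs: torch.Tensor,
                         targets: torch.Tensor) -> None:
    """Hypothetical helper: compute the metric update entirely on CPU,
    moving both the outputs and the batch targets off the MPS device."""
    metric.update(outputs.detach().cpu(), targets.detach().cpu())


if torch.backends.mps.is_available():
    acc = torchmetrics.Accuracy(task="multiclass", num_classes=10)  # metric state stays on CPU
    outputs = torch.randn(8, 10, device="mps")
    targets = torch.randint(0, 10, (8,), device="mps")
    # Moving only the outputs (as described above) reproduces the mismatch:
    # acc.update(outputs.cpu(), targets)  # raises a device-mismatch error
    update_metric_on_cpu(acc, outputs, targets)  # both moved: works
    print(acc.compute())
```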