Grads become NaN #53
When training on my local machine (RTX 3090, 24 GB) with batch size 12, the gradients become NaN after a few steps. But I don't run into this when training on a Google Cloud A100 (40 GB) with batch size 20. Why does this happen, and how can I fix it?

If you aren't seeing NaNs with larger batch sizes, I would recommend keeping the effective batch size high (2048 if you want to mimic our experiment) and using gradient accumulation to reach it. This is mathematically equivalent to training on the large batch size at once, as long as the network does not have batch norm layers. Let me know if this works! Extra: you can set …
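A minimal sketch of the gradient-accumulation pattern suggested above, assuming a plain PyTorch training loop; the model, data, and optimizer here are toy placeholders, not this repo's actual code:

```python
import torch
import torch.nn as nn

# Toy stand-ins; the real repo's model, loss, and optimizer would go here.
model = nn.Linear(16, 4)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

micro_bs = 12        # what fits on a 24 GB card
target_bs = 2048     # effective batch size from the authors' setup
accum_steps = target_bs // micro_bs  # ~170 micro-batches per update

optimizer.zero_grad()
for step in range(accum_steps):
    x = torch.randn(micro_bs, 16)
    y = torch.randint(0, 4, (micro_bs,))
    # Scale each micro-batch loss so the accumulated gradient equals the
    # gradient of the mean loss over the full effective batch.
    loss = criterion(model(x), y) / accum_steps
    loss.backward()  # gradients sum into param.grad across iterations

optimizer.step()     # one weight update after accum_steps micro-batches
optimizer.zero_grad()
```

The key detail is dividing each micro-batch loss by the number of accumulation steps, so the summed gradients match the gradient of the mean loss over the full 2048-sample batch.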
Hello, did you run this on a 3090 (24 GB) device?

No, we used A100s (40 GB/80 GB) and H100s (80 GB).
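Separately, a common way to track down where NaN gradients first appear on the smaller card is PyTorch's anomaly detection combined with gradient clipping; this is a generic debugging sketch, not something from this repo:

```python
import torch
import torch.nn as nn

# Raises an error at the exact backward op that produced a NaN/Inf.
torch.autograd.set_detect_anomaly(True)

model = nn.Linear(16, 4)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(12, 16), torch.randint(0, 4, (12,))
loss = criterion(model(x), y)
loss.backward()

# Clip before stepping so a single exploding batch cannot poison the weights.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```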