
Grad becomes NaN #53

Open
tungdq212 opened this issue Jul 19, 2023 · 3 comments

Comments

@tungdq212

When training on my local machine (3090, 24GB) with batch size 12, the grad value becomes NaN after a few steps.
[screenshot of training logs]
But I don't see this when training on a Google Cloud A100 (40GB) with batch size 20. Why? How can I fix that?

@Landanjs
Contributor

If you aren't seeing NaNs with larger batch sizes, I would recommend keeping the batch size high (2048 if you want to mimic our experiment) and setting device_train_microbatch_size to the largest value that fits before an OOM, which in your case sounds like 12. device_train_microbatch_size controls gradient accumulation, where the number of accumulation steps equals batch_size // device_train_microbatch_size. Composer runs a forward and backward pass on each microbatch, accumulating gradients, and takes a single optimizer step once batch_size samples have been processed.
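For reference, here is a minimal sketch of how that would look with Composer's Trainer; `model`, `train_dataloader`, and `max_duration` are placeholders for your own setup:

```python
from composer import Trainer

# `model` (a ComposerModel) and `train_dataloader` are placeholders for
# your own setup; the dataloader yields the full batch size (e.g. 2048).
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='10ep',
    # the largest microbatch that fits on your GPU; Composer accumulates
    # gradients over batch_size // device_train_microbatch_size microbatches
    device_train_microbatch_size=12,
)
trainer.fit()
```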

This is mathematically equivalent to training on the full batch at once, as long as the network does not have batch norm layers (batch norm statistics are computed per microbatch, so they would differ). Let me know if this works!
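To see why, here is a minimal pure-PyTorch sketch of the idea (a hypothetical helper, not Composer's internals): scaling each microbatch's mean loss by its share of the batch and accumulating gradients reproduces the full-batch gradient exactly, batch norm aside.

```python
import torch

def accumulated_step(model, optimizer, loss_fn, xs, ys, microbatch_size):
    """One optimizer step over a full batch, processed in microbatches."""
    optimizer.zero_grad()
    n = xs.shape[0]
    for i in range(0, n, microbatch_size):
        x, y = xs[i:i + microbatch_size], ys[i:i + microbatch_size]
        # weight each (mean-reduced) microbatch loss by its fraction of
        # the batch so the accumulated gradient equals the full-batch one
        loss = loss_fn(model(x), y) * (x.shape[0] / n)
        loss.backward()  # .grad buffers accumulate across microbatches
    optimizer.step()  # single update once all microbatches are processed
```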

Extra: you can set device_train_microbatch_size to 'auto' and Composer will decrease the microbatch size until it fits into memory. This is an experimental feature, though, so it may not work out of the box for your use case.
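With the Trainer sketch above, that is a one-line change (again, an illustrative example rather than a guaranteed fix):

```python
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='10ep',
    device_train_microbatch_size='auto',  # experimental: shrinks the microbatch on CUDA OOM
)
```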

@YMKiii

YMKiii commented Aug 30, 2023

Hello, did you run this on a 3090 (24GB) device?

@Landanjs
Contributor

No, we used A100s (40GB/80GB) and H100s (80GB).
