Grads become NaN #53
When training on my local machine (RTX 3090, 24 GB) with batch size 12, the gradients become NaN after a few steps. But I don't run into this when training on a Google Cloud A100 (40 GB) with batch size 20. Why does this happen, and how can I fix it?

If you aren't seeing NaNs with larger batch sizes, I would recommend keeping the effective batch size high (2048 if you want to mimic our experiment) and using gradient accumulation to reach it. This is mathematically equivalent to training on the large batch size at once, as long as the network does not have batch norm layers. Let me know if this works! Extra: you can set …
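A minimal sketch of the gradient-accumulation pattern suggested above, assuming a plain PyTorch training loop; the model, data, and optimizer here are toy placeholders, not this repo's actual code:

```python
import torch
import torch.nn as nn

# Toy stand-ins; the real repo's model, loss, and optimizer would go here.
model = nn.Linear(16, 4)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

micro_bs = 12        # what fits on a 24 GB card
target_bs = 2048     # effective batch size from the authors' setup
accum_steps = target_bs // micro_bs  # ~170 micro-batches per update

optimizer.zero_grad()
for step in range(accum_steps):
    x = torch.randn(micro_bs, 16)
    y = torch.randint(0, 4, (micro_bs,))
    # Scale each micro-batch loss so the accumulated gradient equals the
    # gradient of the mean loss over the full effective batch.
    loss = criterion(model(x), y) / accum_steps
    loss.backward()  # gradients sum into param.grad across iterations

optimizer.step()     # one weight update after accum_steps micro-batches
optimizer.zero_grad()
```

The key detail is dividing each micro-batch loss by the number of accumulation steps, so the summed gradients match the gradient of the mean loss over the full 2048-sample batch.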
Hello, did you run this on a 3090 (24 GB) device?

No, we used A100s (40 GB/80 GB) and H100s (80 GB).
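Separately, a common way to track down where NaN gradients first appear on the smaller card is PyTorch's anomaly detection combined with gradient clipping; this is a generic debugging sketch, not something from this repo:

```python
import torch
import torch.nn as nn

# Raises an error at the exact backward op that produced a NaN/Inf.
torch.autograd.set_detect_anomaly(True)

model = nn.Linear(16, 4)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(12, 16), torch.randint(0, 4, (12,))
loss = criterion(model(x), y)
loss.backward()

# Clip before stepping so a single exploding batch cannot poison the weights.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```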