I'm trying to fine-tune the 6.7B model on my own code dataset. I'm running multinode training with fp32 precision on NVIDIA Tesla V100 GPUs using DeepSpeed ZeRO Stage 3. My training loss fluctuates randomly and drops to zero; I've attached my training loss graph below:
I'm running this on 128 GPUs with a per-device train batch size of 1 and no gradient accumulation. I'm not sure what could be causing this, as I haven't seen it happen with other Llama-architecture models. I'd appreciate any general direction to help debug this. Thanks!
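For reference, here is a minimal sketch of the DeepSpeed config I'm describing. Only the batch size, precision, and ZeRO stage match what I stated above; every other field is a placeholder, not my actual config file.

```python
# Sketch of the DeepSpeed ZeRO Stage 3 setup described above
# (fp32, 128 GPUs, micro batch size 1, no gradient accumulation).
# Values not stated in the issue are assumptions.
import json

ds_config = {
    "train_batch_size": 128,                  # 128 GPUs x micro batch 1 x grad accum 1
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": False},               # training in full fp32 on V100s
    "bf16": {"enabled": False},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_clipping": 1.0,                 # assumed; not stated above
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```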