I'm trying to fine-tune the 6.7B model on my own code dataset. I'm running multinode training with fp32 precision on NVIDIA Tesla V100 GPUs using DeepSpeed ZeRO Stage 3. My training loss fluctuates randomly and drops to zero; I've attached my training loss graph below:
I'm running this on 128 GPUs with a per-device train batch size of 1 and no gradient accumulation. I'm not sure what could be causing this, as I haven't seen it happen with other Llama-architecture models. I'd appreciate any general direction to help debug this. Thanks!
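For reference, here is a minimal sketch of the DeepSpeed config I'm describing. Only the batch size, precision, and ZeRO stage match what I stated above; every other field is a placeholder, not my actual config file.

```python
# Sketch of the DeepSpeed ZeRO Stage 3 setup described above
# (fp32, 128 GPUs, micro batch size 1, no gradient accumulation).
# Values not stated in the issue are assumptions.
import json

ds_config = {
    "train_batch_size": 128,                  # 128 GPUs x micro batch 1 x grad accum 1
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": False},               # training in full fp32 on V100s
    "bf16": {"enabled": False},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_clipping": 1.0,                 # assumed; not stated above
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```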