
Training loss extremely noisy during fine-tuning and randomly goes to 0 #106

Open
zpx01 opened this issue Jan 26, 2024 · 2 comments

zpx01 commented Jan 26, 2024

I'm trying to fine-tune the 6.7B model on my own code dataset. I am running multinode training in fp32 precision on NVIDIA Tesla V100 GPUs with DeepSpeed ZeRO Stage 3. My training loss fluctuates wildly and randomly drops to zero; I've attached my training loss graph below:

[Screenshot: training loss graph]

I'm running this on 128 GPUs with a train batch size of 1 per device and no gradient accumulation. I'm not sure what the cause could be, as I haven't seen this happen with other Llama-architecture models. Would appreciate any general direction to help debug this, thanks!
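
For reference, a minimal sketch of the kind of DeepSpeed configuration being described (ZeRO Stage 3, fp32, per-device batch size 1, no gradient accumulation, 128 GPUs). The exact config used in this run is not included in the issue, so the values below are reconstructed from the description, and the `gradient_clipping` entry is purely illustrative:

```python
# Sketch of a DeepSpeed config matching the setup described above.
# Reconstructed from the issue text; not the author's actual config file.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # batch size 1 per device
    "gradient_accumulation_steps": 1,      # no gradient accumulation
    "train_batch_size": 128,               # 128 GPUs x 1 x 1
    "fp16": {"enabled": False},            # fp32: both half-precision modes disabled
    "bf16": {"enabled": False},
    "zero_optimization": {"stage": 3},     # ZeRO Stage 3 partitioning
    "gradient_clipping": 1.0,              # hypothetical value, not from the issue
}

# The config would then be handed to DeepSpeed along with the model, e.g.:
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config
# )
```

If the run goes through the Hugging Face Trainer instead, the same JSON would typically be supplied via the `--deepspeed` argument rather than `deepspeed.initialize`.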

zpx01 commented Feb 6, 2024

@DejianYang @pkuzqh Would appreciate any help on this ticket, thanks

Locher7 commented Dec 6, 2024

How did you solve it?
