[BUG] [ROCm] Fine-tuning DeepSeek-Coder-V2-Lite-Instruct with 8 MI300X GPUs results in c10::DistBackendError #6725
Comments
Attachments (first comment):
- finetune_deepseek_ds.py
- dp.yaml
- ds_config2_zero3.json
On Nov 8, 2024, nikhil-tensorwave changed the title from "[BUG] Fine-tuning DeepSeek-Coder-V2-Lite-Instruct with 8 MI300X GPUs results in c10::DistBackendError" to "[BUG] [ROCm] Fine-tuning DeepSeek-Coder-V2-Lite-Instruct with 8 MI300X GPUs results in c10::DistBackendError".
Thanks @nikhil-tensorwave. Tagging @rraminen and @jithunnair-amd as well for help on the AMD side.
Describe the bug
I am trying to fine-tune DeepSeek-Coder-V2-Lite-Instruct (16B) on a system with 8 MI300X GPUs. Running on any number of GPUs less than 8 works as expected and runs to completion. When running on 8 GPUs, the training starts, hangs, and then outputs one of two errors. One error is:
where the GPU node named in the error differs from run to run.
The second error (truncated) is:
To Reproduce
Run command:
Training script and config files will be in the first comment.
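For orientation, a DeepSpeed launch for this setup would look roughly like the sketch below. The script and config file names come from the attachments above; the `--deepspeed` and `--model_name_or_path` flags are assumed, HF-Trainer-style script arguments and may not match the reporter's actual command.

```bash
# Rough sketch only, not the reporter's exact command.
# DeepSpeed's launcher spreads the run across the 8 local MI300X GPUs;
# --deepspeed and --model_name_or_path are assumed script arguments and
# may not match finetune_deepseek_ds.py's real interface.
deepspeed --num_gpus 8 finetune_deepseek_ds.py \
    --deepspeed ds_config2_zero3.json \
    --model_name_or_path deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct
```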
ds_report output
System info:
Launcher context
Launching with deepspeed
Additional context
Running the same fine-tuning with smaller DeepSeek models (1B and 7B) runs to completion on 8 GPUs. I am currently trying the largest DeepSeek model (200B).
@rraminen @jithunnair-amd