Hi, I am trying to use autoresume to continue training my failed jobs, but I get the following error:
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/fsdp/_exec_order_utils.py", line 243, in _check_order
RuntimeError: Forward order differs across ranks: rank 0 is all-gathering 1 parameters while rank 4 is all-gathering -2142209024 parameters
When I train a model on a single node, save a checkpoint, and set autoresume=True to continue training on a single node, it works.
However, when I train a model on 16 nodes, save a checkpoint, and then use 1 or 16 nodes to autoresume, I get the aforementioned error.
I googled it, but only found this Stack Overflow question. Same error, but no answer yet.
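For reference, here is roughly how the run is set up (the model, dataloader, paths, and run name below are placeholders, not my real config):

```python
from composer import Trainer

# Minimal sketch of the failing setup; my_model, my_train_dataloader,
# and the save folder are placeholders for the real 16-node FSDP run.
trainer = Trainer(
    model=my_model,
    train_dataloader=my_train_dataloader,
    max_duration='10ep',
    run_name='my-fsdp-run',          # kept identical across restarts so autoresume can find the checkpoints
    save_folder='/checkpoints/{run_name}',
    save_interval='1000ba',
    autoresume=True,                 # resume from the latest checkpoint in save_folder
)
trainer.fit()
```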
Apologies for the delay! Are you able to specify the checkpoint you want to load using load_path instead of autoresume=True? Or do you hit the same error?
@Landanjs Yes, I am able to use load_path. However, the job gets stuck at the very beginning if I use load_path=/path/of/checkpoint and set load_weights_only=False.
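In case it helps, the resume call looks roughly like this (the checkpoint path is a placeholder):

```python
from composer import Trainer

# Sketch of the resume attempt described above; the checkpoint path is a placeholder.
trainer = Trainer(
    model=my_model,
    train_dataloader=my_train_dataloader,
    max_duration='10ep',
    load_path='/path/of/checkpoint',   # placeholder for the real checkpoint file
    load_weights_only=False,           # also restore optimizer, scheduler, and timestamp state
)
trainer.fit()  # hangs at the very beginning with this config
```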