Hi, I am trying to use autoresume to continue training my failed jobs, but I get the following error:
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/fsdp/_exec_order_utils.py", line 243, in _check_order
RuntimeError: Forward order differs across ranks: rank 0 is all-gathering 1 parameters while rank 4 is all-gathering -2142209024 parameters
When I train a model on a single node, save a checkpoint, and set autoresume=True to continue training on a single node, it works.
However, when I train a model on 16 nodes, save a checkpoint, and then use 1 or 16 nodes to autoresume, I get the aforementioned error.
I googled it, but only found this Stack Overflow question. Same error, but no answer yet.
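For reference, here is roughly how the run is set up (the model, dataloader, paths, and run name below are placeholders, not my real config):

```python
from composer import Trainer

# Minimal sketch of the failing setup; my_model, my_train_dataloader,
# and the save folder are placeholders for the real 16-node FSDP run.
trainer = Trainer(
    model=my_model,
    train_dataloader=my_train_dataloader,
    max_duration='10ep',
    run_name='my-fsdp-run',          # kept identical across restarts so autoresume can find the checkpoints
    save_folder='/checkpoints/{run_name}',
    save_interval='1000ba',
    autoresume=True,                 # resume from the latest checkpoint in save_folder
)
trainer.fit()
```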
Apologies for the delay! Are you able to specify the checkpoint you want to load using load_path instead of autoresume=True? Or do you hit the same error?
@Landanjs Yes, I am able to use load_path. However, the job gets stuck at the very beginning if I use load_path=/path/of/checkpoint and set load_weights_only=False.
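In case it helps, the resume call looks roughly like this (the checkpoint path is a placeholder):

```python
from composer import Trainer

# Sketch of the resume attempt described above; the checkpoint path is a placeholder.
trainer = Trainer(
    model=my_model,
    train_dataloader=my_train_dataloader,
    max_duration='10ep',
    load_path='/path/of/checkpoint',   # placeholder for the real checkpoint file
    load_weights_only=False,           # also restore optimizer, scheduler, and timestamp state
)
trainer.fit()  # hangs at the very beginning with this config
```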