resume and load_from usage #1648
Unanswered · Ruining0916 asked this question in Q&A
Replies: 0
Hi, I am trying to use the resume and load_from features to resume training from my last checkpoint. Here is the issue:
From mmengine's saving logic, the checkpoints are saved as:
/work_dir
  - log_dir/log
  - iter_x.pth
  - iter_y.pth
However, self.resume() / self.load_from() go through get_ckpt_name from the DeepSpeed engine: here. From that function, we can see that files named "mp_rank_" + mp_rank_str + "_model_states.pt" (or, with a prefix, f"(unknown)mp_rank{mp_rank_str}_model_states.pt") are required to load model states. In addition, the engine calls load_module_state_dict instead of load_state_dict directly, which requires a "module" key in the checkpoint, but that key is missing from the checkpoint files currently under work_dir.
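If I understand the naming scheme correctly, DeepSpeed resolves the checkpoint path roughly like the sketch below. Note this is my own illustrative re-implementation for discussion, not the actual DeepSpeed function; the function name `deepspeed_ckpt_name` and the two-digit rank padding are assumptions based on the file names described above.

```python
import os

def deepspeed_ckpt_name(checkpoints_path, tag, mp_rank=0):
    """Illustrative sketch (not the real DeepSpeed API) of the path the
    engine appears to look for: <dir>/<tag>/mp_rank_<xx>_model_states.pt."""
    mp_rank_str = f"{mp_rank:02d}"  # assumed zero-padded rank, e.g. "00"
    return os.path.join(checkpoints_path, str(tag),
                        "mp_rank_" + mp_rank_str + "_model_states.pt")

print(deepspeed_ckpt_name("/work_dir", "iter_1000"))
# -> /work_dir/iter_1000/mp_rank_00_model_states.pt
```

Since mmengine writes flat iter_x.pth files directly under work_dir, nothing matches that nested tag-directory pattern.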
As only iter_x.pth / iter_y.pth / last_checkpoint exist under work_dir, get_ckpt_list cannot find any *_model_states.pt files there, so I hit the error message shown below.
Could you help me clarify the correct usage of load_from and resume? I'd really appreciate your insights!
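For context, the workaround I've been experimenting with is to repackage a flat mmengine checkpoint into the layout described above: wrap the weights under a "module" key and place the file under a tag directory. This is a hedged sketch only; `convert_flat_ckpt` is my own hypothetical helper, and pickle stands in for torch.save/torch.load so it runs without PyTorch.

```python
import os
import pickle
import tempfile

def convert_flat_ckpt(flat_ckpt_path, out_dir, tag, mp_rank=0):
    """Hypothetical workaround sketch: wrap a flat state dict under a
    'module' key and save it at <out_dir>/<tag>/mp_rank_<xx>_model_states.pt,
    the layout the DeepSpeed engine appears to search for.
    pickle is a stand-in for torch.load/torch.save here."""
    with open(flat_ckpt_path, "rb") as f:
        ckpt = pickle.load(f)
    # mmengine checkpoints usually keep the weights under "state_dict";
    # fall back to treating the whole object as the state dict.
    state_dict = ckpt.get("state_dict", ckpt)
    tag_dir = os.path.join(out_dir, str(tag))
    os.makedirs(tag_dir, exist_ok=True)
    target = os.path.join(tag_dir, f"mp_rank_{mp_rank:02d}_model_states.pt")
    with open(target, "wb") as f:
        pickle.dump({"module": state_dict}, f)
    return target

# Usage: fake a flat checkpoint, convert it, and inspect the result.
with tempfile.TemporaryDirectory() as tmp:
    flat = os.path.join(tmp, "iter_1000.pth")
    with open(flat, "wb") as f:
        pickle.dump({"state_dict": {"w": [1.0, 2.0]}}, f)
    out = convert_flat_ckpt(flat, tmp, "iter_1000")
    with open(out, "rb") as f:
        wrapped = pickle.load(f)
    print(os.path.basename(out), "module" in wrapped)
```

I'm not sure this covers optimizer/ZeRO partition states, though, which is part of why I'm asking about the intended usage.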