Loading a pretrained model using deepspeed fails with assert len(self.ckpt_list) > 0 #1493

pdmct · 2024-02-17T07:37:02Z

Hi,

I am trying to use DeepSpeed (via mmengine) to train a DeepLabV3plus model from mmsegmentation on multiple GPUs. I have followed the example examples/distributed_training_with_flexiblerunner.py, which uses ZeRO stage 3.

I would like to load pretrained model weights so I can fine-tune the model on my dataset. For example, I am using this pretrained model:
https://download.openmmlab.com/mmsegmentation/v0.5/deeplabv3plus/deeplabv3plus_r101-d8_512x1024_40k_cityscapes/deeplabv3plus_r101-d8_512x1024_40k_cityscapes_20200605_094614-3769eecf.pth
However, I am hitting the assertion assert len(self.ckpt_list) > 0, as it looks like the checkpoints it is looking for are on a per-rank basis.

Here is the stack trace:

Traceback (most recent call last):
  File "/root/.clearml/venvs-builds/3.9/lib/python3.9/site-packages/clearml/binding/hydra_bind.py", line 230, in _patched_task_function
    return task_function(a_config, *a_args, **a_kwargs)
  File "/root/.clearml/venvs-builds/3.9/task_repository/cv-asset-mmsegmentation.git/dist_train.py", line 217, in train
    runner.train()
  File "/root/.clearml/venvs-builds/3.9/lib/python3.9/site-packages/mmengine/runner/_flexible_runner.py", line 1195, in train
    self.load_or_resume()
  File "/root/.clearml/venvs-builds/3.9/lib/python3.9/site-packages/mmengine/runner/_flexible_runner.py", line 1144, in load_or_resume
    self.load_checkpoint(self._load_from)
  File "/root/.clearml/venvs-builds/3.9/lib/python3.9/site-packages/mmengine/runner/_flexible_runner.py", line 1528, in load_checkpoint
    self.strategy.load_checkpoint(
  File "/root/.clearml/venvs-builds/3.9/lib/python3.9/site-packages/mmengine/_strategy/deepspeed.py", line 434, in load_checkpoint
    _, extra_ckpt = self.model.load_checkpoint(
  File "/root/.clearml/venvs-builds/3.9/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2750, in load_checkpoint
    load_path, client_states = self._load_checkpoint(load_dir,
  File "/root/.clearml/venvs-builds/3.9/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2802, in _load_checkpoint
    sd_loader = SDLoaderFactory.get_sd_loader(ckpt_list, checkpoint_engine=self.checkpoint_engine)
  File "/root/.clearml/venvs-builds/3.9/lib/python3.9/site-packages/deepspeed/runtime/state_dict_factory.py", line 43, in get_sd_loader
    return MegatronSDLoader(ckpt_list, version, checkpoint_engine)
  File "/root/.clearml/venvs-builds/3.9/lib/python3.9/site-packages/deepspeed/runtime/state_dict_factory.py", line 193, in __init__
    super().__init__(ckpt_list, version, checkpoint_engine)
  File "/root/.clearml/venvs-builds/3.9/lib/python3.9/site-packages/deepspeed/runtime/state_dict_factory.py", line 55, in __init__
    self.check_ckpt_list()
  File "/root/.clearml/venvs-builds/3.9/lib/python3.9/site-packages/deepspeed/runtime/state_dict_factory.py", line 168, in check_ckpt_list
    assert len(self.ckpt_list) > 0
AssertionError

I couldn't find any mention of this situation in the documentation (if there is some, please point it out).
How would I go about starting from this pretrained model?
Is there a process for converting this pretrained model into a per-rank checkpoint that can be loaded?
Or is it just a case of renaming the pretrained weights checkpoint for each rank to match the filenames that DeepSpeed expects? For example, the DeepSpeed code looks for checkpoint names like:

filename = "zero_pp_rank_{}".format(dist.get_rank(group=self.optimizer.dp_process_group))
ckpt_name = os.path.join(
    checkpoints_path,
    str(tag),
    f"{filename}_mp_rank_{mp_rank_str}_model_states.pt",
)
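
To make the naming pattern above concrete, here is a minimal sketch that builds the per-rank filename from that same pattern and reports which files are missing from a directory. The helper names (expected_zero_ckpt_name, missing_rank_files), the default mp_rank_str of "00", and the idea of iterating over a world_size are assumptions for illustration; they are not part of DeepSpeed's API.

```python
import os


def expected_zero_ckpt_name(checkpoints_path, tag, dp_rank, mp_rank_str="00"):
    """Build a per-rank model-states path using the pattern quoted above.

    mp_rank_str="00" is an assumed default for runs without model parallelism.
    """
    filename = "zero_pp_rank_{}".format(dp_rank)
    return os.path.join(
        checkpoints_path,
        str(tag),
        f"{filename}_mp_rank_{mp_rank_str}_model_states.pt",
    )


def missing_rank_files(checkpoints_path, tag, world_size):
    """Return the expected per-rank files that do not exist on disk."""
    expected = [
        expected_zero_ckpt_name(checkpoints_path, tag, rank)
        for rank in range(world_size)
    ]
    return [path for path in expected if not os.path.isfile(path)]
```

A single downloaded .pth file has none of these per-rank files next to it, so a check like this would report every rank as missing.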

Thanks

pdmct · 2024-02-21T20:47:04Z

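The answer is: don't use load_from for the pretrained model; use init_cfg instead. As the stack trace shows, load_from is handed to DeepSpeed's own checkpoint loader, which looks for DeepSpeed-format checkpoint files rather than a single .pth file.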
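
A minimal sketch of what that change might look like in a dict-style config, assuming the model accepts an init_cfg entry and that mmengine's Pretrained initializer can read the checkpoint's keys. The helper use_init_cfg and the stub config are hypothetical; a real config would carry the full backbone and decode head settings.

```python
import copy

# URL taken from the question above.
PRETRAINED_URL = (
    "https://download.openmmlab.com/mmsegmentation/v0.5/deeplabv3plus/"
    "deeplabv3plus_r101-d8_512x1024_40k_cityscapes/"
    "deeplabv3plus_r101-d8_512x1024_40k_cityscapes_20200605_094614-3769eecf.pth"
)


def use_init_cfg(cfg, checkpoint):
    """Return a copy of cfg that loads weights via model.init_cfg, not load_from."""
    new_cfg = copy.deepcopy(cfg)
    # Leave load_from unset so the DeepSpeed checkpoint loader is never invoked.
    new_cfg["load_from"] = None
    # 'Pretrained' is mmengine's initializer type for loading a checkpoint file.
    new_cfg["model"]["init_cfg"] = dict(type="Pretrained", checkpoint=checkpoint)
    return new_cfg


# Stub config for illustration only; the real model dict has many more fields.
cfg = dict(model=dict(type="EncoderDecoder"), load_from=PRETRAINED_URL)
cfg = use_init_cfg(cfg, PRETRAINED_URL)
```

With this shape, the weights are applied when the model initializes its parameters, so no per-rank DeepSpeed checkpoint files are needed for the first run.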
