HF2UCP: Converting a pytorch_model.bin or .safetensors checkpoint to UCP #7212
base: master
Conversation
Signed-off-by: Schwidola0607 <[email protected]>
@@ -0,0 +1,36 @@
---
How about moving this document to the Megatron-DeepSpeed repository? The purpose of converting the Hugging Face checkpoint to UCP is to enable loading with 4D parallelism, so the Megatron-DeepSpeed repo might be a more suitable place for it.
optim_sd = torch.load(optim_state_path, weights_only=False)
self._load_global_state(optim_sd)
if os.path.isfile(optim_state_path):
Suggestion: update the logic as follows, so a universal checkpoint tolerates a missing optimizer state path with a warning, while a missing file is still a hard error otherwise:

    if universal_checkpoint:
        # Allow a missing optimizer state path, but issue a warning
        if optim_state_path is None:
            warnings.warn("Optimizer state path is missing; skipping optimizer state load.")
    else:
        assert os.path.isfile(optim_state_path), \
            f"Optimizer state file not found: {optim_state_path}"
optim_sd = torch.load(optim_state_path, weights_only=False)
self._load_global_state_stage3(optim_sd)
if os.path.isfile(optim_state_path):
ignore_missing_optim_state = False
Suggestion: update the logic here in the same way, so a universal checkpoint tolerates a missing optimizer state path with a warning, while a missing file is still a hard error otherwise:

    if universal_checkpoint:
        # Allow a missing optimizer state path, but issue a warning
        if optim_state_path is None:
            warnings.warn("Optimizer state path is missing; skipping optimizer state load.")
    else:
        assert os.path.isfile(optim_state_path), \
            f"Optimizer state file not found: {optim_state_path}"
@@ -2947,7 +2947,7 @@ def load_checkpoint(self,

    Returns:
        A tuple of ``load_path`` and ``client_state``.
        *``load_path``: Path of the loaded checkpoint. ``None`` if loading the checkpoint failed or loading a HF based UCP
Missing comma.
PR for HF2UCP feature
Converting a pytorch_model.bin or .safetensors checkpoint to UCP will
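A converter like this has to decide how to read the source checkpoint before translating it to UCP. The following is a minimal sketch of dispatching on the file extension; the helper name `select_loader` is an assumption for illustration and is not part of this PR:

```python
# Hypothetical helper (not from this PR): pick a loader based on the
# checkpoint file extension.
def select_loader(path: str) -> str:
    if path.endswith(".safetensors"):
        # .safetensors checkpoints are typically read with
        # safetensors.torch.load_file
        return "safetensors"
    if path.endswith(".bin") or path.endswith(".pt"):
        # pytorch_model.bin checkpoints are typically read with torch.load
        return "torch"
    raise ValueError(f"Unrecognized checkpoint format: {path}")

print(select_loader("pytorch_model.bin"))
print(select_loader("model.safetensors"))
```

In a real conversion the returned name would select between `torch.load` and `safetensors.torch.load_file` to obtain the state dict that is then resharded into UCP layout.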