Commit a5c23c9

Merge pull request #566 from JamesKunstle/post-save-barrier
adds barriers after checkpoint saving
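
For context, the effect of a post-save barrier can be sketched with a small stand-in simulation. This is not the project's code: threads play the role of ranks, `threading.Barrier` stands in for `torch.distributed.barrier()`, and the names `WORLD_SIZE`, `run_ranks`, and `checkpoint` are hypothetical. It also assumes that only one rank performs the write, which the diff itself does not show.

```python
import threading
import time

WORLD_SIZE = 4  # hypothetical number of ranks, simulated here as threads


def run_ranks(use_barrier: bool) -> list[bool]:
    """Return, per rank, whether the checkpoint was visible right after the save step."""
    checkpoint = {"saved": False}  # stands in for checkpoint files on disk
    barrier = threading.Barrier(WORLD_SIZE)  # stand-in for torch.distributed.barrier()
    seen = [False] * WORLD_SIZE

    def worker(rank: int) -> None:
        if rank == 0:
            time.sleep(0.2)  # pretend the save is slow (assumption: one rank writes)
            checkpoint["saved"] = True
        if use_barrier:
            # Every rank blocks here until all ranks, including the writer, arrive.
            barrier.wait()
        seen[rank] = checkpoint["saved"]

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    print("with barrier:   ", run_ranks(use_barrier=True))
    print("without barrier:", run_ranks(use_barrier=False))
```

With the barrier, no rank proceeds past the save step until the writer has finished. Without it, the other ranks may race ahead of an in-progress save.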
2 parents 1532531 + 4638513 commit a5c23c9

1 file changed: +4 -0 lines changed

src/instructlab/training/main_ds.py

Lines changed: 4 additions & 0 deletions
@@ -505,6 +505,8 @@ def train(
 is_lora=bool(args.lora_r),
 hf_format=True,
 )
+base_logger.debug("RANK (%d) waiting at post-save barrier.", local_rank)
+torch.distributed.barrier()
 
 # if (
 # args.save_samples_ds is not None
@@ -533,6 +535,8 @@ def train(
 hf_format=True,
 epoch=epoch,
 )
+base_logger.debug("RANK (%d) waiting at post-save barrier.", local_rank)
+torch.distributed.barrier()
 
 if args.save_last:
 save_hf_format_accelerate(
