Commit 4638513

adds barriers after checkpoint saving
Non-rank-0 processes were continuing on and waiting for rank 0 at the next forward pass while rank 0 was still saving artifacts, which led to collective timeouts. In testing, adding the barriers reduced the incidence of timeouts because non-rank-0 processes now wait at a known wait point.

Signed-off-by: James Kunstle <[email protected]>
1 parent 1532531 commit 4638513

1 file changed: +4 -0 lines changed

src/instructlab/training/main_ds.py

Lines changed: 4 additions & 0 deletions
@@ -505,6 +505,8 @@ def train(
 is_lora=bool(args.lora_r),
 hf_format=True,
 )
+base_logger.debug("RANK (%d) waiting at post-save barrier.", local_rank)
+torch.distributed.barrier()

 # if (
 # args.save_samples_ds is not None
@@ -533,6 +535,8 @@ def train(
 hf_format=True,
 epoch=epoch,
 )
+base_logger.debug("RANK (%d) waiting at post-save barrier.", local_rank)
+torch.distributed.barrier()

 if args.save_last:
 save_hf_format_accelerate(
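
The synchronization pattern described in the commit message can be illustrated with a small standalone sketch. This is not the code from `main_ds.py`: it simulates ranks as threads and uses the standard library's `threading.Barrier` as a stand-in for `torch.distributed.barrier()`. The names `WORLD_SIZE`, `worker`, `log`, and `run` are hypothetical, and the short sleep only stands in for rank 0 writing a checkpoint.

```python
import threading
import time

WORLD_SIZE = 4  # hypothetical number of ranks, simulated here as threads
barrier = threading.Barrier(WORLD_SIZE)
events = []  # ordered log of (event, rank) tuples
events_lock = threading.Lock()


def log(event, rank):
    # Append to the shared log under a lock so the ordering is well defined.
    with events_lock:
        events.append((event, rank))


def worker(rank):
    if rank == 0:
        time.sleep(0.05)  # stand-in for rank 0 saving the checkpoint
        log("save_done", rank)
    # Every rank parks here until all ranks, including rank 0, have arrived.
    # In the real training loop this role is played by torch.distributed.barrier().
    barrier.wait()
    log("next_step", rank)


def run():
    events.clear()
    threads = [
        threading.Thread(target=worker, args=(rank,)) for rank in range(WORLD_SIZE)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(events)


if __name__ == "__main__":
    for event, rank in run():
        print(event, rank)
```

Because rank 0 records `save_done` before it reaches the barrier, no simulated rank can record `next_step` until the save has finished. In the real job the analogous effect is that non-rank-0 processes idle at an explicit barrier rather than inside the next step's first collective while rank 0 is still writing.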
