adds barriers after checkpoint saving #566

Merged 1 commit on May 27, 2025

Conversation

JamesKunstle
Contributor

@JamesKunstle JamesKunstle commented May 22, 2025

Non-rank-0 processes were moving on to the next forward pass and waiting there for rank 0 while rank 0 was still saving artifacts. This led to collective timeouts. In testing, adding these barriers reduced the incidence of timeouts because non-rank-0 processes now wait at a known wait-point.
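
A minimal sketch of the pattern described above, not the code from this PR: rank 0 saves, then every rank meets at one barrier before the next forward pass. The names `save_checkpoint_with_barrier`, `save_fn`, `barrier_fn`, and `simulate` are hypothetical. Threads stand in for ranks and `threading.Barrier` stands in for the real collective so the example runs without GPUs; in an actual job `barrier_fn` would presumably be `torch.distributed.barrier`.

```python
import threading
from typing import Callable, List


def save_checkpoint_with_barrier(
    rank: int,
    save_fn: Callable[[], None],
    barrier_fn: Callable[[], object],
) -> None:
    """Rank 0 saves; every rank then waits at the same barrier."""
    if rank == 0:
        save_fn()  # only rank 0 writes artifacts
    # All ranks, including rank 0, meet here before continuing.
    barrier_fn()


def simulate(world_size: int = 4) -> List[str]:
    """Simulate ranks with threads (assumption: stand-in for real processes)."""
    barrier = threading.Barrier(world_size)
    events: List[str] = []
    lock = threading.Lock()

    def log(message: str) -> None:
        with lock:
            events.append(message)

    def worker(rank: int) -> None:
        save_checkpoint_with_barrier(rank, lambda: log("save"), barrier.wait)
        log(f"forward:{rank}")  # the next step starts only after the barrier

    threads = [
        threading.Thread(target=worker, args=(rank,)) for rank in range(world_size)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return events


if __name__ == "__main__":
    print(simulate())
```

In this simulation no rank logs a forward step until the save has been recorded. A real barrier is itself a collective with its own timeout, so a sufficiently slow save could still time out there; the description only claims a reduced incidence.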

Signed-off-by: James Kunstle <[email protected]>
@JamesKunstle JamesKunstle self-assigned this May 22, 2025
@mergify mergify bot added the ci-failure label May 22, 2025
E2E (NVIDIA L40S x4) workflow launched on this PR: View run

e2e workflow succeeded on this PR: View run, congrats!

Member

@RobotSail RobotSail left a comment

Thank you for this PR, LGTM!

@mergify mergify bot added the one-approval label May 27, 2025
@JamesKunstle JamesKunstle merged commit a5c23c9 into instructlab:main May 27, 2025
16 of 18 checks passed
@JamesKunstle JamesKunstle deleted the post-save-barrier branch May 27, 2025 23:49
2 participants