fix(sft): sync validation iteration to prevent FSDP deadlock#2636
Merged
Conversation
Variable-length packing yields different per-rank batch counts in the validation dataloader. Under FSDP every forward is a collective, so the first rank to exit run_eval_loop deadlocks the rest in the next all-gather. Sync a has_data flag per batch and exit together as soon as any rank exhausts its iterator. Fixes #2515. Co-authored-by: Cursor <cursoragent@cursor.com>
Drop the preallocated tensor + fill_; constructing a 0-d cuda tensor per iteration is effectively free and reads more directly. Co-authored-by: Cursor <cursoragent@cursor.com>
samsja
approved these changes
May 25, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
run_eval_loopwith a per-batchall_reduce(has_data, MIN)so every rank stops on the same iteration. Any rank exhausting its dataloader pulls the rest out together; the post-loop loss / token / NaN reductions then run in lockstep.Why
Under FSDP, each forward is a collective (param all-gather on the dp_shard group).
run_eval_loopiterates the val dataloader to exhaustion (max_epochs=1), butCatDatasetdrops the trailing partial chunk so per-rank batch counts diverge for variable-length data. The first rank out reaches the post-loopall_reduce(total_loss_sum)while others are still inside an FSDP all-gather → NCCL watchdog timeout.Verification
Minimal 2-GPU repro on
PrimeIntellect/Reverse-Text-SFT(seq_len=100,shuffle=true,seed=0gives rank 0: 327 batches, rank 1: 338 batches):Before — NCCL watchdog timeout at
run_eval_looppost-loopall_reduce:After:
Note
Low Risk
Small, targeted change to validation-only loop coordination; no auth, data, or training-step logic changes.
Overview
Fixes FSDP validation deadlocks when variable-length
catpacking gives different micro-batch counts per rank (e.g. trailing partial chunks dropped unevenly).run_eval_loopno longer assumes every rank can iterate the val loader the same number of times. Each iteration, ranksall_reduceahas_dataflag withMINso the loop stops together as soon as any rank exhausts its iterator, keeping every validation forward (and thus FSDP collectives) aligned before the final loss/token/NaN reductions.Reviewed by Cursor Bugbot for commit e140a7f. Bugbot is set up for automated code reviews on this repo. Configure here.