
fix(sft): sync validation iteration to prevent FSDP deadlock#2636

Merged
mikasenghaas merged 2 commits into main from fix/sft-eval-fsdp-deadlock on May 25, 2026

@mikasenghaas mikasenghaas commented May 25, 2026

Summary

  • Fixes #2515 (Validation loop deadlocks with FSDP due to unsynchronized iteration): SFT validation deadlocks with FSDP when ranks see different numbers of variable-length packed batches.
  • Coordinate exit from run_eval_loop with a per-batch all_reduce(has_data, MIN) so every rank stops on the same iteration. Any rank exhausting its dataloader pulls the rest out together; the post-loop loss / token / NaN reductions then run in lockstep.
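The coordinated exit described above can be sketched without GPUs. The following is a minimal simulation, not the code from this PR: ranks are threads, `SimulatedMinAllReduce` is a hypothetical stand-in for `all_reduce(has_data, MIN)`, and adding a float stands in for the forward pass. All names here are assumptions made for illustration.

```python
import threading

_EXHAUSTED = object()  # sentinel: this rank's iterator has no more batches


class SimulatedMinAllReduce:
    """Thread-based stand-in for all_reduce(op=MIN) across world_size ranks."""

    def __init__(self, world_size):
        self._values = []
        self._result = 0
        self._lock = threading.Lock()
        # The barrier action runs once per round, after every rank has
        # contributed its flag and before any rank is released.
        self._barrier = threading.Barrier(world_size, action=self._reduce)

    def _reduce(self):
        self._result = min(self._values)
        self._values.clear()

    def __call__(self, value):
        with self._lock:
            self._values.append(value)
        self._barrier.wait()
        return self._result


def synced_eval_loop(batches, all_reduce_min):
    """Consume batches until ANY rank runs out, then stop on the same iteration."""
    it = iter(batches)
    loss_sum, steps = 0.0, 0
    while True:
        batch = next(it, _EXHAUSTED)
        # 1 if this rank still has a batch, else 0; MIN is 0 once any rank is done.
        has_data = all_reduce_min(0 if batch is _EXHAUSTED else 1)
        if has_data == 0:
            break
        loss_sum += batch  # stand-in for the (collective) forward pass
        steps += 1
    return loss_sum, steps


def run_ranks(per_rank_batches):
    """Run one thread per simulated rank and collect (loss_sum, steps) per rank."""
    reduce_min = SimulatedMinAllReduce(len(per_rank_batches))
    results = [None] * len(per_rank_batches)

    def worker(rank):
        results[rank] = synced_eval_loop(per_rank_batches[rank], reduce_min)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(len(per_rank_batches))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With `run_ranks([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0, 5.0]])`, both simulated ranks stop after three steps. Note that in this sketch the longer rank fetches its fourth batch and then discards it, so surplus batches on slower-to-exhaust ranks are not evaluated.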

Why

Under FSDP, each forward is a collective (parameter all-gather on the dp_shard group). run_eval_loop iterates the val dataloader to exhaustion (max_epochs=1), but CatDataset drops the trailing partial chunk, so per-rank batch counts diverge for variable-length data. The first rank out reaches the post-loop all_reduce(total_loss_sum) while the others are still inside an FSDP all-gather, and the mismatched collectives hang until the NCCL watchdog times out.
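To make the mismatch concrete, here is a rough model of the collectives each rank issues in the unsynchronized loop. It assumes, as a simplification, exactly one all-gather per forward (real FSDP issues several per forward); the helper names are hypothetical and only illustrate the ordering problem.

```python
def unsynced_trace(num_batches):
    """Collectives one rank issues: one all-gather per batch, then the post-loop reduce."""
    return ["all_gather"] * num_batches + ["all_reduce(loss_sum)"]


def first_divergence(traces):
    """Return (step, ops) for the first step where ranks issue different collectives."""
    for step, ops in enumerate(zip(*traces)):
        if len(set(ops)) > 1:
            return step, ops
    return None  # every compared step matches


# Batch counts taken from the 2-GPU repro: rank 0 sees 327, rank 1 sees 338.
divergence = first_divergence([unsynced_trace(327), unsynced_trace(338)])
```

Under these assumptions the ranks agree for 327 steps, then rank 0 issues the post-loop reduction while rank 1 issues another all-gather. Neither call can complete because no peer is participating in the same collective.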

Verification

Minimal 2-GPU repro on PrimeIntellect/Reverse-Text-SFT (seq_len=100, shuffle=true, seed=0 gives rank 0: 327 batches, rank 1: 338 batches):

max_steps = 1
dist_timeout_seconds = 60

[model]
name = "PrimeIntellect/Qwen3-0.6B"

[data]
type = "fake"
batch_size = 2
seq_len = 1024

[val]
interval = 100
eval_on_start = true

[val.data]
type = "sft"
name = "PrimeIntellect/Reverse-Text-SFT"
seq_len = 100
batch_size = 4
micro_batch_size = 1
pack_function = "cat"
shuffle = true
seed = 0

CUDA_VISIBLE_DEVICES=0,1 uv run sft @ repro.toml --deployment.num-gpus 2 --no-wandb

Before (NCCL watchdog timeout at the post-loop all_reduce in run_eval_loop):

[rank0]:[E... ProcessGroupNCCL.cpp:689] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9581, OpType=ALLREDUCE, ...) ran for 60016 milliseconds before timing out.
#0 all_reduce from torch/distributed/distributed_c10d.py:3068
#2 run_eval_loop from src/prime_rl/trainer/sft/train.py:299
#3 run_validation from src/prime_rl/trainer/sft/train.py:318

After:

SUCCESS  Validation | Step 0 | Loss: 6.3760
SUCCESS  SFT trainer finished!

Note

Low Risk
Small, targeted change to validation-only loop coordination; no auth, data, or training-step logic changes.

Overview
Fixes FSDP validation deadlocks when variable-length cat packing gives different micro-batch counts per rank (e.g. trailing partial chunks dropped unevenly).
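As a rough illustration of why counts differ, assume cat packing concatenates a rank's token sequences and cuts them into chunks of `seq_len`, dropping the trailing partial chunk. This is an assumption about the behavior, not the actual CatDataset implementation, and the function name and sample lengths below are made up.

```python
def num_packed_chunks(sample_lengths, seq_len):
    """Full chunks obtained by concatenating samples and dropping the remainder."""
    return sum(sample_lengths) // seq_len


# Two ranks with the same number of samples but different token totals.
rank0_lengths = [60, 70, 80, 90]    # 300 tokens
rank1_lengths = [90, 95, 110, 120]  # 415 tokens, last 15 dropped

rank0_chunks = num_packed_chunks(rank0_lengths, seq_len=100)
rank1_chunks = num_packed_chunks(rank1_lengths, seq_len=100)
```

Even with equal sample counts per rank, the number of full chunks depends on the total token count of each shard, so the per-rank iteration counts are generally not equal.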

run_eval_loop no longer assumes every rank can iterate the val loader the same number of times. Each iteration, ranks all_reduce a has_data flag with MIN so the loop stops together as soon as any rank exhausts its iterator, keeping every validation forward (and thus FSDP collectives) aligned before the final loss/token/NaN reductions.

Reviewed by Cursor Bugbot for commit e140a7f.

mikasenghaas and others added 2 commits May 25, 2026 22:09
Variable-length packing yields different per-rank batch counts in the
validation dataloader. Under FSDP every forward is a collective, so the
first rank to exit run_eval_loop deadlocks the rest in the next
all-gather. Sync a has_data flag per batch and exit together as soon as
any rank exhausts its iterator.

Fixes #2515.

Co-authored-by: Cursor <cursoragent@cursor.com>

Drop the preallocated tensor + fill_; constructing a 0-d cuda tensor
per iteration is effectively free and reads more directly.

Co-authored-by: Cursor <cursoragent@cursor.com>
@mikasenghaas mikasenghaas requested a review from samsja May 25, 2026 23:14
@mikasenghaas mikasenghaas marked this pull request as ready for review May 25, 2026 23:14
@mikasenghaas mikasenghaas merged commit 3f5ee35 into main May 25, 2026
22 checks passed