fix(sft): sync validation iteration to prevent FSDP deadlock by mikasenghaas · Pull Request #2636 · PrimeIntellect-ai/prime-rl

mikasenghaas · 2026-05-25T22:09:25Z

Summary

Fixes #2515 (Validation loop deadlocks with FSDP due to unsynchronized iteration): SFT validation deadlocks under FSDP when ranks see different numbers of variable-length packed batches.
Coordinate exit from run_eval_loop with a per-batch all_reduce(has_data, MIN) so every rank stops on the same iteration. Any rank exhausting its dataloader pulls the rest out together; the post-loop loss / token / NaN reductions then run in lockstep.
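
The coordinated exit described above can be sketched as a small generator. This is an illustrative sketch, not the code from this PR: `synced_batches` and the `all_reduce_min` hook are hypothetical names. In a real run the hook would presumably wrap `torch.distributed.all_reduce` with `ReduceOp.MIN` on a 0-d tensor; here it is a plain callable so the control flow can be read and tested without GPUs.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")

_EXHAUSTED = object()  # sentinel so that a batch equal to None is still valid data


def synced_batches(
    loader: Iterable[T],
    all_reduce_min: Callable[[int], int],  # hypothetical hook: MIN of the flag over all ranks
) -> Iterator[T]:
    """Yield local batches until ANY rank has run out of data."""
    it = iter(loader)
    while True:
        batch = next(it, _EXHAUSTED)
        has_data = 0 if batch is _EXHAUSTED else 1
        # Every rank calls the reduction on every iteration, including its
        # last one, so the number of collectives is identical on all ranks.
        if all_reduce_min(has_data) == 0:
            return  # some rank is exhausted: all ranks leave on this iteration
        yield batch


# Single-process usage: with one rank, MIN over ranks is just the local flag.
print(list(synced_batches(["b0", "b1", "b2"], lambda flag: flag)))
```

A rank that still has data when a peer runs out discards the batch it just fetched, which is the price of keeping every forward (and its FSDP collectives) aligned.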

Why

Under FSDP, each forward is a collective (param all-gather on the dp_shard group). run_eval_loop iterates the val dataloader to exhaustion (max_epochs=1), but CatDataset drops the trailing partial chunk so per-rank batch counts diverge for variable-length data. The first rank out reaches the post-loop all_reduce(total_loss_sum) while others are still inside an FSDP all-gather → NCCL watchdog timeout.
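
To make the mismatch concrete, the following single-process simulation counts how many forwards each rank would run with and without the per-batch sync. It uses the batch counts reported in the Verification section (327 and 338). The function name `simulate_synced_eval` is hypothetical, and the model is a simplification: one forward per batch and a single MIN reduction per iteration.

```python
def simulate_synced_eval(batches_per_rank):
    """Simulate the lockstep loop; return (forwards, sync_rounds, leftover)."""
    remaining = list(batches_per_rank)
    forwards = [0] * len(remaining)
    sync_rounds = 0
    while True:
        # Each rank contributes has_data = 1 if it still has a batch, else 0.
        flags = [1 if r > 0 else 0 for r in remaining]
        sync_rounds += 1
        if min(flags) == 0:  # stands in for all_reduce(has_data, MIN)
            break
        for rank in range(len(remaining)):
            remaining[rank] -= 1
            forwards[rank] += 1
    return forwards, sync_rounds, remaining


counts = [327, 338]

# Without the sync, each rank runs one forward per local batch, so rank 1
# enters 11 more FSDP all-gathers than rank 0 ever joins.
print(max(counts) - min(counts))  # 11

forwards, sync_rounds, leftover = simulate_synced_eval(counts)
print(forwards)     # [327, 327]
print(sync_rounds)  # 328
print(leftover)     # [0, 11]
```

Under these assumptions, the synced loop evaluates only the first 327 batches on each rank, and rank 1 leaves its last 11 batches unevaluated.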

Verification

Minimal 2-GPU repro on PrimeIntellect/Reverse-Text-SFT (seq_len=100, shuffle=true, seed=0 gives rank 0: 327 batches, rank 1: 338 batches):

max_steps = 1
dist_timeout_seconds = 60

[model]
name = "PrimeIntellect/Qwen3-0.6B"

[data]
type = "fake"
batch_size = 2
seq_len = 1024

[val]
interval = 100
eval_on_start = true

[val.data]
type = "sft"
name = "PrimeIntellect/Reverse-Text-SFT"
seq_len = 100
batch_size = 4
micro_batch_size = 1
pack_function = "cat"
shuffle = true
seed = 0

CUDA_VISIBLE_DEVICES=0,1 uv run sft @ repro.toml --deployment.num-gpus 2 --no-wandb

Before — NCCL watchdog timeout at run_eval_loop post-loop all_reduce:

[rank0]:[E... ProcessGroupNCCL.cpp:689] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9581, OpType=ALLREDUCE, ...) ran for 60016 milliseconds before timing out.
#0 all_reduce from torch/distributed/distributed_c10d.py:3068
#2 run_eval_loop from src/prime_rl/trainer/sft/train.py:299
#3 run_validation from src/prime_rl/trainer/sft/train.py:318

After:

SUCCESS  Validation | Step 0 | Loss: 6.3760
SUCCESS  SFT trainer finished!

Note

Low Risk
Small, targeted change to validation-only loop coordination; no auth, data, or training-step logic changes.

Overview
Fixes FSDP validation deadlocks when variable-length cat packing gives different micro-batch counts per rank (e.g. trailing partial chunks dropped unevenly).

run_eval_loop no longer assumes every rank can iterate the val loader the same number of times. Each iteration, ranks all_reduce a has_data flag with MIN so the loop stops together as soon as any rank exhausts its iterator, keeping every validation forward (and thus FSDP collectives) aligned before the final loss/token/NaN reductions.

Reviewed by Cursor Bugbot for commit e140a7f.

Variable-length packing yields different per-rank batch counts in the validation dataloader. Under FSDP every forward is a collective, so the first rank to exit run_eval_loop deadlocks the rest in the next all-gather. Sync a has_data flag per batch and exit together as soon as any rank exhausts its iterator.

Fixes #2515.

Co-authored-by: Cursor <cursoragent@cursor.com>

mikasenghaas and others added 2 commits May 25, 2026 22:09

refactor(sft): inline has_data tensor in eval sync loop

e140a7f

Drop the preallocated tensor + fill_; constructing a 0-d cuda tensor per iteration is effectively free and reads more directly. Co-authored-by: Cursor <cursoragent@cursor.com>

mikasenghaas requested a review from samsja May 25, 2026 23:14

mikasenghaas marked this pull request as ready for review May 25, 2026 23:14

samsja approved these changes May 25, 2026

View reviewed changes

mikasenghaas merged commit 3f5ee35 into main May 25, 2026
22 checks passed

mikasenghaas merged 2 commits into main from fix/sft-eval-fsdp-deadlock