fix(sft): prevent validation deadlock with FSDP by zch42 · Pull Request #2516 · PrimeIntellect-ai/prime-rl

zch42 · 2026-05-16T00:29:37Z

Fix #2515: unsynchronized iteration in run_eval_loop that deadlocks when ranks have different validation batch counts.

With FSDP, each forward pass is a collective. Ranks get different batch counts from variable-length data. When one rank exits the loop first, others deadlock on the all-gather.
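To make the failure mode concrete, here is a minimal single-process sketch. It is not prime-rl code: threads stand in for ranks, and a `threading.Barrier` with a timeout stands in for the collective inside each FSDP forward (a real collective would typically keep blocking instead of raising). The names `run_rank` and `simulate` are hypothetical.

```python
import threading
from typing import List


def run_rank(num_batches: int, barrier: threading.Barrier,
             results: List[str], rank: int) -> None:
    """One simulated rank: every 'forward' must meet all other ranks."""
    try:
        for _ in range(num_batches):
            # Stand-in for the all-gather inside an FSDP forward pass.
            barrier.wait(timeout=1.0)
        results[rank] = "finished"
    except threading.BrokenBarrierError:
        # The peers never arrived: this rank would hang in a real run.
        results[rank] = "hung"


def simulate(batch_counts: List[int]) -> List[str]:
    """Run one thread per rank with the given per-rank batch counts."""
    barrier = threading.Barrier(len(batch_counts))
    results = [""] * len(batch_counts)
    threads = [
        threading.Thread(target=run_rank, args=(n, barrier, results, rank))
        for rank, n in enumerate(batch_counts)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With batch counts `[2, 3]`, the first rank finishes its loop while the second is left waiting on a third collective that nobody else joins; with equal counts both ranks finish.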

Fix: one scalar all-reduce per batch so all ranks exit together.

Note

Medium Risk
Changes distributed validation control flow to add a per-batch all_reduce synchronization, which could affect validation performance or mask dataloader issues but is localized to eval.

Overview
Fixes a distributed validation deadlock where ranks could iterate different numbers of validation batches under FSDP and hang in collectives.

`run_eval_loop` now iterates with `next(..., None)` and uses a per-batch `dist.all_reduce` on a `has_data` flag so all ranks exit the eval loop together before aggregating loss/token/NaN counts.

Reviewed by Cursor Bugbot for commit c890d17. Bugbot is set up for automated code reviews on this repo.

Refactor the data iteration logic in the SFT training loop to handle cases where the data iterator may be exhausted. Replace the for loop with a while loop that checks for data availability using `next(data_iter, None)` and a tensor flag to ensure the training process exits gracefully when no more data is available. This change enhances robustness and prevents potential runtime errors during training.
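The loop shape described above can be sketched as a lock-step simulation in plain Python. This is an illustration under stated assumptions, not the PR's diff: `synced_eval_steps` is a hypothetical name, all ranks are simulated in one process, and `min(...)` stands in for the scalar collective. In a real run that step would be something like `dist.all_reduce(flag, op=dist.ReduceOp.MIN)` on a one-element tensor; the choice of `MIN` is an assumption, since the description only says the flag is all-reduced.

```python
from typing import List, Sequence


def synced_eval_steps(batches_per_rank: Sequence[Sequence[object]]) -> List[int]:
    """Return how many forward passes each simulated rank runs."""
    iters = [iter(batches) for batches in batches_per_rank]
    steps = [0] * len(iters)
    while True:
        # Each rank pulls its next batch; None marks an exhausted iterator.
        local_batches = [next(it, None) for it in iters]
        has_data = [0 if batch is None else 1 for batch in local_batches]
        # Stand-in for the per-batch scalar all-reduce (assumed MIN):
        # the result is 0 as soon as any rank has run out of data.
        all_have_data = min(has_data)
        if all_have_data == 0:
            break  # every rank leaves the loop on the same iteration
        for rank in range(len(iters)):
            steps[rank] += 1  # placeholder for the FSDP forward on this batch
    return steps


print(synced_eval_steps([["a", "b", "c"], ["a", "b"]]))  # [2, 2]
```

Under this assumption every rank runs the same number of forwards, at the cost of skipping the trailing batches on ranks that had more data.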

samsja · 2026-05-16T02:38:09Z

Hey,

> With FSDP, each forward pass is a collective. Ranks get different batch counts from variable-length data. When one rank exits the loop first, others deadlock on the all-gather.

I am not sure I understand when this happens.

zch42 · 2026-05-16T10:42:37Z

> I am not sure I understand when this happens.

@samsja CatDataset (cat packing) produces a different number of packed chunks per rank for variable-length data (seq_len packing drops tails). Validation runs to exhaustion (max_epochs=1), so ranks execute different numbers of FSDP forwards. Since each forward contains collectives, some ranks exit early while others are still in all-gathers.
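A small sketch of how the chunk counts can diverge. This is not the actual CatDataset code: it assumes examples are sharded round-robin across ranks and that each rank concatenates its tokens, cuts them into `seq_len` chunks, and drops the incomplete tail. `packed_chunks_per_rank` and the example lengths are hypothetical.

```python
from typing import List


def packed_chunks_per_rank(lengths: List[int], world_size: int, seq_len: int) -> List[int]:
    """Count full seq_len chunks each rank would produce from its shard."""
    counts = []
    for rank in range(world_size):
        # Assumed sharding: example i goes to rank i % world_size.
        rank_tokens = sum(lengths[rank::world_size])
        # Concatenate, cut into seq_len chunks, drop the incomplete tail.
        counts.append(rank_tokens // seq_len)
    return counts


lengths = [3000, 1000, 5000, 500, 2000, 4000]
print(packed_chunks_per_rank(lengths, world_size=2, seq_len=4096))  # [2, 1]
```

Here rank 0 receives 10000 tokens (2 full chunks) and rank 1 receives 5500 tokens (1 full chunk), so rank 0 would attempt one more forward than rank 1.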

samsja · 2026-05-18T22:30:29Z

> @samsja CatDataset (cat packing) produces a different number of packed chunks per rank for variable-length data (seq_len packing drops tails). Validation runs to exhaustion (max_epochs=1), so ranks execute different numbers of FSDP forwards. Since each forward contains collectives, some ranks exit early while others are still in all-gathers.

We never had this issue; I'm pretty sure it's handled somewhere else. Can you show how to reproduce the issue?

zch42 · 2026-05-19T20:22:25Z

> We never had this issue; I'm pretty sure it's handled somewhere else. Can you show how to reproduce the issue?

@samsja
prime-rl: 7da26cd
Setup: Qwen3.6-35B-A3B, 2 nodes × 8 GPUs (16 ranks FSDP)
Symptom: Training runs fine, hangs indefinitely when validation triggers. No error or traceback.
Repro config (deadlocks):

[val]
interval = 10

[val.data]
type = "sft"
micro_batch_size = 1
batch_size = 32
pack_function = "cat"
seq_len = 4096
shuffle = false
Val dataset: 748 examples with variable token lengths (mean=1628, std=618, max=9520).
Works with 64, 67, or 70 val examples. Hangs with 48, 85, 100, 128, 200, or 748 examples (same config otherwise).

zch42 mentioned this pull request May 19, 2026

Validation loop deadlocks with FSDP due to unsynchronized iteration #2515

Open


zch42 wants to merge 1 commit into PrimeIntellect-ai:main from zch42:fix/sft-val-deadlock
