fix(sft): prevent validation deadlock with FSDP #2516

Open
zch42 wants to merge 1 commit into PrimeIntellect-ai:main from zch42:fix/sft-val-deadlock

zch42 (Contributor) commented May 16, 2026

Fixes #2515: unsynchronized iteration in `run_eval_loop` deadlocks when ranks have different numbers of validation batches.

With FSDP, each forward pass is a collective. Ranks get different batch counts from variable-length data. When one rank exits the loop first, others deadlock on the all-gather.

Fix: one scalar all-reduce per batch so all ranks exit together.
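
A minimal sketch of this pattern, not the actual diff: each rank pulls its next batch with `next(..., None)`, then all ranks agree on whether to continue. The names `synced_batches` and `all_ranks_have_data` are hypothetical, and the MIN reduction (stop as soon as any rank is exhausted) is an assumption about how the flag is combined.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def synced_batches(
    batches: Iterable[T],
    all_ranks_have_data: Callable[[bool], bool],
) -> Iterator[T]:
    """Yield local batches until any rank reports that it has run out.

    `all_ranks_have_data` (hypothetical) must be called by every rank once
    per step and return True only if every rank passed True.
    """
    data_iter = iter(batches)
    while True:
        batch = next(data_iter, None)
        # Every rank reaches this call on every step, including the step
        # on which its own iterator is exhausted, so no rank waits alone.
        if not all_ranks_have_data(batch is not None):
            return
        yield batch


# With torch.distributed, the callable could look like this (assumption,
# shown as a comment so the sketch runs without a process group):
#
#   def all_ranks_have_data(has_data: bool) -> bool:
#       flag = torch.tensor([int(has_data)], device=device)
#       dist.all_reduce(flag, op=dist.ReduceOp.MIN)
#       return bool(flag.item())
```

In a single-process run the callable can simply be `lambda has_data: has_data`, and the loop behaves like a plain `for`. The cost is one extra scalar collective per batch, and under the MIN assumption ranks with more batches skip their surplus.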


Note

Medium Risk
Changes distributed validation control flow to add a per-batch all_reduce synchronization, which could affect validation performance or mask dataloader issues but is localized to eval.

Overview
Fixes a distributed validation deadlock where ranks could iterate different numbers of validation batches under FSDP and hang in collectives.

`run_eval_loop` now iterates with `next(..., None)` and uses a per-batch `dist.all_reduce` on a `has_data` flag so all ranks exit the eval loop together before aggregating loss, token, and NaN counts.

Reviewed by Cursor Bugbot for commit c890d17.

Refactor the data iteration logic in the SFT training loop to handle cases where the data iterator may be exhausted. Replace the for loop with a while loop that checks for data availability using `next(data_iter, None)` and a tensor flag to ensure the training process exits gracefully when no more data is available. This change enhances robustness and prevents potential runtime errors during training.

samsja (Member) commented May 16, 2026

Hey,

> With FSDP, each forward pass is a collective. Ranks get different batch counts from variable-length data. When one rank exits the loop first, others deadlock on the all-gather.

I'm not sure I understand when this happens.

zch42 (Contributor, Author) commented May 16, 2026

> Hey,
>
> > With FSDP, each forward pass is a collective. Ranks get different batch counts from variable-length data. When one rank exits the loop first, others deadlock on the all-gather.
>
> I'm not sure I understand when this happens.

@samsja `CatDataset` (cat packing) produces a different number of packed chunks per rank for variable-length data, because packing to `seq_len` drops tails. Validation runs to exhaustion (`max_epochs=1`), so ranks execute different numbers of FSDP forwards. Since each forward contains collectives, the ranks that exit early leave the others blocked in all-gathers that never complete.
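
To make the mismatch concrete, here is a toy model of the effect, not `CatDataset`'s actual code. It assumes examples are sharded round-robin across ranks and that cat packing concatenates each rank's tokens and keeps only full `seq_len` chunks. Both the sharding scheme and the helper names are assumptions for illustration.

```python
from typing import List


def packed_chunk_count(token_lengths: List[int], seq_len: int) -> int:
    """Number of full seq_len chunks after concatenation; the tail is dropped."""
    return sum(token_lengths) // seq_len


def shard_round_robin(items: List[int], world_size: int) -> List[List[int]]:
    """Assumed sharding: rank r takes items r, r + world_size, ..."""
    return [items[rank::world_size] for rank in range(world_size)]


# Made-up token lengths, chosen only to show the imbalance.
lengths = [1200, 3000, 900, 2500, 1800, 4000, 700, 2100]
shards = shard_round_robin(lengths, world_size=2)
counts = [packed_chunk_count(shard, seq_len=4096) for shard in shards]
print(counts)  # [1, 2]
```

Here rank 0 holds 4600 tokens (one chunk) and rank 1 holds 11600 tokens (two chunks), so rank 1 would start a second forward whose all-gather rank 0 never joins.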

samsja (Member) commented May 18, 2026

> > Hey,
> >
> > > With FSDP, each forward pass is a collective. Ranks get different batch counts from variable-length data. When one rank exits the loop first, others deadlock on the all-gather.
> >
> > I'm not sure I understand when this happens.
>
> @samsja CatDataset (cat packing) produces a different number of packed chunks per rank for variable-length data (seq_len packing drops tails). Validation runs to exhaustion (max_epochs=1), so ranks execute different numbers of FSDP forwards. Since each forward contains collectives, some ranks exit early while others are still in all-gathers.

We never had this issue, and I'm pretty sure it's handled somewhere else. Can you show how to reproduce it?

zch42 (Contributor, Author) commented May 19, 2026

> We never had this issue, and I'm pretty sure it's handled somewhere else. Can you show how to reproduce it?

@samsja
prime-rl: 7da26cd
Setup: Qwen3.6-35B-A3B, 2 nodes × 8 GPUs (16 ranks FSDP)
Symptom: Training runs fine, hangs indefinitely when validation triggers. No error or traceback.
Repro config (deadlocks):

[val]
interval = 10

[val.data]
type = "sft"
micro_batch_size = 1
batch_size = 32
pack_function = "cat"
seq_len = 4096
shuffle = false

Val dataset: 748 examples with variable token lengths (mean=1628, std=618, max=9520).
Works with 64, 67, or 70 val examples. Hangs with 48, 85, 100, 128, 200, or 748 examples (same config otherwise).


Successfully merging this pull request may close these issues.

Validation loop deadlocks with FSDP due to unsynchronized iteration
