Validation loop deadlocks with FSDP due to unsynchronized iteration

[run_eval_loop](https://github.com/PrimeIntellect-ai/prime-rl/blob/7da26cd65/src/prime_rl/trainer/sft/train.py#L268-L275) iterates a finite validation set to exhaustion. With FSDP, each forward pass is a collective (all-gather to unshard parameters). Ranks get different batch counts because variable-length samples produce different numbers of fixed-size tensors per rank. When any rank exits the loop first, the rest deadlock on the next all-gather.

Training is immune, it pulls a fixed `grad_accum_steps` from an infinite iterator. Validation iterates to exhaustion, that's where ranks diverge.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Validation loop deadlocks with FSDP due to unsynchronized iteration #2515

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Validation loop deadlocks with FSDP due to unsynchronized iteration #2515

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions