run_eval_loop iterates a finite validation set to exhaustion. With FSDP, each forward pass is a collective (all-gather to unshard parameters). Ranks get different batch counts because variable-length samples produce different numbers of fixed-size tensors per rank. When any rank exits the loop first, the rest deadlock on the next all-gather.
Training is immune, it pulls a fixed grad_accum_steps from an infinite iterator. Validation iterates to exhaustion, that's where ranks diverge.
run_eval_loop iterates a finite validation set to exhaustion. With FSDP, each forward pass is a collective (all-gather to unshard parameters). Ranks get different batch counts because variable-length samples produce different numbers of fixed-size tensors per rank. When any rank exits the loop first, the rest deadlock on the next all-gather.
Training is immune, it pulls a fixed
grad_accum_stepsfrom an infinite iterator. Validation iterates to exhaustion, that's where ranks diverge.