Skip to content

Validation loop deadlocks with FSDP due to unsynchronized iteration #2515

@zch42

Description

@zch42

run_eval_loop iterates a finite validation set to exhaustion. With FSDP, each forward pass is a collective (all-gather to unshard parameters). Ranks get different batch counts because variable-length samples produce different numbers of fixed-size tensors per rank. When any rank exits the loop first, the rest deadlock on the next all-gather.

Training is immune, it pulls a fixed grad_accum_steps from an infinite iterator. Validation iterates to exhaustion, that's where ranks diverge.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions