
Support for DTensor in DecentralizedAverager #629

@samsja

Description


In order to support the hybrid FSDP use case, where we use hivemind to do decentralized training between nodes that each run FSDP, we need to be able to send PyTorch DTensors. (At least this would work with FSDP2; FSDP1 is slightly more complicated.)

I see two ways of supporting it:

  • Either we have one hivemind worker per node, which would call DTensor.full_tensor() to materialize the full weights and then send them,

or

  • Each local rank would run its own hivemind worker and send and receive the result of DTensor.to_local() (see the sketch after this list).
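
As a rough illustration of the two options, here is a minimal sketch of the tensor-conversion side only. The helper names (`gather_full_tensors`, `local_shards`) are hypothetical, the actual `DecentralizedAverager` wiring is omitted, and the import path assumes a recent PyTorch where DTensor is public:

```python
# Sketch only: shows how each option would turn FSDP2's DTensor parameters
# into plain torch.Tensors that hivemind can ship over the network.
import torch
from torch.distributed.tensor import DTensor  # public as of recent PyTorch


def gather_full_tensors(model: torch.nn.Module) -> list[torch.Tensor]:
    """Option 1: one hivemind worker per node.

    full_tensor() all-gathers the shards into a regular torch.Tensor,
    so a single rank per node can hand the full weights to hivemind.
    """
    full = []
    for param in model.parameters():
        if isinstance(param.data, DTensor):
            full.append(param.data.full_tensor())
        else:
            full.append(param.data)
    return full


def local_shards(model: torch.nn.Module) -> list[torch.Tensor]:
    """Option 2: one hivemind worker per local rank.

    to_local() returns this rank's shard without any communication;
    each rank averages only its own shard with its peers.
    """
    shards = []
    for param in model.parameters():
        if isinstance(param.data, DTensor):
            shards.append(param.data.to_local())
        else:
            shards.append(param.data)
    return shards
```

One trade-off worth noting: option 2 presumably only works if all peers use an identical sharding layout, so that shard i on one node lines up with shard i on every other node, whereas option 1 is layout-agnostic at the cost of an all-gather per averaging round.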
