Seeking advice on improving the reliability of communication #624

@samsja

First, thanks for the work on Hivemind. It's a great library, and we have been using it extensively in https://github.com/PrimeIntellect-ai/OpenDiloco.

We have encountered two main issues, and I am looking for tips and best practices on how to avoid them.

  • Peers don't always find each other during DHT initialization. When starting 4 peers, it has happened that two independent DHTs formed with 2 peers each, even though I passed the same initial peers to all of them. Once all peers have joined, desyncs at the DHT level are rare. (A stripped-down repro of both issues is sketched after this list.)

  • Peers get lost during DecentralizedAverager.step(). We have randomly lost a peer mid all-reduce with a class that inherits from DecentralizedAverager, and there never seems to be an obvious reason why the peer left.
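For reference, here is a stripped-down version of what we run. This is a minimal sketch rather than our actual code: initial_peers is a placeholder for our real multiaddrs, OurAverager stands in for our actual subclass, and the probe key is arbitrary.

```python
import torch
import hivemind

# every peer is launched with the same bootstrap multiaddrs (placeholder value)
initial_peers = ["/ip4/127.0.0.1/tcp/31337/p2p/<bootstrap_peer_id>"]

dht = hivemind.DHT(initial_peers=initial_peers, start=True)
print("visible maddrs:", dht.get_visible_maddrs())

# crude split-brain probe: each peer stores a key under its own name, then tries
# to read back the others' keys; in the bad case, a peer only sees keys stored
# inside its own 2-peer partition
dht.store("probe/peer0", "alive", expiration_time=hivemind.get_dht_time() + 60)
print("can see peer0:", dht.get("probe/peer0", latest=True) is not None)

class OurAverager(hivemind.DecentralizedAverager):
    """Stand-in for our actual subclass; the issue reproduces without custom logic."""

averager = OurAverager(
    averaged_tensors=[torch.zeros(1024)],  # dummy parameters
    dht=dht,
    prefix="diloco_averager",  # arbitrary group prefix
    target_group_size=4,
    start=True,
)

# this is the call during which a peer occasionally disappears mid all-reduce
averager.step(timeout=60)
```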

Both of these issues happen relatively often even in local experiments (communicating over localhost), and they naturally get worse on poorly connected machines. I have the feeling they are linked, and that solving them would make decentralized training with hivemind more reliable.

My questions are:

  • Is there a set of DHT/hivemind parameters that would make things more reliable, e.g. timeouts or a retry mechanism? (The knobs we have found so far are sketched after this list.)
  • Is there a part of the networking code that could be at fault here and could be improved? (Happy to dig in more if pointed at where to look.)
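The only knobs I have found so far are the averager's constructor timeouts and step() retries, roughly as below. The parameter names are from the hivemind docs as I understand them, so treat both the names and the values as assumptions rather than a recommendation.

```python
averager = OurAverager(
    averaged_tensors=[torch.zeros(1024)],
    dht=dht,
    prefix="diloco_averager",
    target_group_size=4,
    min_matchmaking_time=10.0,  # give slow peers longer to form a group (assumed name)
    request_timeout=5.0,        # timeout for matchmaking RPCs (assumed name)
    allreduce_timeout=120.0,    # abort a hanging all-reduce instead of blocking forever
    start=True,
)

# allow_retries lets step() rejoin matchmaking after a failed attempt
averager.step(timeout=120, allow_retries=True)
```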

Thanks in advance 🙏
