List of refactoring and code improvement opportunities #114

rishi-s8 · 2024-10-18T20:25:16Z

I am listing a few things that would improve the performance and consistency of the code:

- Use torch functions and tensors wherever possible, including for model averaging, and minimize the use of plain Python data types.
- Migrate functions that use numpy and numpy arrays to torch tensors.
- Ideally, create an append-only log for accuracy, loss, and similar metrics: create a CSV file at the start, then append one line each round instead of keeping the whole log in memory.
- As mentioned in Improve GRPC broadcast implementation #65, the gRPC all_gather and receives from multiple nodes are sequential: each one blocks until the messages arrive in order. A better approach might be to interleave the waiting: when a node's condition is not satisfied (it is not in the current round, or it is too busy), move on to another node and come back to this one later.
- Ideally, we should not poll another node's current round through repeated messages. Instead, the request for a round could carry a condition, and the polled node would respond only once that condition is satisfied.
- Whether a receive is synchronous should be chosen per receive, not determined by the state of the node.
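
To make the first item concrete, here is a rough sketch of model averaging done entirely with torch ops. The function name `average_state_dicts` and the `weights` parameter are hypothetical and not taken from the codebase; it assumes every node contributes a `state_dict` with identical keys and shapes.

```python
import torch


def average_state_dicts(state_dicts, weights=None):
    """Weighted average of model state dicts using torch ops only.

    Hypothetical helper: assumes all dicts share the same keys and shapes.
    """
    n = len(state_dicts)
    if weights is None:
        w = torch.full((n,), 1.0 / n)
    else:
        w = torch.as_tensor(weights, dtype=torch.float32)
        w = w / w.sum()  # normalize so the weights sum to 1

    averaged = {}
    for key, reference in state_dicts[0].items():
        # Stack the n copies of this tensor into shape (n, *param_shape).
        stacked = torch.stack([sd[key].to(torch.float32) for sd in state_dicts])
        # Reshape weights to (n, 1, 1, ...) so they broadcast over the parameter dims.
        shaped = w.to(stacked.device).view(n, *([1] * (stacked.dim() - 1)))
        # Cast back to the original dtype (integer buffers get truncated).
        averaged[key] = (stacked * shaped).sum(dim=0).to(reference.dtype)
    return averaged
```

The result could then be loaded with `model.load_state_dict(averaged)`. Integer buffers such as BatchNorm's `num_batches_tracked` are averaged in float and cast back, which may or may not be the behaviour we want.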
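
For the append-only log, something along these lines might be enough. The class name `AppendOnlyMetricsLog` and the column names are placeholders, and the sketch assumes one writer per file.

```python
import csv
from pathlib import Path


class AppendOnlyMetricsLog:
    """Writes the CSV header once, then appends one row per call.

    Hypothetical helper: nothing is kept in memory between calls.
    """

    def __init__(self, path, fieldnames):
        self.path = Path(path)
        self.fieldnames = list(fieldnames)
        # Create (or truncate) the file and write the header at the start.
        with self.path.open("w", newline="") as f:
            csv.DictWriter(f, fieldnames=self.fieldnames).writeheader()

    def append(self, **row):
        # Open in append mode so earlier rows are never rewritten.
        with self.path.open("a", newline="") as f:
            csv.DictWriter(f, fieldnames=self.fieldnames).writerow(row)
```

Usage would be `log = AppendOnlyMetricsLog("metrics.csv", ["round", "loss", "accuracy"])` once at startup, then `log.append(round=r, loss=l, accuracy=acc)` at the end of each round.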
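
The interleaved receive could be sketched as a round-robin over the peers that have not answered yet. This is only an illustration: `try_receive` stands in for a hypothetical non-blocking call that returns `None` when a peer is not ready, and it is not an existing function in the gRPC layer.

```python
import time
from collections import deque


def gather_interleaved(peers, try_receive, timeout=5.0, idle_sleep=0.01):
    """Collect one message per peer without blocking on any single peer.

    try_receive(peer) is a hypothetical non-blocking call returning the
    message, or None if the peer is not ready (wrong round, too busy).
    """
    pending = deque(peers)
    results = {}
    deadline = time.monotonic() + timeout
    misses = 0  # consecutive peers that were not ready

    while pending:
        if time.monotonic() > deadline:
            raise TimeoutError(f"no message from: {list(pending)}")
        peer = pending.popleft()
        msg = try_receive(peer)
        if msg is None:
            pending.append(peer)  # move on, come back to this peer later
            misses += 1
            if misses >= len(pending):
                # A full pass made no progress; back off briefly.
                time.sleep(idle_sleep)
                misses = 0
        else:
            results[peer] = msg
            misses = 0
    return results
```

A slow peer then delays only its own message, not the messages of the peers queued behind it.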
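
For the last two items, one possible shape is a round tracker on the polled node that holds a request until the condition is met, with blocking chosen per call. `RoundTracker` and its method names are made up for this sketch, and it uses an in-process `threading.Condition` rather than the actual gRPC handlers.

```python
import threading


class RoundTracker:
    """Hypothetical per-node round state with condition-based waiting."""

    def __init__(self):
        self._round = 0
        self._cond = threading.Condition()

    def advance(self):
        # Called by the node itself when it finishes a round.
        with self._cond:
            self._round += 1
            self._cond.notify_all()  # wake every waiting request

    def get_round(self, at_least=0, blocking=True, timeout=None):
        """Return the current round once it is >= at_least.

        With blocking=False, or when the timeout expires, returns None
        if the condition is not met. The choice is made per call.
        """
        with self._cond:
            if blocking:
                ok = self._cond.wait_for(
                    lambda: self._round >= at_least, timeout=timeout
                )
            else:
                ok = self._round >= at_least
            return self._round if ok else None
```

A caller that needs round 3 would issue a single `get_round(at_least=3)` instead of polling in a loop, while a caller that cannot afford to wait passes `blocking=False`.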

Feel free to add things to this list as a comment on this issue.

rishi-s8 · 2024-10-18T20:54:44Z
