
RelaySGD

Implementation of the decentralized learning algorithm RelaySGD [1] inside Bagua [2] for my Bachelor's thesis.
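
Briefly, RelaySGD performs a local SGD step on each worker and then averages the models over a tree topology by relaying running sums of models along the edges; contributions that have not arrived at a worker yet are substituted with its own model. Below is a minimal single-process NumPy sketch of that averaging step on a chain, for intuition only; all names are mine, and this sketches the mechanism from the paper [1], not the Bagua integration.

import numpy as np

# Hypothetical sketch: one RelaySGD step on a chain of n workers,
# simulated in a single process. Not the Bagua implementation.
def relay_sgd_step(x, grads, msgs, counts, lr):
    """x, grads: (n, d) arrays; msgs/counts: state per directed edge (i, j)."""
    n = len(x)
    x_half = x - lr * grads  # local SGD step on every worker

    new_msgs, new_counts = {}, {}
    for i in range(n):
        for j in (i - 1, i + 1):  # chain neighbors of worker i
            if 0 <= j < n:
                # Relay to j: own model plus everything received from the other side.
                others = [k for k in (i - 1, i + 1) if 0 <= k < n and k != j]
                new_msgs[(i, j)] = x_half[i] + sum(msgs.get((k, i), 0) for k in others)
                new_counts[(i, j)] = 1 + sum(counts.get((k, i), 0) for k in others)

    x_next = np.empty_like(x)
    for i in range(n):
        nbrs = [k for k in (i - 1, i + 1) if 0 <= k < n]
        m = sum(new_msgs[(k, i)] for k in nbrs)    # relayed model sums
        c = sum(new_counts[(k, i)] for k in nbrs)  # how many models they cover
        # Models that have not been relayed here yet count as the own model.
        x_next[i] = (x_half[i] + m + (n - 1 - c) * x_half[i]) / n
    return x_next, new_msgs, new_counts

# Usage: 4 workers with 3-dimensional models and zero gradients.
x = np.random.randn(4, 3)
x, msgs, counts = relay_sgd_step(x, np.zeros_like(x), {}, {}, lr=0.1)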

With Bagua installed, you can run the benchmark with:

python3 -m bagua.distributed.launch --nproc_per_node=<number of gpus> benchmark.py --algorithm relay

You can also pass additional parameters:

python3 -m bagua.distributed.launch --nproc_per_node=<number of gpus> benchmark.py --algorithm relay --lr <learning rate> --alpha <data heterogeneity parameter> --topology <relay topology e.g. chain>
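
For example, to run on 8 GPUs with a chain topology (the learning rate and alpha values here are only illustrative):

python3 -m bagua.distributed.launch --nproc_per_node=8 benchmark.py --algorithm relay --lr 0.1 --alpha 0.1 --topology chain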

Experiment Evaluation

The logs folder contains the output of all the runs.

To tune the hyperparameters, modify and run the scripts hpt_relay.sh and hpt_rest.sh (sketched below). The output is saved in the logs folder as summary*.txt. The final_run.sh script executes the experiment shown below using the best learning rates on 8 GPUs.
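
Such a sweep could look roughly like the following Python sketch; the actual shell scripts may differ, and the learning-rate grid and log file names here are hypothetical:

import subprocess

# Hypothetical learning-rate sweep; hpt_relay.sh may differ in detail.
for lr in [0.01, 0.05, 0.1, 0.5]:
    with open(f"logs/summary_relay_lr{lr}.txt", "w") as log:
        subprocess.run(
            ["python3", "-m", "bagua.distributed.launch",
             "--nproc_per_node=8", "benchmark.py",
             "--algorithm", "relay", "--lr", str(lr)],
            stdout=log, stderr=subprocess.STDOUT, check=True)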

The second experiment evaluates the throughput of the different algorithms (synth_benchmark_run.sh).
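
Throughput here means training samples processed per second. A measurement of that kind can be as simple as the following sketch, which is illustrative only and not taken from the actual script; the model and batch size are stand-ins:

import time
import torch

# Hypothetical throughput measurement on synthetic data.
model = torch.nn.Linear(1024, 10)     # stand-in for the real model
data = torch.randn(64, 1024)          # one synthetic batch
target = torch.randint(0, 10, (64,))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

steps, start = 100, time.time()
for _ in range(steps):
    opt.zero_grad()
    torch.nn.functional.cross_entropy(model(data), target).backward()
    opt.step()
print(f"throughput: {steps * 64 / (time.time() - start):.1f} samples/s")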

CIFAR10 - VGG11

Comparing the decentralized algorithm in Bagua with RelaySGD

RelaySGD vs Allreduce

Comparing different topologies of RelaySGD

Footnotes

  1. https://doi.org/10.48550/arXiv.2110.04175

  2. https://github.com/BaguaSys/bagua/tree/master
