
Roadmap #77

Open · 36 of 51 tasks
justheuristic opened this issue Jul 12, 2020 · 2 comments
Labels: discussion, enhancement (New feature or request)

justheuristic (Member) commented Jul 12, 2020

This is a global project roadmap that states our priorities for the near future. These priorities can and should be disputed here or elsewhere, after which we will update the roadmap.

v0.7 "It runs something" (released)

  • convert internal hivemind code to an open-source library
  • switch from dmueller/kademlia to an internal DHT implementation
  • run proof-of-concept experiments

v0.8 "It runs at scale" (released)

  • Ensure DHT scales to 1000+ nodes
    • Switch from rpcudp to gRPC (due to scalability issues)
  • Optimize hivemind.Server for a large number of small tensors
  • Implement parallel backward in RemoteMixtureOfExperts
  • Add benchmarks for MoE and DHT performance
  • Publish to PyPI

v0.9 "It trains something" (released)

v0.10 "You can train with us" (released)

v1.0 "most of the code makes sense without reading the source" (nov-dec)

  • overhaul optimizers
  • update quickstart.md
    • use the new optimizer instead of DecentralizedSGD (see the hivemind.Optimizer sketch after this list)
  • overhaul DHT benchmark
  • add Optimizer benchmark
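
As a rough illustration of the "overhaul optimizers" and quickstart items above, here is a minimal sketch of what a quickstart-style training loop built around hivemind.Optimizer (rather than DecentralizedSGD) might look like. The run_id, batch sizes, and toy model are illustrative assumptions rather than values taken from this roadmap; the updated quickstart.md remains the authoritative example.

```python
import torch
import hivemind

# Join (or, with no initial_peers, start) a DHT that peers use to find each other.
dht = hivemind.DHT(start=True)

model = torch.nn.Linear(64, 10)  # toy model, stands in for the real one

# hivemind.Optimizer wraps a regular torch optimizer and coordinates
# averaging with other peers under a shared run_id.
opt = hivemind.Optimizer(
    dht=dht,
    run_id="quickstart_demo",        # illustrative run name, shared by all peers
    batch_size_per_step=32,          # samples this peer processes per step()
    target_batch_size=10_000,        # global batch size that triggers an averaging round
    optimizer=torch.optim.Adam(model.parameters(), lr=1e-3),
    use_local_updates=True,          # apply local updates between averaging rounds
    verbose=True,
)

for _ in range(10):
    loss = model(torch.randn(32, 64)).pow(2).mean()  # dummy objective
    loss.backward()
    opt.step()
    opt.zero_grad()
```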

v1.1 "You can set up collaborative training easily"

Target scenario: 100 volunteers training a 2xl-like model over the internet

  • add more examples
    • at least one should include a setup guide
  • additional tutorial with computer vision (dino, imagenet, dalle?)
  • Do something about the number of open files
    • investigate what contributes to # open files
    • is there a (cheap) way to reduce that to at most 4096 without compromising performance? (see the open-files sketch after this list)
  • Support training with only client and aux peers
    • (A) ensure that aux peers can download state from clients or
    • (B) add an option for an aux peer to act as a normal peer with batch size = 0
  • more extreme compression: powerSGD variant(s)
  • investigate QUIC at scale
    • test hole punching
    • make sure our config fully supports relays
  • Remove duplicate CI runs
  • Add warnings to typical failure modes
  • Deprecate CollaborativeOptimizer & co
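
For the open-files item above, a cheap first step is simply measuring and, where allowed, raising the per-process descriptor limit from inside the peer. This is a minimal sketch using the standard-library resource module; it only adjusts the process limit and says nothing about which hivemind components actually hold the descriptors (that is the investigation itself). The 4096 target mirrors the number in the item and is otherwise arbitrary.

```python
import resource

# Current file-descriptor limits for this process: (soft, hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# Raise the soft limit toward 4096 (or the hard limit, whichever is lower).
target = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
if soft < target:
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    print(f"raised soft limit to {target}")
```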

v1.2 Decentralized model parallelism

Target scenario: 500 peers training a 1B+ parameter model over the internet

  • libp2p in hivemind.server
  • proper LoadBalancedExpert
  • hivemind.Optimizer in hivemind.server
  • FP16 in hivemind.server

Important, but not urgent

  • more extreme compression: some way to integrate BNB directly
  • Security: option to use CenteredClip (see the CenteredClip sketch after this list)
  • Some means of saving expert snapshots in a fault-tolerant way (Distributed expert snapshots #94)
  • Some means for storing the training data (a-la scientific torrents)
  • enhanced API of hivemind.Optimizer (extract all necessary methods of StateAverager/ProgressTracker)
  • moshpit + elasticity
  • alternative linear programming variants
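
For reference on the CenteredClip item above, here is an illustrative sketch of the aggregation rule as described in the robust-aggregation literature (Karimireddy et al., 2021): the estimate is repeatedly pulled toward every peer's update, but each contribution is clipped, so no single peer can drag the average arbitrarily far. This is not hivemind's implementation, only a toy version of the rule itself.

```python
import torch

def centered_clip(updates: torch.Tensor, tau: float = 1.0, n_iters: int = 5) -> torch.Tensor:
    """Aggregate peer updates of shape (n_peers, dim) with CenteredClip."""
    v = updates.mean(dim=0)  # start from the plain average
    for _ in range(n_iters):
        diffs = updates - v
        norms = diffs.norm(dim=1, keepdim=True).clamp_min(1e-12)
        # Each peer's pull on the estimate is clipped to norm <= tau.
        v = v + (diffs * torch.clamp(tau / norms, max=1.0)).mean(dim=0)
    return v

# Toy usage: nine honest peers plus one extreme outlier whose influence stays bounded.
honest = torch.randn(9, 4) * 0.1
outlier = torch.full((1, 4), 100.0)
print(centered_clip(torch.cat([honest, outlier]), tau=1.0))
```
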
@louis030195 commented:

This sounds like an interesting project. I like the idea of decentralized computing, a bit like what cryptocurrencies do, but for general computing rather than currency, since computing can be quite expensive.
In my mind, it would look a bit like Kubernetes but decentralized, without any security issues regarding access to others' hardware, and probably based on trading compute for a kind of currency (which would still be cheaper than today's centralized computing clouds).

About the roadmap: I see exciting technical details, but I don't see how people will find each other and decide to share their resources toward a common goal.
Is there any plan to develop a UI or something like that?

Example: Bob and Alice each want to train a 200B-parameter GPT-3, but each of them can only afford half the training cost. With this awesome UI, they could see that they share a common goal and team up.

@borzunov (Member) commented:

Hi @louis030195!

> probably based on trading compute for a kind of currency

Yeah, there are a couple of projects related to this idea: vast.ai provides a service for users to lease/rent each other's GPUs, and BitTensor (cc @unconst) is built around a cryptocurrency serving as an incentive for people who help train models with their GPUs.

Currently, hivemind doesn't involve any financial incentives: we assume that volunteers are motivated by having access to the training outcome and by recognition in the leaderboard. However, if time shows that financial motivation is crucial, hivemind may serve as a backend for BitTensor nodes :)

> I don't see how people will find each other and decide to share their resources toward a common goal.
> Is there any plan to develop a UI or something like that?

For now, we expect this to work as follows:

  • Initial collaborators find each other using our Discord or social media
  • They discuss their model/dataset choices and write code responsible for the model and dataset streaming
  • They create a page explaining to other people how to join, and advertise it (e.g., on social media)
  • People follow the instructions on that page and join using their own GPUs or free cloud services like Google Colab

An example of such a page is our demo where we train a DALL-E-like model.

However, I definitely agree that our project will benefit from a centralized UI where a new user can see all planned/ongoing training runs and join the ones they consider interesting :)
