
Roadmap #77

Open · 36 of 51 tasks
justheuristic opened this issue Jul 12, 2020 · 2 comments
Labels: discussion, enhancement (New feature or request)

justheuristic (Member) commented Jul 12, 2020

This is a global project roadmap that states our priorities for the near future. These priorities can and should be disputed here or elsewhere, after which we will update the roadmap.

v0.7 "It runs something" (released)

  • convert internal hivemind code to an open-source library
  • switch from dmueller/kademlia to an internal DHT implementation
  • run proof-of-concept experiments

v0.8 "It runs at scale" (released)

  • Ensure DHT scales to 1000+ nodes
    • Switch from rpcudp to gRPC (due to scalability issues)
  • Optimize hivemind.Server for a large number of small tensors
  • Implement parallel backward in RemoteMixtureOfExperts
  • Add benchmarks for MoE and DHT performance
  • Publish to PyPI

v0.9 "It trains something" (released)

v0.10 "You can train with us" (released)

v1.0 "most of the code makes sense without reading the source" (nov-dec)

  • overhaul optimizers
  • update quickstart.md
    • use the new optimizer instead of DecentralizedSGD (see the hivemind.Optimizer sketch after this list)
  • overhaul DHT benchmark
  • add Optimizer benchmark
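
As a rough illustration of the "overhaul optimizers" and quickstart items above, here is a minimal sketch of what a quickstart-style training loop built around hivemind.Optimizer (rather than DecentralizedSGD) might look like. The run_id, batch sizes, and toy model are illustrative assumptions rather than values taken from this roadmap; the updated quickstart.md remains the authoritative example.

```python
import torch
import hivemind

# Join (or, with no initial_peers, start) a DHT that peers use to find each other.
dht = hivemind.DHT(start=True)

model = torch.nn.Linear(64, 10)  # toy model, stands in for the real one

# hivemind.Optimizer wraps a regular torch optimizer and coordinates
# averaging with other peers under a shared run_id.
opt = hivemind.Optimizer(
    dht=dht,
    run_id="quickstart_demo",        # illustrative run name, shared by all peers
    batch_size_per_step=32,          # samples this peer processes per step()
    target_batch_size=10_000,        # global batch size that triggers an averaging round
    optimizer=torch.optim.Adam(model.parameters(), lr=1e-3),
    use_local_updates=True,          # apply local updates between averaging rounds
    verbose=True,
)

for _ in range(10):
    loss = model(torch.randn(32, 64)).pow(2).mean()  # dummy objective
    loss.backward()
    opt.step()
    opt.zero_grad()
```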

v1.1 "You can set up collaborative training easily"

Target scenario: 100 volunteers training a 2xl-like model over the internet

  • add more examples
    • at least one should include a setup guide
  • additional tutorial with computer vision (dino, imagenet, dalle?)
  • Do something about the number of open files
    • investigate what contributes to # open files
    • is there a (cheap) way to reduce that to at most 4096 without compromising performance? (see the open-files sketch after this list)
  • Support training with only client and aux peers
    • (A) ensure that aux peers can download state from clients or
    • (B) add an option for an aux peer to act as a normal peer with batch size = 0
  • more extreme compression: powerSGD variant(s)
  • investigate QUIC at scale
    • test hole punching
    • make sure our config fully supports relays
  • Remove duplicate CI runs
  • Add warnings to typical failure modes
  • Deprecate CollaborativeOptimizer & co
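
For the open-files item above, a cheap first step is simply measuring and, where allowed, raising the per-process descriptor limit from inside the peer. This is a minimal sketch using the standard-library resource module; it only adjusts the process limit and says nothing about which hivemind components actually hold the descriptors (that is the investigation itself). The 4096 target mirrors the number in the item and is otherwise arbitrary.

```python
import resource

# Current file-descriptor limits for this process: (soft, hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# Raise the soft limit toward 4096 (or the hard limit, whichever is lower).
target = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
if soft < target:
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    print(f"raised soft limit to {target}")
```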

v1.2 Decentralized model parallelism

Target scenario: 500 peers training a 1B+ parameter model over the internet

  • libp2p in hivemind.server
  • proper LoadBalancedExpert
  • hivemind.Optimizer in hivemind.server
  • FP16 in hivemind.server

Important, but not urgent

  • more extreme compression: some way to integrate BNB directly
  • Security: option to use CenteredClip (see the CenteredClip sketch after this list)
  • Some means of saving expert snapshots in a fault-tolerant way (Distributed expert snapshots #94)
  • Some means for storing the training data (a-la scientific torrents)
  • enhanced API of hivemind.Optimizer (extract all necessary methods of StateAverager/ProgressTracker)
  • moshpit + elasticity
  • alternative linear programming variants
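
For reference on the CenteredClip item above, here is an illustrative sketch of the aggregation rule as described in the robust-aggregation literature (Karimireddy et al., 2021): the estimate is repeatedly pulled toward every peer's update, but each contribution is clipped, so no single peer can drag the average arbitrarily far. This is not hivemind's implementation, only a toy version of the rule itself.

```python
import torch

def centered_clip(updates: torch.Tensor, tau: float = 1.0, n_iters: int = 5) -> torch.Tensor:
    """Aggregate peer updates of shape (n_peers, dim) with CenteredClip."""
    v = updates.mean(dim=0)  # start from the plain average
    for _ in range(n_iters):
        diffs = updates - v
        norms = diffs.norm(dim=1, keepdim=True).clamp_min(1e-12)
        # Each peer's pull on the estimate is clipped to norm <= tau.
        v = v + (diffs * torch.clamp(tau / norms, max=1.0)).mean(dim=0)
    return v

# Toy usage: nine honest peers plus one extreme outlier whose influence stays bounded.
honest = torch.randn(9, 4) * 0.1
outlier = torch.full((1, 4), 100.0)
print(centered_clip(torch.cat([honest, outlier]), tau=1.0))
```
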
@louis030195 commented:

This sounds like an interesting project. I like the idea of decentralized computing, a bit like what cryptocurrencies do, but for general computing rather than currency, since computing can be quite expensive.
In my mind, it would look a bit like Kubernetes but decentralized, without any security issues regarding access to others' hardware, and probably based on trading compute for a kind of currency (which would still be cheaper than today's centralized computing clouds).

About the roadmap: I see exciting technical details, but I don't see how people will find each other and decide to share their resources toward a common goal.
Is there any plan to develop a UI or something like that?

Example: Bob and Alice each want to train a 200B-parameter GPT-3, but each of them can only afford half the training cost. With this awesome UI, they could see that they share a common goal and team up.

@borzunov (Member) commented:

Hi @louis030195!

> probably based on trading compute for a kind of currency

Yeah, there are a couple of projects related to this idea: vast.ai provides a service for users to lease/rent each other's GPUs, and BitTensor (cc @unconst) is built around a cryptocurrency serving as an incentive for people who help train models with their GPUs.

Currently, hivemind doesn't involve any financial incentives: we assume that volunteers are motivated by having access to the training outcome and by recognition in the leaderboard. However, if time shows that financial motivation is crucial, hivemind may serve as a backend for BitTensor nodes :)

> I don't see how people will find each other and decide to share their resources toward a common goal.
> Is there any plan to develop a UI or something like that?

For now, we expect this to work as follows:

  • Initial collaborators find each other using our Discord or social media
  • They discuss their model/dataset choices and write code responsible for the model and dataset streaming
  • They create a page explaining to other people how to join, and advertise it (e.g., on social media)
  • People follow the instructions on that page and join using their own GPUs or free cloud services like Google Colab

An example of such a page is our demo where we train a DALL-E-like model.

However, I definitely agree that our project will benefit from a centralized UI where a new user can see all planned/ongoing training runs and join the ones they consider interesting :)
