Betha is a tribute to Alpa
While Alpa is the nuclear weapon for large model training and serving, Betha is a toy example of model-parallel training of a simple GPT-J implementation using manual sharding and Ray Core.
model.py contains the GPT-J model broken up into stages. shard.py is the Ray actor wrapper that passes activations and gradients between nodes.
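The shape of such a wrapper looks roughly like the sketch below. This is not the actual shard.py API, just an illustration of the idea: each actor holds one slice of the model, runs its forward pass on activations received from the previous stage, and hands gradients back upstream during the backward pass. The names ShardActor, forward, and backward are hypothetical.

```python
# Hypothetical sketch of a pipeline-stage actor; not the actual shard.py API.
import ray
import torch

@ray.remote(num_gpus=1)
class ShardActor:
    """Holds one contiguous slice of GPT-J layers and runs its forward/backward pass."""

    def __init__(self, stage: torch.nn.Module, lr: float = 1e-4):
        self.stage = stage.cuda()
        self.optimizer = torch.optim.Adam(self.stage.parameters(), lr=lr)
        self.inputs = None
        self.outputs = None

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # Keep the incoming activations so the backward pass can
        # propagate gradients to the upstream shard.
        self.inputs = activations.cuda().requires_grad_(True)
        self.outputs = self.stage(self.inputs)
        return self.outputs.detach().cpu()

    def backward(self, grad_outputs: torch.Tensor) -> torch.Tensor:
        # Backprop the gradient arriving from the next stage, update this
        # shard's weights, and return the gradient w.r.t. our inputs.
        self.outputs.backward(grad_outputs.cuda())
        self.optimizer.step()
        self.optimizer.zero_grad()
        return self.inputs.grad.cpu()

# Driver loop (illustrative): chain the forward passes, then walk backward.
# stages = [ShardActor.remote(block) for block in gptj_stages]
# acts = batch
# for actor in stages:
#     acts = ray.get(actor.forward.remote(acts))
# grad = loss_grad
# for actor in reversed(stages):
#     grad = ray.get(actor.backward.remote(grad))
```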
To run: python train.py --model_dir=<cached HF GPT-J model>