This example illustrates how to use pytorch-elastic Helm chart to pre-train BERT on Glue MRPC dataset with Accelerate library.
Before proceeding, complete the Prerequisites and Getting started. See What is in the YAML file to understand the common fields in the Helm values files. There are some fields that are specific to a machine learning chart.
Following variables are implicitly defined by the pytorch-elastic Helm chart for use with Torchrun elastic launch:
PET_NNODES
: Maps tonnodes
PET_NPROC_PER_NODE
: Maps tonproc_per_node
PET_RDZV_ID
: Maps tordzv_id
PET_RDZV_ENDPOINT
: Maps tordzv_endpoint
The pre-training Helm values are defined in pretrain.yaml.
To launch pre-training, execute:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
helm install --debug accel-bert \
./charts/machine-learning/training/pytorchjob-elastic/ \
-f ./examples/accelerate/bert-glue-mrpc/pretrain.yaml -n kubeflow-user-example-com
You can tail the logs using following command:
kubectl logs -f pytorchjob-accel-bert-worker-0 -n kubeflow-user-example-com
To uninstall the Helm chart for pre-training job, execute:
helm uninstall accel-bert -n kubeflow-user-example-com
To access the output stored on EFS and FSx for Lustre file-systems, execute following commands:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
kubectl apply -f eks-cluster/utils/attach-pvc.yaml -n kubeflow
kubectl exec -it -n kubeflow attach-pvc -- /bin/bash
This will put you in a pod attached to the EFS and FSx for Lustre file-systems, mounted at /efs
, and /fsx
, respectively. Type exit
to exit the pod.
Pre-training logs
are available in /efs/home/bert-glue-mrpc/logs
folder.
Pre-training checkpoints
, if any, are available in /fsx/home/bert-glue-mrpc/checkpoints
folder.
Any content stored under /fsx
is automatically backed up to your configured S3 bucket.