Pre-train BERT on the GLUE MRPC dataset using the Accelerate library

This example illustrates how to use the pytorch-elastic Helm chart to pre-train BERT on the GLUE MRPC dataset with the Accelerate library.

Before proceeding, complete the Prerequisites and Getting started steps. See "What is in the YAML file" to understand the common fields in the Helm values files. Some fields are specific to a machine learning chart.

Implicitly defined environment variables

The following environment variables are implicitly defined by the pytorch-elastic Helm chart for use with torchrun elastic launch:

  1. PET_NNODES: Maps to nnodes
  2. PET_NPROC_PER_NODE: Maps to nproc_per_node
  3. PET_RDZV_ID: Maps to rdzv_id
  4. PET_RDZV_ENDPOINT: Maps to rdzv_endpoint
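
As a rough illustration of this mapping, the sketch below prints the explicit torchrun command line that corresponds to the four variables. The default values and the train.py script name are hypothetical placeholders, not values taken from the chart. Inside the job the chart defines the variables and torchrun picks them up from the environment, so the flags do not need to be passed by hand.

```shell
#!/usr/bin/env bash
# Hypothetical defaults for local illustration only; in the cluster the
# pytorch-elastic chart defines these variables for you.
: "${PET_NNODES:=2}"
: "${PET_NPROC_PER_NODE:=8}"
: "${PET_RDZV_ID:=accel-bert}"
: "${PET_RDZV_ENDPOINT:=rdzv-host.example:29400}"

# Print (do not run) the torchrun invocation with each variable spelled
# out as the flag it maps to. Extra arguments are appended unchanged.
show_torchrun_cmd() {
  printf 'torchrun --nnodes=%s --nproc_per_node=%s --rdzv_id=%s --rdzv_endpoint=%s %s\n' \
    "$PET_NNODES" "$PET_NPROC_PER_NODE" "$PET_RDZV_ID" "$PET_RDZV_ENDPOINT" "$*"
}

show_torchrun_cmd train.py   # train.py is a placeholder script name
```

Printing the command instead of running it keeps the sketch safe to try on a machine without PyTorch installed.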

Launch pre-training

The pre-training Helm values are defined in pretrain.yaml.

To launch pre-training, execute:

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
helm install --debug accel-bert \
    ./charts/machine-learning/training/pytorchjob-elastic/ \
    -f ./examples/accelerate/bert-glue-mrpc/pretrain.yaml -n kubeflow-user-example-com

You can tail the logs using the following command:

kubectl logs -f pytorchjob-accel-bert-worker-0 -n kubeflow-user-example-com
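
The log command fails while the worker pod is still starting, so it can help to wait for the pod first. The following is a minimal sketch, assuming the pod object has already been created (kubectl wait returns an error for a pod that does not exist yet). The function name wait_and_tail and the 600s default timeout are illustrative choices, not part of the chart.

```shell
#!/usr/bin/env bash
# Namespace and pod name as used in this example.
NAMESPACE="kubeflow-user-example-com"
POD="pytorchjob-accel-bert-worker-0"

# Block until the pod reports Ready (or the timeout expires), then follow
# its logs. The logs are only tailed if the wait succeeds.
wait_and_tail() {
  local timeout="${1:-600s}"   # arbitrary default; adjust as needed
  kubectl wait --for=condition=Ready "pod/${POD}" -n "${NAMESPACE}" --timeout="${timeout}" \
    && kubectl logs -f "${POD}" -n "${NAMESPACE}"
}
```

Call it as wait_and_tail, or pass a different timeout such as wait_and_tail 300s.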

To uninstall the Helm chart for the pre-training job, execute:

helm uninstall accel-bert -n kubeflow-user-example-com

Output

To access the output stored on the EFS and FSx for Lustre file systems, execute the following commands:

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
kubectl apply -f eks-cluster/utils/attach-pvc.yaml -n kubeflow
kubectl exec -it -n kubeflow attach-pvc -- /bin/bash

This puts you in a pod attached to the EFS and FSx for Lustre file systems, mounted at /efs and /fsx, respectively. Type exit to leave the pod.

Logs

Pre-training logs are available in the /efs/home/bert-glue-mrpc/logs folder.

Checkpoints

Pre-training checkpoints, if any, are available in the /fsx/home/bert-glue-mrpc/checkpoints folder.
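
From inside the attach-pvc pod, a small helper can show the newest entry in the logs and checkpoints folders. This is a sketch only: latest_entry is a hypothetical helper name, and it makes no assumption about how the log or checkpoint files are named, relying on modification time alone.

```shell
#!/usr/bin/env bash
# Print the most recently modified entry in a directory, or nothing if the
# directory is empty or does not exist.
latest_entry() {
  local dir="$1"
  [ -d "$dir" ] || return 0
  ls -t "$dir" | head -n 1
}

# Folders from this example; run inside the attach-pvc pod.
latest_entry /efs/home/bert-glue-mrpc/logs
latest_entry /fsx/home/bert-glue-mrpc/checkpoints
```

If the job has not written any checkpoints, the second call prints nothing.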

S3 Backup

Any content stored under /fsx is automatically backed up to your configured S3 bucket.