In this example, we walk through the steps required to adapt your PyTorch code to train a BERT Machine Learning (ML) model with Hugging Face on an Amazon EC2 instance powered by AWS Trainium chips.
This repository provides code examples for:
- Training a BERT ML model with PyTorch and Hugging Face
- Code: single Neuron Core
- Notebook: notebook single Neuron Core
- Distributed training of a BERT ML model with PyTorch and Hugging Face
- Instance Image: Deep Learning AMI Neuron PyTorch 1.11
- Instance Type: trn1.32xlarge
- Git installed on the EC2 instance
git --version
source /opt/aws_neuron_venv_pytorch/bin/activate
neuron-ls
neuron-top
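The checks above can be bundled into a small preflight script. This is a minimal sketch, not part of the repository: the helper name `check_tool` is hypothetical, and it only verifies that `git`, `neuron-ls`, and `neuron-top` are on the `PATH` once the environment has been activated.

```shell
# Hypothetical helper (not from the repository): report whether a tool is on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Count how many of the required tools are missing instead of stopping at the first.
missing=0
for tool in git neuron-ls neuron-top; do
  check_tool "$tool" || missing=$((missing + 1))
done
echo "tools missing: $missing"
```

If the count is not zero, activating the environment and checking the instance image are reasonable first things to look at.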
Activate the pre-built PyTorch environment
Test the code using the provided notebook, or run the training script directly:
cd examples/01-trainium-single-core
python3 train.py
Activate the pre-built PyTorch environment
Test the code by running the distributed training script:
cd examples/02-trainium-distributed-training
export TOKENIZERS_PARALLELISM=false
torchrun --nproc_per_node=32 train.py
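The value `32` passed to `--nproc_per_node` corresponds to one worker process per NeuronCore. The sketch below shows how that number can be derived, assuming a trn1.32xlarge with 16 Trainium devices of 2 NeuronCores each. The variable names are illustrative, and the launch command is only printed, not executed.

```shell
# Assumption: trn1.32xlarge exposes 16 Trainium devices, each with 2 NeuronCores.
DEVICES=16
CORES_PER_DEVICE=2

# One torchrun worker per NeuronCore.
NPROC=$((DEVICES * CORES_PER_DEVICE))

# Same setting as in the steps above; disables tokenizer-level parallelism in the workers.
export TOKENIZERS_PARALLELISM=false

# Build the launch command and print it instead of running it.
LAUNCH_CMD="torchrun --nproc_per_node=${NPROC} train.py"
echo "$LAUNCH_CMD"
```

On a different instance size, `DEVICES` would need to match the device count reported by `neuron-ls`.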
- Flush Neuron Cores
sudo rmmod neuron; sudo modprobe neuron
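Reloading the kernel module affects every process using the Neuron device, so it can help to preview the commands first. The following is a minimal sketch with a hypothetical wrapper `flush_neuron_cores` and a hypothetical `DRY_RUN` variable. By default it only prints the two commands shown above.

```shell
# Hypothetical wrapper (not from the repository) around the two commands above.
# DRY_RUN defaults to 1, which prints the commands; set DRY_RUN=0 to execute them.
flush_neuron_cores() {
  run="echo"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    run=""
  fi
  $run sudo rmmod neuron
  $run sudo modprobe neuron
}

flush_neuron_cores
```

Running `DRY_RUN=0 flush_neuron_cores` would perform the actual reload.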