This tutorial is a companion to the Mask R-CNN distributed training blog. It shows how to run distributed training of the TensorPack Mask/Faster-RCNN and AWS Mask-RCNN models using the Kubeflow MPI Operator on Amazon Elastic Kubernetes Service (EKS).
Before proceeding, complete the Prerequisites and Getting started.
This tutorial has the following steps:
- Upload COCO 2017 training dataset to your Amazon S3 bucket
- Use Helm charts to launch training jobs in the EKS cluster
- Use Jupyter notebook to test the trained model
To download the COCO 2017 dataset to your build environment instance and upload it to your Amazon S3 bucket, replace `S3_BUCKET` with your bucket name and run the following command:
```bash
./eks-cluster/utils/prepare-s3-bucket.sh S3_BUCKET
```
Note: In the script above, by default, data is uploaded under a top-level S3 folder named `ml-platform`. This folder is used in the `import_path` Terraform variable in the section Use Terraform to create infrastructure. It is not recommended that you change this top-level folder name. If you must change it, do a project-wide search for `ml-platform` and replace it with your folder name in the various YAML files.
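To verify the upload, you can list a sample of the uploaded objects with the AWS CLI; a quick check, assuming the default `ml-platform` top-level folder (replace `S3_BUCKET` with your bucket name):

```bash
# List the first few objects uploaded under the default top-level folder
aws s3 ls s3://S3_BUCKET/ml-platform/ --recursive | head
```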
You have two Helm charts available for training Mask-RCNN models. Both Helm charts use the same Kubernetes namespace, which, by default, is set to `kubeflow`.
To train the TensorPack Mask-RCNN model, install the `maskrcnn` chart by executing the following commands:
```bash
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
helm install --debug maskrcnn ./charts/machine-learning/training/maskrcnn/
```
By default, the training job uses the `p3dn.24xlarge` instance type. If you want to use the `p3.16xlarge` instance type instead, use the following command:
```bash
helm install --debug maskrcnn ./charts/machine-learning/training/maskrcnn/ \
  --set maskrcnn.gpu_instance_type=p3.16xlarge --set maskrcnn.tf_device_min_sys_mem_mb=2560
```
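While the node group scales up from zero, you can watch the launcher and worker pods come up in the `kubeflow` namespace; for example:

```bash
# Watch pod status until the launcher and worker pods reach Running state (Ctrl-C to stop)
kubectl get pods -n kubeflow -w
```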
To train the AWS Mask-RCNN optimized model, install the `maskrcnn-optimized` chart by executing the following commands:
```bash
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
helm install --debug maskrcnn-optimized ./charts/machine-learning/training/maskrcnn-optimized/
```
By default, the training job uses the `p3dn.24xlarge` instance type, with a per-GPU batch size of 4. If you want to use the `p3.16xlarge` instance type instead, use the following command:
```bash
helm install --debug maskrcnn-optimized ./charts/machine-learning/training/maskrcnn-optimized/ \
  --set maskrcnn.gpu_instance_type=p3.16xlarge --set maskrcnn.tf_device_min_sys_mem_mb=2560 \
  --set maskrcnn.batch_size_per_gpu=2
```
Note, this solution uses EKS autoscaling to automatically scale up (from zero nodes) and scale down (to zero nodes) the EKS managed node group used for training. So, if your training node group currently has zero nodes, it may take several minutes (or longer, if GPU capacity is transiently unavailable) for the GPU nodes to be `Ready` and for the training pods to reach the `Running` state. During this time, the `maskrcnn-launcher-xxxxx` pod may crash and restart automatically several times; this is nominal behavior. Once `maskrcnn-launcher-xxxxx` is in the `Running` state, replace `xxxxx` with your launcher pod suffix below and execute:
```bash
kubectl logs -f maskrcnn-launcher-xxxxx -n kubeflow
```
This will show the live training log from the launcher pod.
Model checkpoints and all training logs are also available on the `shared_fs` file system set in `values.yaml`, i.e. `fsx` (default) or `efs`. For `fsx` (default), access your training logs as follows:
```bash
kubectl apply -f eks-cluster/utils/attach-pvc.yaml -n kubeflow
kubectl exec -it -n kubeflow attach-pvc -- /bin/bash
cd /fsx
ls -ltr maskrcnn-*
```
Or, for `efs`:

```bash
cd /efs
ls -ltr maskrcnn-*
```
Type `exit` to exit from the `attach-pvc` pod.
When training is complete, you may uninstall an installed chart by executing `helm uninstall chart-name`, for example, `helm uninstall maskrcnn`. The logs and trained models will be preserved on the shared file system used for training.
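If you are unsure which releases are currently installed, you can list them before uninstalling; for example:

```bash
# List all installed Helm releases across all namespaces
helm list --all-namespaces
```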
To password-protect TensorBoard, generate a password hash for the `tensorboard` user using the command below:
```bash
htpasswd -c .htpasswd tensorboard
```
Copy the generated password hash for `tensorboard` from the `.htpasswd` file and save it for use in the steps below. Finally, remove the generated password hash file:

```bash
rm .htpasswd
```
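Alternatively, if your `htpasswd` build supports the `-n` and `-b` flags, you can print the hash to stdout without creating a file; a sketch:

```bash
# -n prints the result to stdout instead of updating a file; -b takes the password
# from the command line; cut keeps only the hash after the "tensorboard:" prefix
htpasswd -nb tensorboard 'your-password' | cut -d: -f2
```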
To test the TensorPack Mask-RCNN model, customize `values.yaml` as described below:
- Use AWS check ip to get the public IP of your web browser client (see the sketch following this list). Use this public IP to set `global.source_cidr` as a `/32` CIDR. This will restrict Internet access to the Jupyter notebook and TensorBoard services to your public IP.
- Set `global.log_dir` to the relative path of your training log directory, for example, `maskrcnn-XXXX-XX-XX-XX-XX-XX`.
- Set the generated password hash for `tensorboard` as a quoted MD5 hash, as shown in the example below:

  ```
  htpasswd: "your-generated-password-hash"
  ```
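For the first step, you can obtain your client's public IP from the AWS check ip service and form the `/32` CIDR; a sketch:

```bash
# Fetch your public IP from the AWS check ip endpoint and append /32
echo "$(curl -s https://checkip.amazonaws.com)/32"
```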
To install the `maskrcnn-jupyter` chart, execute:
```bash
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
helm install --debug maskrcnn-jupyter ./charts/machine-learning/testing/maskrcnn-jupyter/
```
Execute `kubectl logs -f maskrcnn-jupyter-xxxxx -n kubeflow -c jupyter` to display the Jupyter log. At the beginning of the Jupyter log, note the security token required to access the Jupyter service in a browser.
Execute `kubectl get service maskrcnn-jupyter -n kubeflow` to get the service DNS address. To test the trained model using a Jupyter notebook, access the service in a browser on port 443 using the service DNS address and the security token. Your URL to access the Jupyter service should look similar to the example below (the DNS address and token shown are placeholders):
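```
https://xxxxxxxxxxxxxxxxxxxxxxxxx.elb.xx-xxxx-x.amazonaws.com/?token=your-security-token
```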
Because the service endpoint in this tutorial uses a self-signed certificate, accessing the Jupyter service in a browser will display a browser warning. If you deem it appropriate, proceed to access the service. Open the notebook and run it to test the trained model. Note, there may not be any trained model checkpoint available at a given time while training is in progress.
To access TensorBoard on the web, use the service DNS address noted above. Your URL to access the TensorBoard service should look similar to the example below:
```
https://xxxxxxxxxxxxxxxxxxxxxxxxx.elb.xx-xxxx-x.amazonaws.com:6443/
```
Accessing the TensorBoard service in a browser will display a browser warning, because the service endpoint uses a self-signed certificate. If you deem it appropriate, proceed to access the service. When prompted for authentication, use the default username `tensorboard` and your password.
To test the AWS Mask-RCNN model, customize the `values.yaml` file, following the three steps shown for the TensorPack Mask-RCNN model. Note, the `log_dir` will be different, for example, `maskrcnn-optimized-XXXX-XX-XX-XX-XX-XX`.
To install the `maskrcnn-optimized-jupyter` chart, execute:
```bash
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
helm install --debug maskrcnn-optimized-jupyter ./charts/machine-learning/testing/maskrcnn-optimized-jupyter/
```
Execute `kubectl logs -f maskrcnn-optimized-jupyter-xxxxx -n kubeflow -c jupyter` to display the Jupyter log. At the beginning of the Jupyter log, note the security token required to access the Jupyter service in a browser.
Execute `kubectl get service maskrcnn-optimized-jupyter -n kubeflow` to get the service DNS address. The rest of the steps are the same as for the TensorPack Mask-RCNN model.
When testing is complete, you may uninstall an installed chart by executing `helm uninstall chart-name`, for example, `helm uninstall maskrcnn-jupyter` or `helm uninstall maskrcnn-optimized-jupyter`.
The COCO 2017 training data used in this tutorial is automatically imported from the `S3_BUCKET` to the FSx for Lustre file system. However, if you want to use the EFS file system as the source for your training data, you need to customize the `S3_BUCKET` variable in `stage-data.yaml` and run the following command:
```bash
kubectl apply -f eks-cluster/utils/stage-data.yaml -n kubeflow
```
Execute `kubectl get pods -n kubeflow` to check the status of the staging pod. Once the status of the pod is marked `Completed`, the data has been successfully staged on EFS.
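If you want to follow the staging progress, you can tail the staging pod's logs; a sketch, assuming the pod created by `stage-data.yaml` is named `stage-data` (verify the actual pod name with `kubectl get pods -n kubeflow`):

```bash
# Tail logs from the staging pod (pod name assumed; confirm it first)
kubectl logs -f stage-data -n kubeflow
```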