GPU and MPI Hands-On Session
Let's start by running a simple GPU job on Kubernetes. We will use the `nvidia-smi` command to check the GPUs available on the node.
```yaml
apiVersion: v1
kind: Pod
metadata:
  namespace: sc24
  generateName: sc24-gpu-
  labels:
    app: pod
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 100 & wait"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

Copy the above YAML to a file called gpu-pod.yaml and run the following command to create the pod:

```shell
kubectl apply -f gpu-pod.yaml
```

First, let's run a basic CPU MPI job on Kubernetes. We will use the MPI operator to run the job.
We will calculate the value of Pi using the Monte Carlo method. The code is written in C and is available in the mpi-pi directory.
```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  generateName: sc24-mpi-pi-
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
    ttlSecondsAfterFinished: 60
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/mpi-pi:openmpi
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
            command:
            - mpirun
            args:
            - -n
            - "2"
            - /home/mpiuser/pi
            resources:
              limits:
                memory: 16Gi
                cpu: 2
              requests:
                memory: 16Gi
                cpu: 2
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: mpioperator/mpi-pi:openmpi
            name: mpi-worker
            securityContext:
              runAsUser: 1000
            command:
            - /usr/sbin/sshd
            args:
            - -De
            - -f
            - /home/mpiuser/.sshd_config
            resources:
              limits:
                cpu: 1
                memory: 1Gi
```

Copy the above YAML to a file called mpi-pi.yaml and run the following command to create the job:

```shell
kubectl apply -f mpi-pi.yaml
```

Now let's run a multi-node GPU MPI job on Kubernetes. We will use the MPI operator to run the job.
```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  generateName: sc24-mpi-tensorflow-
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            - --model=resnet101
            - --batch_size=64
            - --variable_update=horovod
            resources:
              limits:
                memory: 16Gi
                cpu: 4
              requests:
                memory: 16Gi
                cpu: 4
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            resources:
              limits:
                nvidia.com/gpu: 1
```

Copy the above YAML to a file called mpi-tensorflow.yaml and run the following command to create the job:

```shell
kubectl apply -f mpi-tensorflow.yaml
```

Before you finish, please make sure you have not left any pods running. Jobs and their associated completed pods are OK.