Commit 40e1da1

Bump pytorch dockerfile template
Author: Verdi March
Parent: 091d536

2 files changed, 23 additions and 11 deletions

2.ami_and_containers/containers/pytorch/0.nvcr-pytorch-aws.dockerfile

Lines changed: 19 additions & 11 deletions
@@ -3,7 +3,8 @@
 
 ####################################################################################################
 # This is a sample Dockerfile, with optional stanzas. Please read through this Dockerfile,
-# understand what it does, then create your own Dockerfile.
+# understand what it does, then create your own Dockerfile. Software versions are provided for
+# illustration only.
 #
 # Sample build instructions:
 #
@@ -19,13 +20,13 @@
 # # Load image to local docker registry -> on head node, or new compute/build node.
 # docker load < /fsx/nvidia-pt-od__latest.tar
 ####################################################################################################
-FROM nvcr.io/nvidia/pytorch:23.12-py3
+FROM nvcr.io/nvidia/pytorch:24.03-py3
 ENV DEBIAN_FRONTEND=noninteractive
 
 # The three must-be-built packages.
 # Efa-installer>=1.29.1 required for nccl>=2.19.0 to avoid libfabric NCCL error.
-ENV EFA_INSTALLER_VERSION=1.30.0
-ENV AWS_OFI_NCCL_VERSION=1.8.1-aws
+ENV EFA_INSTALLER_VERSION=1.32.0
+ENV AWS_OFI_NCCL_VERSION=1.9.1-aws
 ENV NCCL_TESTS_VERSION=master
 
 ## Uncomment below when this Dockerfile builds a container image with efa-installer<1.29.1 and
## Uncomment below when this Dockerfile builds a container image with efa-installer<1.29.1 and
@@ -88,10 +89,13 @@ ENV PATH=/opt/amazon/efa/bin:/opt/amazon/openmpi/bin:$PATH
 # [CUSTOM_NCCL_OPTION_1] Uncomment below stanza to install another NCCL version using the official
 # binaries.
 #
+# Please consult https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html to
+# find out the prebuilt nccl version in the parent image.
+#
 # NCCL EFA plugin (aws-ofi-nccl) depends on mpi, hence we must rebuild openmpi before building the
 # aws-ofi-ccnl.
 ####################################################################################################
-#ENV NCCL_VERSION=2.19.3-1
+#ENV NCCL_VERSION=2.21.5-1
 #RUN cd /opt && \
 # wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb && \
 # dpkg -i cuda-keyring_1.0-1_all.deb && \
@@ -101,17 +105,21 @@ ENV PATH=/opt/amazon/efa/bin:/opt/amazon/openmpi/bin:$PATH
 
 
 ####################################################################################################
-# [CUSTOM_NCCL_OPTION_2] Install NCCL from source to the same location as the built-in ones. The
-# benefits of installing to the same location as the built-in version are:
+# [CUSTOM_NCCL_OPTION_2] Install NCCL from source to the same location as the built-in ones.
+#
+# Please consult https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html to
+# find out the prebuilt nccl version in the parent image.
+#
+# Installation mechanics:
 #
-# 1. There's only ever a single libnccl version offered by this image, preventing application from
-# mistakenly chooses a wrong version.
-# 2. No longer needing extra settings for LD_LIBRARY_PATH or LD_PRELOAD.
+# 1. Remove pre-installed nccl to ensure there's only ever a single libnccl version offered by this
+# image, preventing application from mistakenly chooses a wrong version.
+# 2. Install to default location, so no more extra settings for LD_LIBRARY_PATH or LD_PRELOAD.
 #
 # NCCL EFA plugin (aws-ofi-nccl) depends on mpi, hence we must rebuild openmpi before building the
 # aws-ofi-ccnl.
 ####################################################################################################
-ENV NCCL_VERSION=2.19.3-1
+ENV NCCL_VERSION=2.21.5-1
 RUN apt-get remove -y libnccl2 libnccl-dev \
 && cd /tmp \
 && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION} \
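
Both NCCL stanzas point to the NVIDIA release notes for the prebuilt NCCL version in the parent image. As a complementary check, the version PyTorch reports can be read from inside a running container. The sketch below assumes PyTorch is importable there and uses `torch.cuda.nccl.version()`; the helper names `format_nccl_version` and `installed_nccl_version` are hypothetical and not part of the repository.

```python
def format_nccl_version(version):
    """Join a version tuple such as (2, 21, 5) into the string '2.21.5'."""
    return ".".join(str(part) for part in version)


def installed_nccl_version():
    """Return the NCCL version reported by PyTorch, or None if unavailable.

    Assumption: run inside the container, where a CUDA build of PyTorch is installed.
    """
    try:
        import torch.cuda.nccl as nccl
        return format_nccl_version(nccl.version())
    except Exception:
        # PyTorch is missing or was built without NCCL support.
        return None


if __name__ == "__main__":
    print(installed_nccl_version())
```

The reported value describes the NCCL that PyTorch was built against, so it may differ from `NCCL_VERSION` in the Dockerfile, which carries a packaging suffix such as `-1`.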

2.ami_and_containers/containers/pytorch/README.md

Lines changed: 4 additions & 0 deletions
@@ -13,6 +13,10 @@ With that said, feel free to explore the example. Happy coding, and experimentin
 
 ## 1. Essential software
 
+Please note that software versions in the template are provided for illustration only. For
+well-tested combinations, please refer to the various Dockerfile files under `3.test_cases/` and
+`4.validation_and_observability/0.nccl_tests/`.
+
 In principle, the reference `Dockerfile` does the following:
 
 - Provide PyTorch built for NVidia CUDA devices, by using a recent NVidia PyTorch image as the
