This repository contains NVIDIA's submission to the MLPerf HPC v3.0 benchmark. It includes implementations of the benchmark code optimized for running on NVIDIA DGX H100 systems. The reference implementations are maintained separately at https://github.com/mlcommons/hpc.git
This README was updated in October 2023 for the v3.0 round of MLPerf HPC.
Each implementation in the `benchmarks` subdirectory provides the following:
- Code that implements the model in at least one framework.
- A Dockerfile which can be used to build a container for the benchmark.
- Documentation on the dataset, model, and machine setup.
These benchmarks have been tested on the following machine configuration:
- An NVIDIA DGX SuperPOD™ with NVIDIA DGX H100 servers, each with 8x 80GB H100 SXM GPUs.
- The required software stack includes Slurm for job scheduling, Enroot for running containers, and the Pyxis Slurm plugin for launching Enroot containers through Slurm (see the example below).
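As an illustrative check that Enroot and Pyxis are working on the cluster, a container can be launched directly through `srun`; the image tag below is only an example, not one required by these benchmarks:

```bash
# Pyxis adds the --container-image flag to srun; Enroot pulls the image and
# runs the command inside it on a single node (image name is illustrative).
srun -N 1 --container-image=nvcr.io/nvidia/pytorch:23.09-py3 nvidia-smi
```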
Generally, a benchmark can be run with the following steps:
- Follow the instructions in the README to download and format the input data and any required checkpoints.
- Build the container image from the benchmark's Dockerfile.
- Source the appropriate `config_*.sh` file.
- Submit the job with `sbatch -N $DGXNNODES -t $WALLTIME run.sub` (a sketch of the full sequence is shown after this list).
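The following is a minimal sketch of that sequence, assuming a benchmark directory containing a Dockerfile, a system configuration file such as `config_DGXH100.sh`, and a `run.sub` Slurm batch script; the registry name, config file name, and the `CONT` variable consumed by `run.sub` are assumptions here, and the exact names vary per benchmark (see each benchmark's README):

```bash
# Build the container image from the benchmark's Dockerfile and push it to a
# registry reachable from the compute nodes (registry/tag are illustrative).
docker build -t <registry>/mlperf-hpc-benchmark:latest .
docker push <registry>/mlperf-hpc-benchmark:latest

# Source the system-specific configuration; this exports variables such as
# DGXNNODES and WALLTIME that run.sub expects (config name is illustrative).
source config_DGXH100.sh

# Point the launch script at the container image (variable name assumed) and
# submit the job to Slurm.
export CONT=<registry>/mlperf-hpc-benchmark:latest
sbatch -N $DGXNNODES -t $WALLTIME run.sub
```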