# Argonne 2021 GPU Hackathon
For general instructions see the ThetaGPU documentation.
In particular, check the bottom of that page if you're using bash,
as there are special instructions.
- Log in to one of the GPU service nodes (`thetagpusn1` or `thetagpusn2`):

```bash
ssh thetagpusn1
```
- Put the following in your `~/.bashrc` (or similar) or execute it after each login:

```bash
# proxies so that Parthenon can be cloned from the internet (and other data fetched if required)
export http_proxy=http://proxy.tmi.alcf.anl.gov:3128
export https_proxy=http://proxy.tmi.alcf.anl.gov:3128
# a sufficiently recent CMake is currently not available from the module system, so we set the path manually
export PATH=/soft/buildtools/cmake/3.14.5/bin:$PATH
# Parthenon machine file that includes all required paths and options
export MACHINE_CFG=/grand/gpu_hack/parthenon/ref/ThetaGPU.cmake
```
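Optionally, you can verify that the proxy settings and the manual CMake path are picked up. The commands below are just a suggested sanity check (any outbound HTTPS access works as a proxy test):

```bash
# CMake should now report the version from /soft/buildtools/cmake/3.14.5
cmake --version
# cheap proxy test: list the remote HEAD of the Parthenon repository
git ls-remote https://github.com/lanl/parthenon.git HEAD
```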
- Compiling needs to be done on the DGX nodes, e.g., in an interactive session via

```bash
qsub -t 60 -n 1 -q single-gpu -A gpu_hack -I
```
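Once the interactive job starts you should be dropped onto a compute node. A minimal (optional) sanity check, with output details that will naturally differ per node:

```bash
hostname      # should now show a DGX compute node, not thetagpusn1/2
nvidia-smi    # should list the GPU(s) visible to the job
```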
- Get Parthenon

```bash
git clone https://github.com/lanl/parthenon.git
cd parthenon
git submodule init
git submodule update
```
- Build Parthenon

```bash
mkdir build
cd build
# builds for cuda with mpi (default)
cmake ..
# OR cuda and no mpi
cmake -DMACHINE_VARIANT=cuda ..
# OR host (gcc) with mpi
cmake -DMACHINE_VARIANT=mpi ..
# then compile (after choosing one of the configurations above; the parallel level is just a suggestion)
cmake --build . -j 16
```
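If you want to compare the GPU and host variants side by side (as in the table further down), one simple approach is a separate build directory per variant. This is only a suggested layout (directory names are placeholders), starting from the parthenon source root:

```bash
# one build tree per MACHINE_VARIANT, so both binaries stay available
mkdir -p build-cuda-mpi build-host-mpi
(cd build-cuda-mpi && cmake .. && cmake --build . -j 16)
(cd build-host-mpi && cmake -DMACHINE_VARIANT=mpi .. && cmake --build . -j 16)
```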
Two input files are located in the `/grand/gpu_hack/parthenon/ref/` folder:

- `parthinput.block32` with 32^3 blocks
- `parthinput.block16` with 16^3 blocks (adds more stress to the AMR part)
A sample output may look like

```
./example/advection/advection-example -i /grand/gpu_hack/parthenon/ref/parthinput.block16
...
cycle=3 time=8.7890624999999991e-04 dt=2.9296874999999999e-04 zone-cycles/wsec_step=1.81e+07 wsec_step=1.72e+00 zone-cycles/wsec=4.08e+06 wsec_AMR=5.92e+00
-------------- New Mesh structure after (de)refinement -------------
Root grid = 16 x 16 x 16 MeshBlocks
Total number of MeshBlocks = 7687
Number of physical refinement levels = 2
Number of logical refinement levels = 6
  Physical level = 0 (logical level = 4): 3753 MeshBlocks, cost = 3753
  Physical level = 1 (logical level = 5): 2574 MeshBlocks, cost = 2574
  Physical level = 2 (logical level = 6): 1360 MeshBlocks, cost = 1360
--------------------------------------------------------------------
cycle=4 time=1.1718750000000000e-03 dt=2.9296874999999999e-04 zone-cycles/wsec_step=1.80e+07 wsec_step=1.75e+00 zone-cycles/wsec=1.80e+07 wsec_AMR=4.92e-04
cycle=5 time=1.4648437500000000e-03 dt=2.9296874999999999e-04 zone-cycles/wsec_step=2.09e+07 wsec_step=1.50e+00 zone-cycles/wsec=2.09e+07 wsec_AMR=5.09e-04
Driver completed.
time=1.46e-03 cycle=5
tlim=1.00e+00 nlim=5
Number of MeshBlocks = 7687; 3591 created, 0 destroyed during this simulation.
```
The interesting/relevant information here is:
- the "New Mesh structure after (de)refinement" message, which indicates that load balancing and/or mesh refinement happened
- the performance per cycle, e.g.,

  ```
  cycle=3 time=8.7890624999999991e-04 dt=2.9296874999999999e-04 zone-cycles/wsec_step=1.81e+07 wsec_step=1.72e+00 zone-cycles/wsec=4.08e+06 wsec_AMR=5.92e+00
  ```

  where the last number shows how much time (in wall seconds) was spent just doing AMR/load balancing. In this example that is 5.92 seconds, compared to the 1.72 seconds (`wsec_step=1.72e+00`) required for the timestep itself (without load balancing/AMR). Note that the timestep number is artificially bad, given that not all parts of the test problem have been converted to the "pack of blocks" (`MeshBlockPack`) approach.
- Thus, the main goal (or one of the main goals) of the Hackathon is to reduce the `wsec_AMR` time (the `wsec_step` will automatically improve along the way with existing approaches); a small sketch for extracting these numbers from a run log follows below.
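A hypothetical post-processing snippet for pulling `wsec_step` and `wsec_AMR` out of a run log (the log file name is arbitrary and the field layout is taken from the sample output above; adjust as needed):

```bash
# run the example and keep the log
./example/advection/advection-example -i /grand/gpu_hack/parthenon/ref/parthinput.block16 | tee run.log
# print step and AMR wall times per cycle
grep '^cycle=' run.log | sed -e 's/.*wsec_step=\([^ ]*\).*wsec_AMR=\([^ ]*\).*/step=\1s  AMR=\2s/'
```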
For the config files provided above on ThetaGPU:

| Config | wsec_AMR | wsec_step | ratio |
|---|---|---|---|
| 1 GPU 16^3 | 5.92 | 1.72 | 3.44 |
| 1 Host core 16^3 | 1.1 | 2.0 | 0.55 |
| 1 GPU 32^3 | 2.20 | 0.89 | 2.47 |
| 1 Host core 32^3 | 1.57 | 3.25 | 0.48 |
Note that the ratio should be interpreted with care (there's little "computation" and lots of (host) "management" work in load balancing and refinement, so it's not a fair comparison).
You may also notice that the startup time on GPUs is significantly longer than on the host, which is likely related to the mesh initialization, as it also includes refinement and the creation of many blocks (and thus memory allocations).
- Build the connector (which allows named regions and kernel names to be properly shown in the profiler rather than the long names deduced from the templates)

```bash
git clone https://github.com/kokkos/kokkos-tools.git
cd kokkos-tools/profiling/nvprof-connector
export CUDA_ROOT=/usr/local/cuda
make
```

Now there should be a `kp_nvprof_connector.so` file.
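An optional quick check that the shared library was actually produced (purely a suggestion):

```bash
ls -l kp_nvprof_connector.so   # should exist and be non-empty
ldd kp_nvprof_connector.so     # lists the shared libraries it links against
```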
- Collect the data (using Nsight Systems, as `nvprof` and `nvvp` are deprecated)

```bash
# "enable" the Kokkos profiling tool
export KOKKOS_PROFILE_LIBRARY=/PATH/TO/kokkos-tools/profiling/nvprof-connector/kp_nvprof_connector.so
# collect the actual data
nsys profile -o my_profile ./example/advection/advection-example -i /grand/gpu_hack/parthenon/ref/parthinput.block16
```

Now there should be a `my_profile.qdrep` file (which should be copied to a system with a desktop environment).
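If you want a first look without leaving the node, recent versions of Nsight Systems can also print a text summary of the report (skip this if the subcommand is not available in the installed version):

```bash
# text summary of the collected profile (no GUI required)
nsys stats my_profile.qdrep
```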
- Analyze the data
  - Start the GUI (`nsys-ui`)
  - Go to Tools -> Options and set "Rename CUDA Kernels by NVTX" to "Yes" to get the Kokkos labels shown in the GUI.
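On the machine with the desktop environment, the copied report can typically be opened either via File -> Open in the GUI or directly from the command line (report name taken from the profiling step above; adjust the path to wherever you copied it):

```bash
# open the copied report in the Nsight Systems GUI
nsys-ui my_profile.qdrep &
```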