# Argonne 2021 GPU Hackathon
For general instructions see the ThetaGPU documentation.
In particular, check the bottom of that page if you're using bash,
as there are special instructions.
- Log in to one of the GPU service nodes (`thetagpusn1` or `thetagpusn2`):

```bash
ssh thetagpusn1
```
- Put the following in your `~/.bashrc` (or similar) or execute it after each login:

```bash
# proxies so that Parthenon can be cloned from the internet (and other data fetched if required)
export http_proxy=http://proxy.tmi.alcf.anl.gov:3128
export https_proxy=http://proxy.tmi.alcf.anl.gov:3128
# a sufficiently recent CMake is currently not available from the module system, so we set the path manually
export PATH=/soft/buildtools/cmake/3.14.5/bin:$PATH
# Parthenon machine file that includes all required paths and options
export MACHINE_CFG=/grand/gpu_hack/parthenon/ref/ThetaGPU.cmake
```
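Optionally, you can verify that the proxy settings and the manual CMake path are picked up. The commands below are just a suggested sanity check (any outbound HTTPS access works as a proxy test):

```bash
# CMake should now report the version from /soft/buildtools/cmake/3.14.5
cmake --version
# cheap proxy test: list the remote HEAD of the Parthenon repository
git ls-remote https://github.com/lanl/parthenon.git HEAD
```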
- Compiling needs to be done on the DGX nodes, e.g., in an interactive session via

```bash
qsub -t 60 -n 1 -q single-gpu -A gpu_hack -I
```
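Once the interactive job starts you should be dropped onto a compute node. A minimal (optional) sanity check, with output details that will naturally differ per node:

```bash
hostname      # should now show a DGX compute node, not thetagpusn1/2
nvidia-smi    # should list the GPU(s) visible to the job
```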
- Get Parthenon

```bash
git clone https://github.com/lanl/parthenon.git
cd parthenon
git submodule init
git submodule update
```
- Build Parthenon

```bash
mkdir build
cd build
# builds for cuda with mpi (default)
cmake ..
# OR cuda and no mpi
cmake -DMACHINE_VARIANT=cuda ..
# OR host (gcc) with mpi
cmake -DMACHINE_VARIANT=mpi ..
# then compile (after choosing one of the configurations above; the parallel level is just a suggestion)
cmake --build . -j 16
```
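If you want to compare the GPU and host variants side by side (as in the table further down), one simple approach is a separate build directory per variant. This is only a suggested layout (directory names are placeholders), starting from the parthenon source root:

```bash
# one build tree per MACHINE_VARIANT, so both binaries stay available
mkdir -p build-cuda-mpi build-host-mpi
(cd build-cuda-mpi && cmake .. && cmake --build . -j 16)
(cd build-host-mpi && cmake -DMACHINE_VARIANT=mpi .. && cmake --build . -j 16)
```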
Two input files are located in the `/grand/gpu_hack/parthenon/ref/` folder:

- `parthinput.block32` with 32^3 blocks
- `parthinput.block16` with 16^3 blocks (adds more stress to the AMR part)
A sample output may look like

```
./example/advection/advection-example -i /grand/gpu_hack/parthenon/ref/parthinput.block16
...
cycle=3 time=8.7890624999999991e-04 dt=2.9296874999999999e-04 zone-cycles/wsec_step=1.81e+07 wsec_step=1.72e+00 zone-cycles/wsec=4.08e+06 wsec_AMR=5.92e+00
-------------- New Mesh structure after (de)refinement -------------
Root grid = 16 x 16 x 16 MeshBlocks
Total number of MeshBlocks = 7687
Number of physical refinement levels = 2
Number of logical refinement levels = 6
  Physical level = 0 (logical level = 4): 3753 MeshBlocks, cost = 3753
  Physical level = 1 (logical level = 5): 2574 MeshBlocks, cost = 2574
  Physical level = 2 (logical level = 6): 1360 MeshBlocks, cost = 1360
--------------------------------------------------------------------
cycle=4 time=1.1718750000000000e-03 dt=2.9296874999999999e-04 zone-cycles/wsec_step=1.80e+07 wsec_step=1.75e+00 zone-cycles/wsec=1.80e+07 wsec_AMR=4.92e-04
cycle=5 time=1.4648437500000000e-03 dt=2.9296874999999999e-04 zone-cycles/wsec_step=2.09e+07 wsec_step=1.50e+00 zone-cycles/wsec=2.09e+07 wsec_AMR=5.09e-04
Driver completed.
time=1.46e-03 cycle=5
tlim=1.00e+00 nlim=5
Number of MeshBlocks = 7687; 3591 created, 0 destroyed during this simulation.
```
The interesting/relevant information here is:
- the "New Mesh structure after (de)refinement" message, which indicates that load balancing and/or mesh refinement happened
- the performance per cycle, e.g.,

  ```
  cycle=3 time=8.7890624999999991e-04 dt=2.9296874999999999e-04 zone-cycles/wsec_step=1.81e+07 wsec_step=1.72e+00 zone-cycles/wsec=4.08e+06 wsec_AMR=5.92e+00
  ```

  where the last number shows how much time (in wall seconds) was spent just doing AMR/load balancing. In this example that is 5.92 seconds, compared to the 1.72 seconds (`wsec_step=1.72e+00`) required for the timestep itself (without load balancing/AMR). Note that the timestep number is artificially bad, given that not all parts of the test problem have been converted to the "pack of blocks" (`MeshBlockPack`) approach.
- Thus, the main goal (or one of the main goals) of the Hackathon is to reduce the `wsec_AMR` time (the `wsec_step` will automatically improve along the way with existing approaches); a small sketch for extracting these numbers from a run log follows below.
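A hypothetical post-processing snippet for pulling `wsec_step` and `wsec_AMR` out of a run log (the log file name is arbitrary and the field layout is taken from the sample output above; adjust as needed):

```bash
# run the example and keep the log
./example/advection/advection-example -i /grand/gpu_hack/parthenon/ref/parthinput.block16 | tee run.log
# print step and AMR wall times per cycle
grep '^cycle=' run.log | sed -e 's/.*wsec_step=\([^ ]*\).*wsec_AMR=\([^ ]*\).*/step=\1s  AMR=\2s/'
```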
For the config files provided above on ThetaGPU:

| Config | wsec_AMR | wsec_step | ratio |
|---|---|---|---|
| 1 GPU 16^3 | 5.92 | 1.72 | 3.44 |
| 1 Host core 16^3 | 1.1 | 2.0 | 0.55 |
| 1 GPU 32^3 | 2.20 | 0.89 | 2.47 |
| 1 Host core 32^3 | 1.57 | 3.25 | 0.48 |
Note that the ratio should be interpreted with care (there's little "computation" and lots of (host) "management" work in load balancing and refinement, so it's not a fair comparison).
You may also notice that the startup time on GPUs is significantly longer than on the host, which is likely related to the mesh initialization, as it also includes refinement and the creation of many blocks (and thus memory allocations).
- Build the connector (which allows named regions and kernel names to be properly shown in the profiler rather than the long names deduced from the templates)

```bash
git clone https://github.com/kokkos/kokkos-tools.git
cd kokkos-tools/profiling/nvprof-connector
export CUDA_ROOT=/usr/local/cuda
make
```

Now there should be a `kp_nvprof_connector.so` file.
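An optional quick check that the shared library was actually produced (purely a suggestion):

```bash
ls -l kp_nvprof_connector.so   # should exist and be non-empty
ldd kp_nvprof_connector.so     # lists the shared libraries it links against
```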
- Collect the data (using Nsight Systems, as `nvprof` and `nvvp` are deprecated)

```bash
# "enable" the Kokkos profiling tool
export KOKKOS_PROFILE_LIBRARY=/PATH/TO/kokkos-tools/profiling/nvprof-connector/kp_nvprof_connector.so
# collect the actual data
nsys profile -o my_profile ./example/advection/advection-example -i /grand/gpu_hack/parthenon/ref/parthinput.block16
```

Now there should be a `my_profile.qdrep` file (which should be copied to a system with a desktop environment).
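If you want a first look without leaving the node, recent versions of Nsight Systems can also print a text summary of the report (skip this if the subcommand is not available in the installed version):

```bash
# text summary of the collected profile (no GUI required)
nsys stats my_profile.qdrep
```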
- Analyze the data
  - Start the GUI (`nsys-ui`)
  - Go to Tools -> Options and set "Rename CUDA Kernels by NVTX" to "Yes" to get the Kokkos labels shown in the GUI.
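On the machine with the desktop environment, the copied report can typically be opened either via File -> Open in the GUI or directly from the command line (report name taken from the profiling step above; adjust the path to wherever you copied it):

```bash
# open the copied report in the Nsight Systems GUI
nsys-ui my_profile.qdrep &
```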