Argonne 2021 GPU Hackathon
For general instructions, see the ThetaGPU documentation.
In particular, if you use bash, check the bottom of that document,
as there are special instructions.
- Log in to one of the GPU service nodes (`thetagpusn1` or `thetagpusn2`):
ssh thetagpusn1
- Put the following in your `~/.bashrc` (or similar), or execute it after each login.
# proxies so that you can clone Parthenon from the internet (and get other data if required)
export http_proxy=http://proxy.tmi.alcf.anl.gov:3128
export https_proxy=http://proxy.tmi.alcf.anl.gov:3128
# A more recent CMake version is currently not available from the module system, so we set the path manually
export PATH=/soft/buildtools/cmake/3.14.5/bin:$PATH
# Parthenon machine file that includes all required paths and options
export MACHINE_CFG=/grand/gpu_hack/parthenon/ref/ThetaGPU.cmake
- Compiling needs to be done on the DGX nodes, e.g., in an interactive session via
qsub -t 60 -n 1 -q single-gpu -A gpu_hack -I
- Get Parthenon
git clone https://github.com/lanl/parthenon.git
cd parthenon
git submodule init
git submodule update
- Build Parthenon
mkdir build
cd build
# builds for cuda with mpi (default)
cmake ..
make
# OR cuda and no mpi
cmake -DMACHINE_VARIANT=cuda ..
make
# OR host(gcc) with mpi
cmake -DMACHINE_VARIANT=mpi ..
make
Two input files are located in the `/grand/gpu_hack/parthenon/ref/` folder:
- `parthinput.block32` with 32^3 blocks
- `parthinput.block16` with 16^3 blocks (puts more stress on the AMR part)
A sample output may look like
./example/advection/advection-example -i /grand/gpu_hack/parthenon/ref/parthinput.block16
...
cycle=3 time=8.7890624999999991e-04 dt=2.9296874999999999e-04 zone-cycles/wsec_step=1.81e+07 wsec_step=1.72e+00 zone-cycles/wsec=4.08e+06 wsec_AMR=5.92e+00
-------------- New Mesh structure after (de)refinement -------------
Root grid = 16 x 16 x 16 MeshBlocks
Total number of MeshBlocks = 7687
Number of physical refinement levels = 2
Number of logical refinement levels = 6
Physical level = 0 (logical level = 4): 3753 MeshBlocks, cost = 3753
Physical level = 1 (logical level = 5): 2574 MeshBlocks, cost = 2574
Physical level = 2 (logical level = 6): 1360 MeshBlocks, cost = 1360
--------------------------------------------------------------------
cycle=4 time=1.1718750000000000e-03 dt=2.9296874999999999e-04 zone-cycles/wsec_step=1.80e+07 wsec_step=1.75e+00 zone-cycles/wsec=1.80e+07 wsec_AMR=4.92e-04
cycle=5 time=1.4648437500000000e-03 dt=2.9296874999999999e-04 zone-cycles/wsec_step=2.09e+07 wsec_step=1.50e+00 zone-cycles/wsec=2.09e+07 wsec_AMR=5.09e-04
Driver completed.
time=1.46e-03 cycle=5
tlim=1.00e+00 nlim=5
Number of MeshBlocks = 7687; 3591 created, 0 destroyed during this simulation.
The relevant information here is:
- The "New Mesh structure after (de)refinement" message indicates that load balancing and/or mesh refinement happened.
- The performance per cycle, e.g.,
cycle=3 time=8.7890624999999991e-04 dt=2.9296874999999999e-04 zone-cycles/wsec_step=1.81e+07 wsec_step=1.72e+00 zone-cycles/wsec=4.08e+06 wsec_AMR=5.92e+00
where the last number (`wsec_AMR`) shows how much time (in wall seconds) was spent just on AMR/load balancing. In this example that is 5.92 seconds, compared to the 1.72 seconds (`wsec_step=1.72e+00`) required for the timestep itself (without load balancing/AMR). Note that the timestep number is artificially bad because not all parts of the test problem have been converted to the "pack of blocks" (`MeshBlockPack`) approach.
- Thus, the main goal (or one of the main goals) of the Hackathon is to reduce the `wsec_AMR` time (`wsec_step` will automatically improve along the way with existing approaches).
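To track these two numbers over a longer run, the per-cycle lines can be split into fields. The following is a minimal sketch, assuming every performance line consists of whitespace-separated `key=value` tokens as in the sample above; `parse_cycle_line` and `amr_fraction` are hypothetical helper names, not part of Parthenon.

```python
def parse_cycle_line(line):
    """Split a 'key=value ...' performance line into a dict of floats."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that do not contain '='
            fields[key] = float(value)
    return fields


def amr_fraction(fields):
    """Share of the cycle's wall time spent in AMR/load balancing."""
    total = fields["wsec_step"] + fields["wsec_AMR"]
    return fields["wsec_AMR"] / total if total > 0 else 0.0


# The cycle=3 line from the sample output above
sample = (
    "cycle=3 time=8.7890624999999991e-04 dt=2.9296874999999999e-04 "
    "zone-cycles/wsec_step=1.81e+07 wsec_step=1.72e+00 "
    "zone-cycles/wsec=4.08e+06 wsec_AMR=5.92e+00"
)
fields = parse_cycle_line(sample)
print(
    f"wsec_step={fields['wsec_step']:.2f} s, "
    f"wsec_AMR={fields['wsec_AMR']:.2f} s, "
    f"AMR share={amr_fraction(fields):.0%}"
)
```

Applied to every `cycle=` line of a log, this makes it easy to see in which cycles refinement or load balancing dominates the wall time.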
For the config files provided above on ThetaGPU:
Config | wsec_AMR | wsec_step | ratio |
---|---|---|---|
1 GPU 16^3 | 5.92 | 1.72 | 3.44 |
1 Host core 16^3 | 1.1 | 2.0 | 0.55 |
1 GPU 32^3 | 2.20 | 0.89 | 2.47 |
1 Host core 32^3 | 1.57 | 3.25 | 0.48 |
Note that the ratio (`wsec_AMR` divided by `wsec_step`) should be interpreted with care: load balancing and refinement involve little "computation" and lots of (host) "management" tasks, so it is not a fair comparison.
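The ratio column can be reproduced from the two timing columns. This is a small sketch that only uses the numbers from the table; the dictionary layout and the function name `amr_to_step_ratio` are illustrative choices.

```python
# (wsec_AMR, wsec_step) in wall seconds, copied from the table above
timings = {
    "1 GPU 16^3": (5.92, 1.72),
    "1 Host core 16^3": (1.1, 2.0),
    "1 GPU 32^3": (2.20, 0.89),
    "1 Host core 32^3": (1.57, 3.25),
}


def amr_to_step_ratio(wsec_amr, wsec_step):
    """Wall time of AMR/load balancing relative to one timestep."""
    return wsec_amr / wsec_step


ratios = {cfg: round(amr_to_step_ratio(*t), 2) for cfg, t in timings.items()}
for cfg, ratio in ratios.items():
    print(f"{cfg:18s} ratio = {ratio:.2f}")
```

A ratio above 1 means that a single refinement/load-balancing phase costs more wall time than a full timestep, which is the case for both GPU runs.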
You may also notice that the startup time on GPUs is significantly longer than on the host only. This is likely also related to the mesh initialization, which includes refinement and the creation of many blocks (and thus many memory allocations).
- Build the connector (which allows named regions and kernel names to be properly shown in the profiler rather than the long names deduced from the templates)
git clone https://github.com/kokkos/kokkos-tools.git
cd kokkos-tools/profiling/nvprof-connector
export CUDA_ROOT=/usr/local/cuda
make
Now there should be a `kp_nvprof_connector.so` file.
- Collect the data (using Nsight Systems, as `nvprof` and `nvvp` are deprecated)
# "enable" the Kokkos profiling tool
export KOKKOS_PROFILE_LIBRARY=/PATH/TO/kokkos-tools/profiling/nvprof-connector/kp_nvprof_connector.so
# collect the actual data
nsys profile -o my_profile ./example/advection/advection-example -i /grand/gpu_hack/parthenon/ref/parthinput.block16
Now there should be a `my_profile.qdrep` file, which should be copied to a system with a desktop environment.
- Analyze the data
  - Start the GUI (`nsys-ui`)
  - Go to Tools - Options and set "Rename CUDA Kernels by NVTX" to "Yes" to get the Kokkos labels shown in the GUI.
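When several input files or builds are profiled, the collection step can be scripted. The sketch below only assembles the environment and the command line from the steps above; `profiling_setup` is a hypothetical helper, and the `/PATH/TO/...` placeholder has to be replaced with the actual location of the connector.

```python
import os
import shlex


def profiling_setup(connector_path, executable, input_file, report_name="my_profile"):
    """Return (env, argv) for running an executable under `nsys profile`."""
    env = dict(os.environ)
    # Same variable as exported in the collection step above
    env["KOKKOS_PROFILE_LIBRARY"] = connector_path
    argv = ["nsys", "profile", "-o", report_name, executable, "-i", input_file]
    return env, argv


env, argv = profiling_setup(
    "/PATH/TO/kokkos-tools/profiling/nvprof-connector/kp_nvprof_connector.so",
    "./example/advection/advection-example",
    "/grand/gpu_hack/parthenon/ref/parthinput.block16",
)
print(shlex.join(argv))
# On a node where nsys is available, the run could then be started with
# subprocess.run(argv, env=env, check=True).
```

Keeping the command as a list avoids shell quoting issues, and the returned environment leaves the calling shell untouched.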
<parthenon/job>
problem_id = advection
<parthenon/mesh>
refinement = adaptive
numlevel = 4
nx1 = 128
x1min = -1.50
x1max = 1.50
ix1_bc = periodic
ox1_bc = periodic
nx2 = 128
x2min = -1.50
x2max = 1.50
ix2_bc = periodic
ox2_bc = periodic
nx3 = 128
x3min = -1.50
x3max = 1.50
ix3_bc = periodic
ox3_bc = periodic
<parthenon/meshblock>
nx1 = 32
nx2 = 32
nx3 = 32
<parthenon/time>
tlim = 1.0
integrator = rk1
nlim = 10
perf_cycle_offset = 1
ncycle_out_mesh=-10
<Advection>
cfl = 0.30
vx = 1.0
vy = 1.0
vz = 1.0
profile = smooth_gaussian
ang_2 = 0.0
ang_3 = 0.0
ang_2_vert = false
ang_3_vert = false
amp = 1.0
refine_tol = 1.050 # control the package specific refinement tagging function
derefine_tol = 1.001
compute_error = false
num_vars = 10 # number of variables
vec_size = 1 # size of each variable
fill_derived = false # whether to fill one-copy test vars
buffer_send_pack = true # send all buffers using packs
buffer_recv_pack = true # receive buffers using packs
buffer_set_pack = true # set received buffers using packs
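The size of the root grid follows from the mesh and meshblock sizes in this file: 128 cells per direction divided into blocks of 32 cells gives 4 blocks per direction. The sketch below works this out with a minimal parser for the `<block>` / `key = value # comment` layout shown above. `parse_parthinput` and `root_grid` are hypothetical helpers, not Parthenon's own input parser, and the sketch assumes that the mesh size must be a multiple of the block size.

```python
def parse_parthinput(text):
    """Parse '<block>' headers and 'key = value  # comment' lines into nested dicts."""
    blocks, current = {}, None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith("<") and line.endswith(">"):
            current = blocks.setdefault(line[1:-1], {})
        elif "=" in line and current is not None:
            key, value = (part.strip() for part in line.split("=", 1))
            current[key] = value
    return blocks


def root_grid(blocks):
    """Number of root-level MeshBlocks per direction (mesh cells / block cells)."""
    mesh, meshblock = blocks["parthenon/mesh"], blocks["parthenon/meshblock"]
    dims = []
    for name in ("nx1", "nx2", "nx3"):
        cells, per_block = int(mesh[name]), int(meshblock[name])
        if cells % per_block:
            raise ValueError(f"{name}: {cells} is not a multiple of {per_block}")
        dims.append(cells // per_block)
    return tuple(dims)


# Excerpt of the input file above
sample = """
<parthenon/mesh>
refinement = adaptive
numlevel = 4
nx1 = 128
nx2 = 128
nx3 = 128
<parthenon/meshblock>
nx1 = 32
nx2 = 32
nx3 = 32
"""
blocks = parse_parthinput(sample)
print(root_grid(blocks))  # (4, 4, 4)
```

The result matches the "Root grid = 4 x 4 x 4 MeshBlocks" line printed by the code for this input.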
- A run with the input file above should use about 14 GB of memory.
- The output should look like:
#Variables in use:
# Package: parthenon::resolved_state
# ---------------------------------------------------
# Variables:
# Name Metadata flags
# ---------------------------------------------------
advected_4 Provides,Cell,Independent,FillGhost
advected_5 Provides,Cell,Independent,FillGhost
advected_6 Provides,Cell,Independent,FillGhost
advected_3 Provides,Cell,Independent,FillGhost
advected_1 Provides,Cell,Independent,FillGhost
advected_8 Provides,Cell,Independent,FillGhost
advected_7 Provides,Cell,Independent,FillGhost
advected_9 Provides,Cell,Independent,FillGhost
advected Provides,Cell,Independent,FillGhost
advected_2 Provides,Cell,Independent,FillGhost
# ---------------------------------------------------
# Sparse Variables:
# Name sparse id Metadata flags
# ---------------------------------------------------
# ---------------------------------------------------
# Swarms:
# Swarm Value metadata
# ---------------------------------------------------
### Warning in Mesh::Initialize
The number of MeshBlocks increased more than twice during initialization.
More computing power than you expected may be required.
### Warning in Mesh::Initialize
The number of MeshBlocks increased more than twice during initialization.
More computing power than you expected may be required.
### Warning in Mesh::Initialize
The number of MeshBlocks increased more than twice during initialization.
More computing power than you expected may be required.
Setup complete, executing driver...
cycle=0 time=0.0000000000000000e+00 dt=8.7890624999999991e-04 zone-cycles/wsec_step=0.00e+00 wsec_step=5.40e-03 zone-cycles/wsec=0.00e+00 wsec_AMR=0.00e+00
---------------------- Current Mesh structure ----------------------
Root grid = 4 x 4 x 4 MeshBlocks
Total number of MeshBlocks = 232
Number of physical refinement levels = 3
Number of logical refinement levels = 5
Physical level = 0 (logical level = 2): 56 MeshBlocks, cost = 56
Physical level = 1 (logical level = 3): 56 MeshBlocks, cost = 56
Physical level = 2 (logical level = 4): 56 MeshBlocks, cost = 56
Physical level = 3 (logical level = 5): 64 MeshBlocks, cost = 64
--------------------------------------------------------------------
cycle=1 time=8.7890624999999991e-04 dt=8.7890624999999991e-04 zone-cycles/wsec_step=0.00e+00 wsec_step=7.78e-01 zone-cycles/wsec=0.00e+00 wsec_AMR=4.65e-06
cycle=2 time=1.7578124999999998e-03 dt=8.7890624999999991e-04 zone-cycles/wsec_step=2.05e+07 wsec_step=3.71e-01 zone-cycles/wsec=2.05e+07 wsec_AMR=3.87e-06
cycle=3 time=2.6367187499999997e-03 dt=8.7890624999999991e-04 zone-cycles/wsec_step=2.05e+07 wsec_step=3.70e-01 zone-cycles/wsec=2.05e+07 wsec_AMR=3.49e-06
cycle=4 time=3.5156249999999997e-03 dt=8.7890624999999991e-04 zone-cycles/wsec_step=2.06e+07 wsec_step=3.70e-01 zone-cycles/wsec=2.06e+07 wsec_AMR=3.61e-06
cycle=5 time=4.3945312500000000e-03 dt=8.7890624999999991e-04 zone-cycles/wsec_step=2.06e+07 wsec_step=3.68e-01 zone-cycles/wsec=2.06e+07 wsec_AMR=3.54e-06
cycle=6 time=5.2734375000000003e-03 dt=8.7890624999999991e-04 zone-cycles/wsec_step=2.06e+07 wsec_step=3.69e-01 zone-cycles/wsec=2.06e+07 wsec_AMR=3.63e-06
cycle=7 time=6.1523437500000007e-03 dt=8.7890624999999991e-04 zone-cycles/wsec_step=2.05e+07 wsec_step=3.70e-01 zone-cycles/wsec=2.34e+06 wsec_AMR=2.88e+00
-------------- New Mesh structure after (de)refinement -------------
Root grid = 4 x 4 x 4 MeshBlocks
Total number of MeshBlocks = 484
Number of physical refinement levels = 3
Number of logical refinement levels = 5
Physical level = 0 (logical level = 2): 44 MeshBlocks, cost = 44
Physical level = 1 (logical level = 3): 140 MeshBlocks, cost = 140
Physical level = 2 (logical level = 4): 140 MeshBlocks, cost = 140
Physical level = 3 (logical level = 5): 160 MeshBlocks, cost = 160
--------------------------------------------------------------------
cycle=8 time=7.0312500000000010e-03 dt=8.7890624999999991e-04 zone-cycles/wsec_step=1.18e+07 wsec_step=1.35e+00 zone-cycles/wsec=1.18e+07 wsec_AMR=1.11e-05
cycle=9 time=7.9101562500000014e-03 dt=8.7890624999999991e-04 zone-cycles/wsec_step=1.96e+07 wsec_step=8.08e-01 zone-cycles/wsec=1.96e+07 wsec_AMR=9.08e-06
cycle=10 time=8.7890625000000017e-03 dt=8.7890624999999991e-04 zone-cycles/wsec_step=1.97e+07 wsec_step=8.05e-01 zone-cycles/wsec=1.97e+07 wsec_AMR=8.55e-06
---------------------- Current Mesh structure ----------------------
Root grid = 4 x 4 x 4 MeshBlocks
Total number of MeshBlocks = 484
Number of physical refinement levels = 3
Number of logical refinement levels = 5
Physical level = 0 (logical level = 2): 44 MeshBlocks, cost = 44
Physical level = 1 (logical level = 3): 140 MeshBlocks, cost = 140
Physical level = 2 (logical level = 4): 140 MeshBlocks, cost = 140
Physical level = 3 (logical level = 5): 160 MeshBlocks, cost = 160
--------------------------------------------------------------------
Driver completed.
time=8.79e-03 cycle=10
tlim=1.00e+00 nlim=10
Number of MeshBlocks = 484; 420 created, 0 destroyed during this simulation.
walltime used = 8.06e+00
zone-cycles/wallsecond = 1.16e+07
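The throughput numbers in this log can be cross-checked by hand. The printed values are consistent with `zone-cycles/wsec_step` being the total number of zones (number of MeshBlocks times 32^3 cells per block) divided by `wsec_step`, and with `zone-cycles/wsec` additionally including `wsec_AMR` in the denominator. This relation is inferred from the numbers above, not taken from the Parthenon source, and `zone_cycles_per_second` is a hypothetical helper.

```python
def zone_cycles_per_second(num_blocks, block_cells, wsec_step, wsec_amr=0.0):
    """Zones updated per wall second; pass wsec_amr to include AMR time."""
    zones = num_blocks * block_cells
    return zones / (wsec_step + wsec_amr)


block_cells = 32 ** 3  # nx1 * nx2 * nx3 of one MeshBlock

# cycle=2: 232 blocks, wsec_step=3.71e-01
step_only = zone_cycles_per_second(232, block_cells, 3.71e-01)
# cycle=7: 232 blocks during the step, wsec_step=3.70e-01, wsec_AMR=2.88e+00
with_amr = zone_cycles_per_second(232, block_cells, 3.70e-01, 2.88e+00)
# cycle=9: 484 blocks after refinement, wsec_step=8.08e-01
refined = zone_cycles_per_second(484, block_cells, 8.08e-01)

print(f"{step_only:.2e} {with_amr:.2e} {refined:.2e}")
# prints "2.05e+07 2.34e+06 1.96e+07", matching the log lines for these cycles
```

The cycle=7 value shows how a single refinement phase reduces the effective throughput of that cycle by roughly an order of magnitude.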