
TACC Open Hackathon 2024


Some notes for organizing our efforts.

!!! We need at least three people attending every day

Agenda (all times US Central)

  • Tues Oct 8 10 AM – 11:30 AM online
    • Meet with mentor
  • Tues Oct 15 9 AM – 5 PM online
    • Cluster intro
    • Introductory team presentations
    • Work with mentor
  • Tues Oct 22 – Thurs Oct 24 9 AM – 5 PM hybrid
    • Work on code with mentor

Our Goals

Primary

Improve MPI scaling for Parthenon applications with many separately enrolled fields

Baseline input for fine_advection in https://github.com/parthenon-hpc-lab/parthenon/pull/1103:

# ========================================================================================
#  (C) (or copyright) 2020-2024. Triad National Security, LLC. All rights reserved.
#
#  This program was produced under U.S. Government contract 89233218CNA000001 for Los
#  Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC
#  for the U.S. Department of Energy/National Nuclear Security Administration. All rights
#  in the program are reserved by Triad National Security, LLC, and the U.S. Department
#  of Energy/National Nuclear Security Administration. The Government is granted for
#  itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide
#  license in this material to reproduce, prepare derivative works, distribute copies to
#  the public, perform publicly and display publicly, and to permit others to do so.
# ========================================================================================

<parthenon/job>
problem_id = advection

<parthenon/mesh>
refinement = none
#refinement = adaptive
#numlevel = 3

nx1 = 96
x1min = -0.5
x1max = 0.5
ix1_bc = periodic
ox1_bc = periodic

nx2 = 96
x2min = -0.5
x2max = 0.5
ix2_bc = periodic
ox2_bc = periodic

nx3 = 96
x3min = -0.5
x3max = 0.5
ix3_bc = periodic
ox3_bc = periodic

<parthenon/meshblock>
#nx1 = 96
#nx2 = 96
#nx3 = 96
# 27 meshblocks/GPU shows much worse scaling for the fragmented communication case
nx1 = 32
nx2 = 32
nx3 = 32

<parthenon/time>
nlim = 100
tlim = 100.0
integrator = rk2
ncycle_out_mesh = -10000

<Advection>
cfl = 0.45
vx = 1.0
vy = 1.0
vz = 1.0
profile = hard_sphere

# "Fragmented" mode
shape_size = 1
sparse_size = 10
# "Coalesced" mode
#shape_size = 10
#sparse_size = 1
do_regular_advection = true
do_fine_advection = false
do_CT_advection = false
alloc_threshold = 0.0
dealloc_threshold = 0.0

refine_tol = 0.3    # controls the package-specific refinement tagging function
derefine_tol = 0.03
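
To compare the two communication patterns, the same deck can be run with the mode parameters toggled on the command line (Parthenon accepts block/name=value overrides after -i; the binary and input paths below are placeholders for the fine_advection build tree):

EXE=/path/to/build/example/fine_advection/fine-advection-example
INPUT=/path/to/source/parthenon/example/fine_advection/parthinput.fine_advection

# "Fragmented" mode: many small, separately enrolled fields
ibrun $EXE -i $INPUT Advection/shape_size=1 Advection/sparse_size=10

# "Coalesced" mode: the same data packed into fewer, larger fields
ibrun $EXE -i $INPUT Advection/shape_size=10 Advection/sparse_size=1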

Secondary

Improve buffer kernel performance for runs with few (large) blocks

Sample input (using the plain advection example):

<parthenon/job>
problem_id = advection

<parthenon/mesh>
refinement = none

nx1 = 256
x1min = -0.5
x1max = 0.5
ix1_bc = periodic
ox1_bc = periodic

nx2 = 256
x2min = -0.5
x2max = 0.5
ix2_bc = periodic
ox2_bc = periodic

nx3 = 256
x3min = -0.5
x3max = 0.5
ix3_bc = periodic
ox3_bc = periodic

<parthenon/meshblock>
nx1 = 128
nx2 = 128
nx3 = 128

<parthenon/time>
nlim = 25
tlim = 1.0
integrator = rk2
ncycle_out_mesh = -10000

<Advection>
cfl = 0.45
vx = 1.0
vy = 1.0
vz = 1.0
profile = hard_sphere

refine_tol = 0.3    # controls the package-specific refinement tagging function
derefine_tol = 0.03
compute_error = false
num_vars = 1 # number of variables
vec_size = 10 # size of each variable
fill_derived = false # whether to fill one-copy test vars

Current performance

Sample performance on a single GH200 (the input above, run with meshblock sizes of 64³, 128³, and 256³):

nb64.out:|-> 6.62e-02 sec 3.6% 100.0% 0.0% ------ 51 boundary_communication.cpp::96::SendBoundBufs [for]
nb128.out:|-> 1.44e-01 sec 11.0% 100.0% 0.0% ------ 51 boundary_communication.cpp::96::SendBoundBufs [for]
nb256.out:|-> 5.45e-01 sec 25.9% 100.0% 0.0% ------ 51 boundary_communication.cpp::96::SendBoundBufs [for]
nb64.out:|-> 8.81e-02 sec 4.8% 100.0% 0.0% ------ 51 boundary_communication.cpp::274::SetBounds [for]
nb128.out:|-> 1.69e-01 sec 12.9% 100.0% 0.0% ------ 51 boundary_communication.cpp::274::SetBounds [for]
nb256.out:|-> 6.44e-01 sec 30.6% 100.0% 0.0% ------ 51 boundary_communication.cpp::274::SetBounds [for]
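
The format of these lines matches the Kokkos Tools simple kernel timer. A sketch of how to regenerate them, assuming kokkos-tools has been cloned and built separately (all paths are placeholders):

export KOKKOS_TOOLS_LIBS=/path/to/kokkos-tools/kp_kernel_timer.so
ibrun /path/to/build/example/advection/advection-example -i parthinput.advection
# each run writes a *.dat profile per process; summarize it with the reader,
# then grep the kernels of interest across the per-block-size outputs
/path/to/kokkos-tools/kp_reader *.dat > nb64.out
grep -E "SendBoundBufs|SetBounds" nb*.out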

Diagnose (and improve?) particle efficiency at scale

  • Example problem: particles-example
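
A possible launch for a quick check (binary and input paths assume the standard build tree from the Getting Started section below):

ibrun /path/to/build/example/particles/particles-example -i /path/to/source/parthenon/example/particles/parthinput.particles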

Multigrid performance

  • Example problem:

NCCL/RCCL evaluation

  • This would be a heavy lift to fully implement

  • Example problem:

CUDA asynchronous memory copies

  • Example problem:

Team

Ben Ryan

  • Secondary goal interests
    • Particle scaling

Luke

  • Secondary goal interests
    • Multigrid parallel performance

Philipp

  • Secondary goal interests
Improve buffer kernel performance for runs with few (large) blocks

Patrick

  • Secondary goal interests

Alex

  • Particles
  • NCCL

Nirmal

  • Secondary goal interests

Ben Prather

  • Secondary goal interests
    • Single-meshblock bottlenecks
    • Interface for downstreams to add CUDA async copies?

Jonah

  • Secondary goal interests

Getting Started on Vista

User guide: https://docs.tacc.utexas.edu/hpc/vista/

To log in:

ssh [tacc username]@vista.tacc.utexas.edu
[enter your TACC password]
[enter your TACC 2FA pin]

To get to your scratch space (the purge policy should not be a concern for the duration of this hackathon):

cd $SCRATCH

To get parthenon:

git clone https://github.com/parthenon-hpc-lab/parthenon.git
cd parthenon
git submodule update --init --recursive

To set up python for your user account:

module load phdf5
pip install --user numpy h5py
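
A quick sanity check that both packages are visible to python:

python3 -c "import numpy, h5py; print(numpy.__version__, h5py.__version__)"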

Two-hour interactive job on a Grace-Hopper node:

idev -p gh -N 1 -n 1 -m 120

To load the environment:

module load nvidia/24.9
module load openmpi/5.0.5_nvc249
module load phdf5

Configure the code (from a separate build directory):

export NVCC_WRAPPER_DEFAULT_COMPILER=mpicxx
cmake -DKokkos_ENABLE_CUDA=ON \
  -DPARTHENON_DISABLE_HDF5_COMPRESSION=ON \
  -DKokkos_ARCH_HOPPER90=On \
  -DCMAKE_CXX_COMPILER=/path/to/source/parthenon/external/Kokkos/bin/nvcc_wrapper \
  -DCMAKE_C_COMPILER=mpicc \
  /path/to/source/parthenon
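
Then build (example binaries land under example/ in the build tree):

cmake --build . --parallel 8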

Run the code with UCX workarounds:

export UCX_CUDA_COPY_DMABUF=n
export UCX_TLS=^gdr_copy
ibrun /path/to/build/example/advection/advection-example -i /path/to/source/parthenon/example/advection/parthinput.advection
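
For multi-node scaling runs, a minimal sbatch sketch (the gh queue comes from the idev line above; node/task counts, wall time, and any required allocation flag are placeholders to adjust):

#!/bin/bash
#SBATCH -J parthenon-scaling   # job name
#SBATCH -p gh                  # Grace-Hopper queue, as in the idev line above
#SBATCH -N 2                   # number of nodes (adjust for the scaling study)
#SBATCH -n 2                   # total MPI tasks (one per GH200 GPU)
#SBATCH -t 00:30:00            # wall time

export UCX_CUDA_COPY_DMABUF=n
export UCX_TLS=^gdr_copy
ibrun /path/to/build/example/advection/advection-example -i /path/to/source/parthenon/example/advection/parthinput.advection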