Data type for memory in scratch pad/cache memory. Use the `Kokkos
documentation <https://kokkos.github.io/kokkos-core-wiki/ProgrammingGuide/HierarchicalParallelism.html?highlight=hierarchical>`__
for determining scratch pad memory needs before kernel launch.

On Barriers
-----------

In order to ensure that individual threads of a team are synchronized
write to common variables, see this
`code <https://github.com/parthenon-hpc-lab/parthenon/issues/659#issuecomment-1346871509>`__
for an example.
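
As a minimal sketch (assuming Kokkos' ``team_barrier`` and hypothetical
functions ``f`` and ``g``; ``scratch`` and ``n`` stand in for a team
scratch array and its length), a barrier separating writes to scratch
memory from reads of it looks like:

.. code:: cpp

   // Each thread of the team fills part of the shared scratch array...
   par_for_inner(DEFAULT_INNER_LOOP_PATTERN, member, 0, n - 1,
                 [&](const int i) { scratch(i) = f(i); });
   // ...then all threads wait until scratch is fully populated...
   member.team_barrier();
   // ...before any thread reads a value that another thread wrote.
   par_for_inner(DEFAULT_INNER_LOOP_PATTERN, member, 0, n - 1,
                 [&](const int i) { g(scratch(i), scratch(n - 1 - i)); });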


CMake Options
-------------

GPUs.
``#pragma omp simd`` to vectorize the loop, which typically gives
better vectorization than ``PAR_LOOP_INNER_LAYOUT=TVR_INNER_LOOP`` on
CPUs and so is the default on CPUs.


Performance Considerations
---------------------------

Hierarchical parallelism can produce very performant code, but a
deeper awareness of how hardware is mapped to threads is required to
get optimal performance. Here we list a few strategies/considerations.

* On CPU, with ``SIMDFOR_INNER_LOOP`` you may have trouble vectorizing
  unless you help the compiler along. One way to do so is to work with
  raw pointers to contiguous memory, rather than working with views
  and strides (see the sketch after this list). Even for stencil ops,
  if you can pull out pointers that represent the different points on
  the stencil, this can help with vectorization.
* Similarly on CPUs, due to the cost of starting up a vector op,
vectorization will only be a performance win if there's enough work
in the inner loop. A minimum of 16 points is required for the op to
vectorize at all. Experience shows, however, that at least 64 is
really required to see big wins. One strategy for providing enough
vector cells in the inner loop is to do a 1D ``SIMDFOR`` inner loop
but combine the ``j`` and ``i`` indices by simply looping over the
contiguous memory in a rasterized plane on a block.
* On CUDA GPUs, the outer loop typically maps to warps, while the
  inner maps to threads. To see good performance, you must provide
  both enough work in the inner loop to fill a warp and enough work
  in the outer loop to fill all warps.
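
To make the first two bullets concrete, here is a minimal,
self-contained sketch (not Parthenon API; ``scale_plane`` and the
pointwise update are hypothetical) of vectorizing over a rasterized
``j``/``i`` plane through a raw pointer:

.. code:: cpp

   // Flatten j and i into one contiguous, unit-stride loop so the
   // compiler sees enough work to vectorize profitably.
   void scale_plane(double *plane, const int Nj, const int Ni) {
     // plane points at a contiguous j/i raster of Nj * Ni cells, e.g.,
     // a pointer pulled out of a pack at fixed block, variable, and k.
   #pragma omp simd
     for (int ji = 0; ji < Nj * Ni; ++ji) {
       plane[ji] *= 2.0; // hypothetical pointwise update
     }
   }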

IndexSplit
-------------

To balance the CPU vs GPU hardware considerations of hierarchical
parallelism, ``Parthenon`` provides a utility, the ``IndexSplit``
class, defined in the ``utils/index_split.hpp`` header file and
available in ``<parthenon/package.hpp>`` in the
``parthenon::package::prelude`` namespace. The ``IndexSplit`` class
can be constructed as

.. code:: cpp

   IndexSplit(MeshData<Real> *md, IndexDomain domain, const int nkp,
              const int njp);

where ``md`` points to the ``MeshData`` object on which you want to
operate. ``domain`` specifies where in the ``MeshBlock`` you wish to
operate, for example ``IndexDomain::interior``. ``nkp`` and ``njp``
are the numbers of points in ``X3`` and ``X2``, respectively, that are
in the outer loop. All remaining points are in the inner loop. The
``MeshBlock`` index in the pack is typically also taken to be in the
outer loop. ``nkp`` and ``njp`` also accept the special flags
``IndexSplit::all_outer`` and ``IndexSplit::no_outer``, which specify
that all or none of the indices in that direction, respectively,
should be in the outer loop.
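
For example, assuming a ``MeshData`` pointer ``md``, a split that puts
all of ``X3`` in the outer loop and keeps all of ``X2`` in the inner
loop might be constructed as (a sketch, not a prescription):

.. code:: cpp

   // All k indices in the outer loop; all j indices in the inner loop.
   IndexSplit idx_sp(md, IndexDomain::interior, IndexSplit::all_outer,
                     IndexSplit::no_outer);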

A second constructor alternatively sets the range for ``X3``, ``X2``,
and ``X1`` explicitly:

.. code:: cpp

   IndexSplit(MeshData<Real> *md, const IndexRange &kb, const IndexRange &jb,
              const IndexRange &ib, const int nkp, const int njp);

where ``kb``, ``jb``, and ``ib`` specify the starting and ending
indices for ``X3``, ``X2``, and ``X1``, respectively.
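
A sketch of this form, with hypothetical explicit bounds (assuming
``IndexRange`` can be aggregate-initialized with a starting index ``s``
and an ending index ``e``):

.. code:: cpp

   // Operate on an explicitly specified region, with all of X3 in the
   // outer loop and 4 j-planes per outer iteration.
   IndexRange kb{0, 15}, jb{0, 15}, ib{0, 15}; // hypothetical bounds
   IndexSplit idx_sp(md, kb, jb, ib, IndexSplit::all_outer, 4);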

An ``IndexSplit`` object is typically used as:

.. code:: cpp

   using namespace parthenon::package::prelude;
   using parthenon::ScratchPad1D;
   using parthenon::IndexSplit;
   using parthenon::par_for_outer;
   using parthenon::par_for_inner;
   using parthenon::team_mbr_t;
   // (md, nkp, njp, nblocks, pack, ivar, and foo are assumed to be
   // defined in the surrounding code.)
   // Initialize the index split object
   IndexSplit idx_sp(md, IndexDomain::interior, nkp, njp);
   // Request maximum sizes in i and j in the inner loop, for scratch
   const int Ni = idx_sp.get_max_ni();
   const int Nj = idx_sp.get_max_nj();
   const int Nmax = Ni * Nj;
   // A single scratch array for i,j
   auto scratch_size = ScratchPad1D<Real>::shmem_size(Nmax);
   constexpr int scratch_level = 0;
   // Par for
   par_for_outer(
       DEFAULT_OUTER_LOOP_PATTERN, "KernelOuter", DevExecSpace(), scratch_size,
       scratch_level, 0, nblocks - 1, 0, idx_sp.outer_size() - 1,
       KOKKOS_LAMBDA(team_mbr_t member, const int b, const int outer_idx) {
         ScratchPad1D<Real> scratch(member.team_scratch(scratch_level), Nmax);
         // Get index ranges. Note they depend on where we are in the
         // outer index! These give us a sense for where we are in k,j space.
         const auto krange = idx_sp.GetBoundsK(outer_idx);
         const auto jrange = idx_sp.GetBoundsJ(outer_idx);
         // This is the loop of contiguous inner memory. It may contain
         // both i and j!
         const auto irange = idx_sp.GetInnerBounds(jrange);
         // Whatever part of k is not in the outer loop can be looped
         // over with a normal for loop here.
         for (int k = krange.s; k <= krange.e; ++k) {
           // Pull out a pointer to some variable in some pack. Note we
           // pick the 0th index of i at k and jrange.s.
           Real *var = &pack(b, ivar, k, jrange.s, 0);
           // Do something with the pointer in the inner loop.
           par_for_inner(DEFAULT_INNER_LOOP_PATTERN, member, irange.s, irange.e,
                         [&](const int i) { foo(var[i]); });
         }
       });
