Data type for memory in scratch pad/cache memory. Use the `Kokkos
documentation <https://kokkos.github.io/kokkos-core-wiki/ProgrammingGuide/HierarchicalParallelism.html?highlight=hierarchical>`__
for determining scratch pad memory needs before kernel launch.

On Barriers
-----------

In order to ensure that individual threads of a team are synchronized
write to common variables, see this
`code <https://github.com/parthenon-hpc-lab/parthenon/issues/659#issuecomment-1346871509>`__
for an example.
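
As a minimal sketch (assuming Kokkos' ``team_barrier`` and hypothetical
functions ``f`` and ``g``; ``scratch`` and ``n`` stand in for a team
scratch array and its length), a barrier separating writes to scratch
memory from reads of it looks like:

.. code:: cpp

   // Each thread of the team fills part of the shared scratch array...
   par_for_inner(DEFAULT_INNER_LOOP_PATTERN, member, 0, n - 1,
                 [&](const int i) { scratch(i) = f(i); });
   // ...then all threads wait until scratch is fully populated...
   member.team_barrier();
   // ...before any thread reads a value that another thread wrote.
   par_for_inner(DEFAULT_INNER_LOOP_PATTERN, member, 0, n - 1,
                 [&](const int i) { g(scratch(i), scratch(n - 1 - i)); });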


CMake Options
-------------

GPUs.
``#pragma omp simd`` to vectorize the loop, which typically gives
better vectorization than ``PAR_LOOP_INNER_LAYOUT=TVR_INNER_LOOP`` on
CPUs and so is the default on CPUs.


Performance Considerations
---------------------------

Hierarchical parallelism can produce very performant code, but a
deeper awareness of how hardware is mapped to threads is required to
get optimal performance. Here we list a few strategies/considerations.

* On CPU, with ``SIMDFOR_INNER_LOOP`` you may have trouble vectorizing
  unless you help the compiler along. One way to do so is to work with
  raw pointers to contiguous memory, rather than working with views
  and strides (see the sketch after this list). Even for stencil ops,
  if you can pull out pointers that represent the different points on
  the stencil, this can help with vectorization.
* Similarly on CPUs, due to the cost of starting up a vector op,
vectorization will only be a performance win if there's enough work
in the inner loop. A minimum of 16 points is required for the op to
vectorize at all. Experience shows, however, that at least 64 is
really required to see big wins. One strategy for providing enough
vector cells in the inner loop is to do a 1D ``SIMDFOR`` inner loop
but combine the ``j`` and ``i`` indices by simply looping over the
contiguous memory in a rasterized plane on a block.
* On CUDA GPUs, the outer loop typically maps to warps, while the
  inner maps to threads. To see good performance, you must provide
  both enough work in the inner loop to fill a warp and enough work
  in the outer loop to fill all warps.
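
To make the first two bullets concrete, here is a minimal,
self-contained sketch (not Parthenon API; ``scale_plane`` and the
pointwise update are hypothetical) of vectorizing over a rasterized
``j``/``i`` plane through a raw pointer:

.. code:: cpp

   // Flatten j and i into one contiguous, unit-stride loop so the
   // compiler sees enough work to vectorize profitably.
   void scale_plane(double *plane, const int Nj, const int Ni) {
     // plane points at a contiguous j/i raster of Nj * Ni cells, e.g.,
     // a pointer pulled out of a pack at fixed block, variable, and k.
   #pragma omp simd
     for (int ji = 0; ji < Nj * Ni; ++ji) {
       plane[ji] *= 2.0; // hypothetical pointwise update
     }
   }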

IndexSplit
-------------

To balance the CPU vs GPU hardware considerations of hierarchical
parallelism, ``Parthenon`` provides a utility, the ``IndexSplit``
class, defined in the ``utils/index_split.hpp`` header file and
available in ``<parthenon/package.hpp>`` in the
``parthenon::package::prelude`` namespace. The ``IndexSplit`` class
can be constructed as

.. code:: cpp

   IndexSplit(MeshData<Real> *md, IndexDomain domain, const int nkp,
              const int njp);

where ``md`` points to the ``MeshData`` object on which you want to
operate. ``domain`` specifies where in the ``MeshBlock`` you wish to
operate, for example ``IndexDomain::interior``. ``nkp`` and ``njp``
are the numbers of points in ``X3`` and ``X2``, respectively, that are
in the outer loop. All remaining points are in the inner loop. The
``MeshBlock`` index in the pack is typically also taken to be in the
outer loop. ``nkp`` and ``njp`` also accept the special flags
``IndexSplit::all_outer`` and ``IndexSplit::no_outer``, which specify
that all or none of the indices in that direction, respectively,
should be in the outer loop.
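
For example, assuming a ``MeshData`` pointer ``md``, a split that puts
all of ``X3`` in the outer loop and keeps all of ``X2`` in the inner
loop might be constructed as (a sketch, not a prescription):

.. code:: cpp

   // All k indices in the outer loop; all j indices in the inner loop.
   IndexSplit idx_sp(md, IndexDomain::interior, IndexSplit::all_outer,
                     IndexSplit::no_outer);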

A second constructor alternatively sets the range for ``X3``, ``X2``,
and ``X1`` explicitly:

.. code:: cpp

   IndexSplit(MeshData<Real> *md, const IndexRange &kb, const IndexRange &jb,
              const IndexRange &ib, const int nkp, const int njp);

where ``kb``, ``jb``, and ``ib`` specify the starting and ending
indices for ``X3``, ``X2``, and ``X1``, respectively.
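
A sketch of this form, with hypothetical explicit bounds (assuming
``IndexRange`` can be aggregate-initialized with a starting index ``s``
and an ending index ``e``):

.. code:: cpp

   // Operate on an explicitly specified region, with all of X3 in the
   // outer loop and 4 j-planes per outer iteration.
   IndexRange kb{0, 15}, jb{0, 15}, ib{0, 15}; // hypothetical bounds
   IndexSplit idx_sp(md, kb, jb, ib, IndexSplit::all_outer, 4);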

An ``IndexSplit`` object is typically used as:

.. code:: cpp

   using namespace parthenon::package::prelude;
   using parthenon::ScratchPad1D;
   using parthenon::IndexSplit;
   using parthenon::par_for_outer;
   using parthenon::par_for_inner;
   using parthenon::team_mbr_t;
   // (md, nkp, njp, nblocks, pack, ivar, and foo are assumed to be
   // defined in the surrounding code.)
   // Initialize the index split object
   IndexSplit idx_sp(md, IndexDomain::interior, nkp, njp);
   // Request maximum sizes in i and j in the inner loop, for scratch
   const int Ni = idx_sp.get_max_ni();
   const int Nj = idx_sp.get_max_nj();
   const int Nmax = Ni * Nj;
   // A single scratch array for i,j
   auto scratch_size = ScratchPad1D<Real>::shmem_size(Nmax);
   constexpr int scratch_level = 0;
   // Par for
   par_for_outer(
       DEFAULT_OUTER_LOOP_PATTERN, "KernelOuter", DevExecSpace(), scratch_size,
       scratch_level, 0, nblocks - 1, 0, idx_sp.outer_size() - 1,
       KOKKOS_LAMBDA(team_mbr_t member, const int b, const int outer_idx) {
         ScratchPad1D<Real> scratch(member.team_scratch(scratch_level), Nmax);
         // Get index ranges. Note they depend on where we are in the
         // outer index! These give us a sense for where we are in k,j space.
         const auto krange = idx_sp.GetBoundsK(outer_idx);
         const auto jrange = idx_sp.GetBoundsJ(outer_idx);
         // This is the loop of contiguous inner memory. It may contain
         // both i and j!
         const auto irange = idx_sp.GetInnerBounds(jrange);
         // Whatever part of k is not in the outer loop can be looped
         // over with a normal for loop here.
         for (int k = krange.s; k <= krange.e; ++k) {
           // Pull out a pointer to some variable in some pack. Note we
           // pick the 0th index of i at k and jrange.s.
           Real *var = &pack(b, ivar, k, jrange.s, 0);
           // Do something with the pointer in the inner loop.
           par_for_inner(DEFAULT_INNER_LOOP_PATTERN, member, irange.s, irange.e,
                         [&](const int i) { foo(var[i]); });
         }
       });
