[WIP] add application debugging crash callback #1197

Draft: wants to merge 23 commits into base: develop
4 changes: 3 additions & 1 deletion .github/workflows/ci-extended.yml
@@ -21,6 +21,8 @@ env:
CMAKE_BUILD_PARALLEL_LEVEL: 5 # num threads for build
MACHINE_CFG: cmake/machinecfg/CI.cmake
OMPI_MCA_mpi_common_cuda_event_max: 1000
# CUDA IPC within docker repeatedly seems to cause issues on the CI machine
OMPI_MCA_btl_smcuda_use_cuda_ipc: 0
# https://github.com/open-mpi/ompi/issues/4948#issuecomment-395468231
OMPI_MCA_btl_vader_single_copy_mechanism: none

@@ -34,7 +36,7 @@ jobs:
container:
image: ghcr.io/parthenon-hpc-lab/cuda11.6-mpi-hdf5-ascent
# map to local user id on CI machine to allow writing to build cache
options: --user 1001
options: --user 1001 --cap-add CAP_SYS_PTRACE --shm-size="8g" --ulimit memlock=134217728
steps:
- uses: actions/checkout@v3
with:
8 changes: 5 additions & 3 deletions .github/workflows/ci-short.yml
@@ -13,6 +13,8 @@ env:
CMAKE_BUILD_PARALLEL_LEVEL: 5 # num threads for build
MACHINE_CFG: cmake/machinecfg/CI.cmake
OMPI_MCA_mpi_common_cuda_event_max: 1000
# CUDA IPC within docker repeatedly seems to cause issues on the CI machine
OMPI_MCA_btl_smcuda_use_cuda_ipc: 0
# https://github.com/open-mpi/ompi/issues/4948#issuecomment-395468231
OMPI_MCA_btl_vader_single_copy_mechanism: none

@@ -22,7 +24,7 @@ jobs:
container:
image: ghcr.io/parthenon-hpc-lab/cuda11.6-mpi-hdf5-ascent
# map to local user id on CI machine to allow writing to build cache
options: --user 1001
options: --user 1001 --cap-add CAP_SYS_PTRACE --shm-size="8g" --ulimit memlock=134217728
steps:
- uses: actions/checkout@v3
with:
@@ -47,7 +49,7 @@ jobs:
container:
image: ghcr.io/parthenon-hpc-lab/cuda11.6-mpi-hdf5-ascent
# map to local user id on CI machine to allow writing to build cache
options: --user 1001
options: --user 1001 --cap-add CAP_SYS_PTRACE --shm-size="8g" --ulimit memlock=134217728
steps:
- uses: actions/checkout@v3
with:
@@ -79,7 +81,7 @@ jobs:
container:
image: ghcr.io/parthenon-hpc-lab/cuda11.6-mpi-hdf5-ascent
# map to local user id on CI machine to allow writing to build cache
options: --user 1001
options: --user 1001 --cap-add CAP_SYS_PTRACE --shm-size="8g" --ulimit memlock=134217728
steps:
- uses: actions/checkout@v3
with:
24 changes: 12 additions & 12 deletions CHANGELOG.md
@@ -3,19 +3,19 @@
## Current develop

### Added (new features/APIs/variables/...)
- [[PR 1185]](https://github.com/parthenon-hpc-lab/parthenon/pull/1185/files) Bugfix to particle defragmentation
- [[PR 1185]](https://github.com/parthenon-hpc-lab/parthenon/pull/1185) Bugfix to particle defragmentation
- [[PR 1184]](https://github.com/parthenon-hpc-lab/parthenon/pull/1184) Fix swarm block neighbor indexing in 1D, 2D
- [[PR 1183]](https://github.com/parthenon-hpc-lab/parthenon/pull/1183) Fix particle leapfrog example initialization data
- [[PR 1179]](https://github.com/parthenon-hpc-lab/parthenon/pull/1179) Make a global variable for whether simulation is a restart
- [[PR 1171]](https://github.com/parthenon-hpc-lab/parthenon/pull/1171) Add PARTHENON_USE_SYSTEM_PACKAGES build option
- [[PR 1161]](https://github.com/parthenon-hpc-lab/parthenon/pull/1161) Make flux field Metadata accessible, add Metadata::CellMemAligned flag, small performance upgrades

### Changed (changing behavior/API/variables/...)
- [[PR 1191]](https://github.com/parthenon-hpc-lab/parthenon/pull/1191) Update Kokkos version to 4.4.1
- [[PR 1206]](https://github.com/parthenon-hpc-lab/parthenon/pull/1206) Leapfrog fix
- [[PR1203]](https://github.com/parthenon-hpc-lab/parthenon/pull/1203) Pin Ubuntu CI image
- [[PR1177]](https://github.com/parthenon-hpc-lab/parthenon/pull/1177) Make mesh-level boundary conditions usable without the "user" flag
- [[PR 1203]](https://github.com/parthenon-hpc-lab/parthenon/pull/1203) Pin Ubuntu CI image
- [[PR 1177]](https://github.com/parthenon-hpc-lab/parthenon/pull/1177) Make mesh-level boundary conditions usable without the "user" flag
- [[PR 1187]](https://github.com/parthenon-hpc-lab/parthenon/pull/1187) Make DataCollection::Add safer and generalize MeshBlockData::Initialize
- [[Issue 1165]](https://github.com/parthenon-hpc-lab/parthenon/issues/1165) Bump Kokkos submodule to 4.4.1
- [[PR 1171]](https://github.com/parthenon-hpc-lab/parthenon/pull/1171) Add PARTHENON_USE_SYSTEM_PACKAGES build option
- [[PR 1172]](https://github.com/parthenon-hpc-lab/parthenon/pull/1172) Make parthenon manager robust against external MPI init and finalize calls

@@ -32,7 +32,7 @@


### Incompatibilities (i.e. breaking changes)
- [[PR1177]](https://github.com/parthenon-hpc-lab/parthenon/pull/1177) Make mesh-level boundary conditions usable without the "user" flag
- [[PR 1177]](https://github.com/parthenon-hpc-lab/parthenon/pull/1177) Make mesh-level boundary conditions usable without the "user" flag

## Release 24.08
Date: 2024-08-30
@@ -156,12 +156,12 @@ Date: 2024-03-21
- [[PR 973]](https://github.com/parthenon-hpc-lab/parthenon/pull/973) Multigrid performance upgrades

### Fixed (not changing behavior/API/variables/...)
- [[PR1023]](https://github.com/parthenon-hpc-lab/parthenon/pull/1023) Fix broken param of a scalar bool
- [[PR1012]](https://github.com/parthenon-hpc-lab/parthenon/pull/1012) Remove accidentally duplicated code
- [[PR992]](https://github.com/parthenon-hpc-lab/parthenon/pull/992) Allow custom PR ops with sparse pools
- [[PR988]](https://github.com/parthenon-hpc-lab/parthenon/pull/988) Fix bug in neighbor finding routine for small, periodic, refined meshes
- [[PR986]](https://github.com/parthenon-hpc-lab/parthenon/pull/986) Fix bug in sparse boundary communication BndInfo cacheing
- [[PR978]](https://github.com/parthenon-hpc-lab/parthenon/pull/978) remove erroneous sparse check
- [[PR 1023]](https://github.com/parthenon-hpc-lab/parthenon/pull/1023) Fix broken param of a scalar bool
- [[PR 1012]](https://github.com/parthenon-hpc-lab/parthenon/pull/1012) Remove accidentally duplicated code
- [[PR 992]](https://github.com/parthenon-hpc-lab/parthenon/pull/992) Allow custom PR ops with sparse pools
- [[PR 988]](https://github.com/parthenon-hpc-lab/parthenon/pull/988) Fix bug in neighbor finding routine for small, periodic, refined meshes
- [[PR 986]](https://github.com/parthenon-hpc-lab/parthenon/pull/986) Fix bug in sparse boundary communication BndInfo cacheing
- [[PR 978]](https://github.com/parthenon-hpc-lab/parthenon/pull/978) remove erroneous sparse check

### Infrastructure (changes irrelevant to downstream codes)
- [[PR 1027]](https://github.com/parthenon-hpc-lab/parthenon/pull/1027) Refactor RestartReader as abstract class
@@ -228,7 +228,7 @@ Date: 2023-11-16
- [[PR 901]](https://github.com/parthenon-hpc-lab/parthenon/pull/901) Implement shared element ownership model

### Removed (removing behavior/API/variables/...)
- [[PR 930](https://github.com/parthenon-hpc-lab/parthenon/pull/930) Remove ParthenonManager::ParthenonInit as it is error-prone and the split functions are the recommended usage.
- [[PR 930]](https://github.com/parthenon-hpc-lab/parthenon/pull/930) Remove ParthenonManager::ParthenonInit as it is error-prone and the split functions are the recommended usage.


## Release 0.8.0
2 changes: 1 addition & 1 deletion README.md
@@ -30,7 +30,7 @@ Parthenon -- a performance portable block-structured adaptive mesh refinement fr

* CMake 3.16 or greater
* C++17 compatible compiler
* Kokkos 4.0.1 or greater
* Kokkos 4.4.1 or greater

## Optional (enabling features)

1 change: 1 addition & 0 deletions cmake/machinecfg/GitHubActions.cmake
@@ -19,6 +19,7 @@ message(STATUS "Loading machine configuration for GitHub Actions CI. ")

# common options
set(NUM_MPI_PROC_TESTING "2" CACHE STRING "CI runs tests with 2 MPI ranks")
set(Kokkos_ENABLE_ROCTHRUST OFF CACHE BOOL "Temporarily disabled as the container needs to be updated to the `-complete` base image.")

set(MACHINE_CXX_FLAGS "")
if (${MACHINE_VARIANT} MATCHES "cuda")
28 changes: 28 additions & 0 deletions doc/sphinx/src/development.rst
@@ -62,6 +62,34 @@ parallelism interface that is needed for managing memory cached in
tightly nested loops. The wrappers are documented
:ref:`here <nested par for>`.

View of Views
-------------

Special care needs to be taken when working with a ``View`` of ``View``.

To repeat the Kokkos documentation: `Don't use them <https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/View.html#can-i-make-a-view-of-views>`__

But if you have to (which is the case in some places inside Parthenon)
then follow this pattern:

.. code:: c++

  Kokkos::View<ParArray1D<Real> *> view_of_pararrays(parthenon::ViewOfViewAlloc("myname"), 10);

``ViewOfViewAlloc`` ensures that the ``Kokkos::SequentialHostInit`` property is added,
so that the deallocators of the inner ``View`` objects are called on the host (rather than
on the device, which is the default).

Similarly, when you create a host mirror of such a ``View`` of ``View``, add the same
property for the same reason.

.. code:: c++

  auto view_of_pararrays_h =
      Kokkos::create_mirror_view(Kokkos::view_alloc(Kokkos::SequentialHostInit), view_of_pararrays);

Note that ``SequentialHostInit`` was only added in Kokkos 4.4.1 (which is now the default version in Parthenon).

The need for reductions within function handling ``MeshBlock`` data
-------------------------------------------------------------------

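The motivation for `SequentialHostInit` in the documentation hunk above can be illustrated without Kokkos. The following is a minimal plain C++ sketch, not Kokkos or Parthenon code: `InnerHandle`, `SequentialHostArray` and `LiveHandlesAfterScope` are hypothetical names standing in for an inner `View`, the outer `View` of `View`, and a small check. It only shows the idea of constructing and destroying non-trivial elements one at a time in a serial host loop, so that every inner destructor runs on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>

// Hypothetical stand-in for an inner View: a reference-counted handle whose
// constructor and destructor do real work (here they only update a counter).
struct InnerHandle {
  // Number of InnerHandle objects that are currently alive.
  static int &Live() {
    static int n = 0;
    return n;
  }
  std::shared_ptr<int> data;
  InnerHandle() : data(std::make_shared<int>(0)) { ++Live(); }
  InnerHandle(const InnerHandle &other) : data(other.data) { ++Live(); }
  InnerHandle &operator=(const InnerHandle &) = default;
  ~InnerHandle() { --Live(); }
};

// Hypothetical stand-in for the outer View: raw storage whose elements are
// constructed and destroyed one by one in a serial loop on the host.
template <typename T>
class SequentialHostArray {
 public:
  explicit SequentialHostArray(std::size_t n)
      : n_(n), raw_(static_cast<T *>(::operator new(n * sizeof(T)))) {
    for (std::size_t i = 0; i < n_; ++i) new (raw_ + i) T();  // serial construction
  }
  ~SequentialHostArray() {
    for (std::size_t i = 0; i < n_; ++i) raw_[i].~T();  // serial destruction
    ::operator delete(raw_);
  }
  SequentialHostArray(const SequentialHostArray &) = delete;
  SequentialHostArray &operator=(const SequentialHostArray &) = delete;

  T &operator[](std::size_t i) { return raw_[i]; }
  std::size_t size() const { return n_; }

 private:
  std::size_t n_;
  T *raw_;
};

// Builds an outer array of n handles, lets it go out of scope, and returns
// how many inner handles are still alive afterwards.
inline int LiveHandlesAfterScope(std::size_t n) {
  {
    SequentialHostArray<InnerHandle> outer(n);
    assert(InnerHandle::Live() == static_cast<int>(n));
  }
  return InnerHandle::Live();
}
```

In this sketch, `LiveHandlesAfterScope(10)` returns `0` because each inner destructor runs exactly once in the serial loop. If the destruction loop were skipped, or run somewhere the inner handles cannot be safely destroyed, the inner resources would leak or be released incorrectly, which is the situation the documentation warns about.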
2 changes: 1 addition & 1 deletion external/Kokkos
Submodule Kokkos updated 1031 files
3 changes: 0 additions & 3 deletions src/bvals/bvals.hpp
@@ -101,9 +101,6 @@ class BoundarySwarm : public BoundaryCommunication {
explicit BoundarySwarm(std::weak_ptr<MeshBlock> pmb, const std::string &label);
~BoundarySwarm() = default;

std::vector<ParArrayND<int>> vars_int;
std::vector<ParArrayND<Real>> vars_real;

// (usually the std::size_t unsigned integer type)
std::vector<BoundaryCommunication *>::size_type bswarm_index;

5 changes: 3 additions & 2 deletions src/bvals/comms/bnd_info.cpp
@@ -40,8 +40,9 @@
namespace parthenon {

void ProResCache_t::Initialize(int n_regions, StateDescriptor *pkg) {
prores_info = ParArray1D<ProResInfo>("prores_info", n_regions);
prores_info_h = Kokkos::create_mirror_view(prores_info);
prores_info = ProResInfoArr_t(ViewOfViewAlloc("prores_info"), n_regions);
prores_info_h = Kokkos::create_mirror_view(
Kokkos::view_alloc(Kokkos::SequentialHostInit), prores_info);
int nref_funcs = pkg->NumRefinementFuncs();
// Note that assignment of Kokkos views resets them, but
// buffer_subset_sizes is a std::vector. It must be cleared, then
7 changes: 4 additions & 3 deletions src/bvals/comms/bnd_info.hpp
@@ -26,6 +26,7 @@
#include "bvals/neighbor_block.hpp"
#include "coordinates/coordinates.hpp"
#include "interface/variable_state.hpp"
#include "kokkos_abstraction.hpp"
#include "mesh/domain.hpp"
#include "mesh/forest/logical_coordinate_transformation.hpp"
#include "utils/communication_buffer.hpp"
@@ -127,11 +128,11 @@ struct ProResInfo {
int GetBufferSize(MeshBlock *pmb, const NeighborBlock &nb,
std::shared_ptr<Variable<Real>> v);

using BndInfoArr_t = ParArray1D<BndInfo>;
using BndInfoArr_t = Kokkos::View<BndInfo *, LayoutWrapper, DevMemSpace>;
using BndInfoArrHost_t = typename BndInfoArr_t::HostMirror;

using ProResInfoArr_t = ParArray1D<ProResInfo>;
using ProResInfoArrHost_t = typename ParArray1D<ProResInfo>::HostMirror;
using ProResInfoArr_t = Kokkos::View<ProResInfo *, LayoutWrapper, DevMemSpace>;
using ProResInfoArrHost_t = typename ProResInfoArr_t::HostMirror;
class StateDescriptor;
struct ProResCache_t {
ProResInfoArr_t prores_info{};
6 changes: 4 additions & 2 deletions src/bvals/comms/bvals_utils.hpp
@@ -28,6 +28,7 @@
#include "bvals/comms/bnd_info.hpp"
#include "bvals/comms/bvals_in_one.hpp"
#include "interface/variable.hpp"
#include "kokkos_abstraction.hpp"
#include "mesh/domain.hpp"
#include "mesh/mesh.hpp"
#include "mesh/meshblock.hpp"
@@ -215,8 +216,9 @@ inline void RebuildBufferCache(std::shared_ptr<MeshData<Real>> md, int nbound,
using namespace loops;
using namespace loops::shorthands;
BvarsSubCache_t &cache = md->GetBvarsCache().GetSubCache(BOUND_TYPE, SENDER);
cache.bnd_info = BndInfoArr_t("bnd_info", nbound);
cache.bnd_info_h = Kokkos::create_mirror_view(cache.bnd_info);
cache.bnd_info = BndInfoArr_t(ViewOfViewAlloc("bnd_info"), nbound);
cache.bnd_info_h = Kokkos::create_mirror_view(
Kokkos::view_alloc(Kokkos::SequentialHostInit), cache.bnd_info);

// prolongation/restriction sub-sets
// TODO(JMM): Right now I exclude fluxcorrection boundaries but if
18 changes: 15 additions & 3 deletions src/driver/driver.cpp
@@ -94,6 +94,8 @@ DriverStatus EvolutionDriver::Execute() {
// Defaults must be set across all ranks
DumpInputParameters();

DriverStatus driver_status = DriverStatus::complete;

{ // Main t < tmax loop region
PARTHENON_INSTRUMENT
while (tm.KeepGoing() && signal != OutputSignal::analysis) {
@@ -133,6 +135,14 @@ DriverStatus EvolutionDriver::Execute() {
// check for signals
signal = SignalHandler::CheckSignalFlags();

// TODO(bwibking): check for application debug callback
// currently hard-coded to check for tiny dt
if (tm.dt < 1e-6 * tm.time) {
signal = OutputSignal::final;
driver_status = DriverStatus::failed;
// do not return here, since we still want to write an output
}

if (signal == OutputSignal::final) {
break;
}
@@ -155,14 +165,16 @@ DriverStatus EvolutionDriver::Execute() {
pmesh->UserWorkAfterLoop(pmesh, pinput, tm);
}

DriverStatus status = tm.KeepGoing() ? DriverStatus::timeout : DriverStatus::complete;
if (driver_status != DriverStatus::failed) {
driver_status = tm.KeepGoing() ? DriverStatus::timeout : DriverStatus::complete;
}
// Do *not* write the "final" output, if this is analysis run.
// The analysis output itself has already been written above before the main loop.
if (signal != OutputSignal::analysis) {
pouts->MakeOutputs(pmesh, pinput, &tm, OutputSignal::final);
}
PostExecute(status);
return status;
PostExecute(driver_status);
return driver_status;
}

void EvolutionDriver::PostExecute(DriverStatus status) {
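The `TODO` in the `driver.cpp` hunk above mentions replacing the hard-coded tiny-`dt` check with an application debug callback. One way this could look is sketched below in plain C++. This is only a sketch under stated assumptions: `Status`, `StepInfo`, `DebugCallback`, `TinyTimestep`, `RunResult` and `RunLoop` are hypothetical names and not part of Parthenon. The sketch mirrors the control flow of the diff: the callback can flag a failure, the loop breaks, the final output is still written, and the `failed` status is not overwritten afterwards.

```cpp
#include <cassert>
#include <functional>

// All names in this block are hypothetical and are not Parthenon API.
enum class Status { complete, timeout, failed };

struct StepInfo {
  double time;  // current simulation time
  double dt;    // current timestep
};

// Returns true when the run should stop and be reported as failed.
using DebugCallback = std::function<bool(const StepInfo &)>;

// Default check that mirrors the hard-coded condition in the diff.
inline bool TinyTimestep(const StepInfo &s) { return s.dt < 1e-6 * s.time; }

struct RunResult {
  Status status;
  bool wrote_final_output;
};

// Toy time loop: advances `s` until tmax or max_steps is reached, or until
// the (optional) callback requests an abort.
inline RunResult RunLoop(StepInfo s, double tmax, int max_steps,
                         const DebugCallback &should_abort) {
  assert(max_steps >= 0);
  bool failed = false;
  int step = 0;
  while (s.time < tmax && step < max_steps) {
    s.time += s.dt;
    ++step;
    // An empty std::function means that no callback was registered.
    if (should_abort && should_abort(s)) {
      failed = true;
      break;  // do not return here, a final output is still wanted
    }
  }
  // Keep `failed` if it was set; otherwise distinguish timeout from complete.
  Status status = Status::failed;
  if (!failed) status = (s.time < tmax) ? Status::timeout : Status::complete;
  // Stand-in for writing the final output, which happens in every case here.
  const bool wrote_final_output = true;
  return RunResult{status, wrote_final_output};
}
```

For example, `RunLoop(StepInfo{1.0, 1e-9}, 2.0, 5, TinyTimestep)` reports `Status::failed` while still marking the final output as written. Passing an empty `DebugCallback{}` disables the check, so an application that does not register a callback would see the same behavior as before the change.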
7 changes: 5 additions & 2 deletions src/interface/mesh_data.hpp
@@ -26,6 +26,7 @@
#include "interface/sparse_pack_base.hpp"
#include "interface/swarm_pack_base.hpp"
#include "interface/variable_pack.hpp"
#include "kokkos_abstraction.hpp"
#include "mesh/domain.hpp"
#include "mesh/meshblock.hpp"
#include "mesh/meshblock_pack.hpp"
@@ -149,8 +150,10 @@ const MeshBlockPack<P> &PackOnMesh(M &map, BlockDataList_t<Real> &block_data_,
}

if (make_new_pack) {
ParArray1D<P> packs("MeshData::PackVariables::packs", nblocks);
auto packs_host = Kokkos::create_mirror_view(packs);
Kokkos::View<P *, LayoutWrapper, DevMemSpace> packs(
ViewOfViewAlloc("MeshData::PackVariables::packs"), nblocks);
auto packs_host =
Kokkos::create_mirror_view(Kokkos::view_alloc(Kokkos::SequentialHostInit), packs);

for (size_t i = 0; i < nblocks; i++) {
const auto &pack = packing_function(block_data_[i], this_map, this_key);
11 changes: 7 additions & 4 deletions src/interface/sparse_pack_base.cpp
@@ -30,6 +30,7 @@
#include "interface/sparse_pack_base.hpp"
#include "interface/state_descriptor.hpp"
#include "interface/variable.hpp"
#include "kokkos_abstraction.hpp"
#include "utils/utils.hpp"
namespace parthenon {
namespace impl {
@@ -151,8 +152,9 @@ SparsePackBase SparsePackBase::Build(T *pmd, const PackDescriptor &desc,
} else if (contains_face_or_edge) {
leading_dim += 2;
}
pack.pack_ = pack_t("data_ptr", leading_dim, pack.nblocks_, max_size);
pack.pack_h_ = Kokkos::create_mirror_view(pack.pack_);
pack.pack_ = pack_t(ViewOfViewAlloc("data_ptr"), leading_dim, pack.nblocks_, max_size);
pack.pack_h_ = Kokkos::create_mirror_view(
Kokkos::view_alloc(Kokkos::SequentialHostInit), pack.pack_);

// For non-flat packs, shape of pack is type x block x var x k x j x i
// where type here might be a flux.
@@ -167,8 +169,9 @@ SparsePackBase SparsePackBase::Build(T *pmd, const PackDescriptor &desc,
pack.block_props_ = block_props_t("block_props", nblocks, 27 + 1);
pack.block_props_h_ = Kokkos::create_mirror_view(pack.block_props_);

pack.coords_ = coords_t("coords", desc.flat ? max_size : nblocks);
auto coords_h = Kokkos::create_mirror_view(pack.coords_);
pack.coords_ = coords_t(ViewOfViewAlloc("coords"), desc.flat ? max_size : nblocks);
auto coords_h = Kokkos::create_mirror_view(
Kokkos::view_alloc(Kokkos::SequentialHostInit), pack.coords_);

// Fill the views
int idx = 0;
6 changes: 4 additions & 2 deletions src/interface/sparse_pack_base.hpp
@@ -30,6 +30,7 @@
#include "interface/state_descriptor.hpp"
#include "interface/variable.hpp"
#include "interface/variable_state.hpp"
#include "kokkos_abstraction.hpp"
#include "utils/utils.hpp"

namespace parthenon {
@@ -55,13 +56,14 @@ class SparsePackBase {

using alloc_t = std::vector<int>;
using include_t = std::vector<bool>;
using pack_t = ParArray3D<ParArray3D<Real, VariableState>>;
using pack_t =
Kokkos::View<ParArray3D<Real, VariableState> ***, LayoutWrapper, DevMemSpace>;
using pack_h_t = typename pack_t::HostMirror;
using bounds_t = ParArray3D<int>;
using bounds_h_t = typename bounds_t::HostMirror;
using block_props_t = ParArray2D<int>;
using block_props_h_t = typename block_props_t::HostMirror;
using coords_t = ParArray1D<ParArray0D<Coordinates_t>>;
using coords_t = Kokkos::View<ParArray0D<Coordinates_t> *, LayoutWrapper, DevMemSpace>;

// Returns a SparsePackBase object that is either newly created or taken
// from the cache in pmd. The cache itself handles all of this logic