Currently, the explicit time advance (the core of the code) runs via calls into the kronmult library: https://github.com/project-asgard/kronmult. Some kernels in https://github.com/project-asgard/asgard/blob/develop/src/device/kronmult_cuda.cpp are used to set up for calls into the library.
Both the main kronmult code and the setup kernels are written as CUDA kernels, with a fallback to OpenMP. To enhance portability, we could try a number of higher-level approaches:
NVIDIA HPC SDK: https://developer.nvidia.com/hpc-sdk allows C++ parallel algorithms (https://en.cppreference.com/w/cpp/experimental/parallelism) to be run on the accelerator. Our code may not fit this paradigm, but it may be worth exploring.
HIPify the kernels for AMD GPUs: https://rocmdocs.amd.com/en/latest/Programming_Guides/HIP-porting-guide.html.
Others? Kokkos, OpenCL, etc.