Non-contiguous data handling, restructure for multiple transports #100
base: develop
Conversation
How are you going to implement a loop over (wait_any, process) using handles? For example, when receiving halos from multiple processors: wait for any to arrive, process its data, then wait for the next to arrive, in order to overlap communication and computation.
Do you think it would make sense to allow other kinds of statements inside of the plan lambda? Like computations which could be done on the fly upon reception of some data, while waiting for other communications?
I think this is an interesting idea as a follow-on... users may be tempted to do something like this:

```cpp
std::vector<KokkosComm::Req<>> reqs = KokkosComm::plan(space, comm, [=](KokkosComm::Handle<Space> &handle) {
  KokkosComm::isend(handle, xp1_s, get_rank(xp1, ry), 0);
  Kokkos::parallel_for(space, ...);  // roughly...
  KokkosComm::isend(handle, xm1_s, get_rank(xm1, ry), 1);
});
```

But this actually won't work properly, because the …
Yeah, I was thinking of something along these lines, but I think we can avoid redefining new …:

```cpp
auto reqs = KokkosComm::plan(space, comm, [=](KokkosComm::Handle<Space> &handle) {
  KokkosComm::isend(handle, xp1_s, get_rank(xp1, ry), 0).and_then(
      [=] { Kokkos::parallel_for(space, ...); });
  KokkosComm::isend(handle, xm1_s, get_rank(xm1, ry), 1);
});
```
This is a big PR that does a lot of things that I felt were interrelated.
Prepare for non-MPI "transports"
This is all modeled after Kokkos Core. Each transport will have its own subdirectory under `src` (like Kokkos' backends). The interface that each transport needs to implement is a struct: for example, `Irecv`, from `KokkosComm_fwd.hpp`.

Why a `struct`? Because in practice, what this means is that the transport will have a partial specialization (not allowed for functions), as in `mpi/KokkosComm_mpi_irecv.hpp`, where `Mpi` is a `struct` in the `KokkosComm` namespace, analogous to e.g. `Kokkos::Cuda`. A Serial transport would have a corresponding specialization, NCCL would have one, and so on.
To support this, a lot of the include structure needs to be refined and adjusted to more closely match how Kokkos Core does it.
Future work:
Non-contiguous view handling in MPI
Now that we have a place to implement MPI-specific things, this is where we can do non-contiguous data handling strategies. Originally I thought this was orthogonal to the transport, but it is not - consider if we want to use MPI Derived Datatypes to handle non-contiguous data.
The current implementation basically defines a sequence of seven "phases" to coordinate Kokkos execution space instances, MPI communicator, and associated non-contiguous data handling operations:
Five phases to get the communication posted
Plus two more that happen after the communication has posted
All MPI operations and non-contiguous data handling must fit into these seven phases.
Future work:
Reduce fencing due to host-side MPI calls
One problem with a higher-level interface as we've started defining it today is that people want to launch computation between communication calls (as in the examples above). However, our semantics say that the communication is ordered in the execution space, which means for non-host spaces we have to fence internally in e.g. `isend` before we actually call `MPI_Isend`. So we get some pointless fences.

This introduces a "plan" construct like this:
Now our pure KokkosComm APIs would take this `handle` argument, through which the implementation tells the `plan` function whether it needs fences, and how its operation is implemented in terms of the 7 phases. The `plan` looks at all of those and issues the minimal number of fences. This returns one `KokkosComm::Req` for each async operation in the lambda.

We can still have a lower-level search-and-replace style API for interop as well. It can be implemented in terms of this plan, or more directly.
KokkosComm::wait, wait_any, wait_all
Free-standing functions to wait on `KokkosComm::Req`.
Replaces #64 and #32