Non-contiguous data handling, restructure for multiple transports #100
base: develop
Conversation
How are you going to implement a loop over (wait_any, process) using handles? For example, when receiving halos from multiple processors: wait for any to arrive, process its data, then wait for the next to arrive, in order to overlap communication and computation.
Do you think it would make sense to allow other kinds of statements inside of the plan lambda? Like computations which could be done on the fly upon reception of some data, while waiting for other communications?
I think this is an interesting idea as a follow-on... users may be tempted to do something like this:

```cpp
std::vector<KokkosComm::Req<>> reqs = KokkosComm::plan(space, comm, [=](KokkosComm::Handle<Space> &handle) {
  KokkosComm::isend(handle, xp1_s, get_rank(xp1, ry), 0);
  Kokkos::parallel_for(space, ...);  // roughly...
  KokkosComm::isend(handle, xm1_s, get_rank(xm1, ry), 1);
});
```

But this actually won't work properly, because the …
Yeah, I was thinking of something along these lines, but I think we can avoid redefining new …:

```cpp
auto reqs = KokkosComm::plan(space, comm, [=](KokkosComm::Handle<Space> &handle) {
  KokkosComm::isend(handle, xp1_s, get_rank(xp1, ry), 0).and_then(
      [=] { Kokkos::parallel_for(space, ...); });
  KokkosComm::isend(handle, xm1_s, get_rank(xm1, ry), 1);
});
```
This is a big PR that does a lot of things that I felt were interrelated.
Prepare for non-MPI "transports"
This is all modeled after Kokkos Core. Each transport will have its own subdirectory under `src` (like Kokkos' backends). The interface that each transport needs to implement is a struct: for example, `Irecv`, from `KokkosComm_fwd.hpp`.

Why a `struct`? Because in practice, what this means is that the transport will have a partial specialization (not allowed for functions), as in `mpi/KokkosComm_mpi_irecv.hpp`, where `Mpi` is a `struct` in the `KokkosComm` namespace, analogous to e.g. `Kokkos::Cuda`. A Serial transport would have a corresponding specialization, NCCL would have one, and so on.
To support this, a lot of the include structure needs to be refined and adjusted to more closely match how Kokkos Core does it.
Future work:
Non-contiguous view handling in MPI
Now that we have a place to implement MPI-specific things, this is where we can do non-contiguous data handling strategies. Originally I thought this was orthogonal to the transport, but it is not - consider if we want to use MPI Derived Datatypes to handle non-contiguous data.
The current implementation basically defines a sequence of seven "phases" to coordinate Kokkos execution space instances, MPI communicator, and associated non-contiguous data handling operations:
Five phases to get the communication posted
Plus two more that happen after the communication has posted
All MPI operations and non-contiguous data handling must fit into these seven phases.
Future work:
Reduce fencing due to host-side MPI calls
One problem with a higher-level interface as we've started defining it today is that people want to launch computation between communication calls (as in the examples above). However, our semantics say that the communication is ordered in the execution space, which means for non-host spaces we have to fence internally in e.g. `isend` before we actually call `MPI_Isend`. So we get some pointless fences.

This introduces a "plan" construct like this:
Now our pure KokkosComm APIs would take this `handle` argument, through which the implementation tells the `plan` function whether it needs fences, and how its operation is implemented in terms of the 7 phases. The `plan` looks at all of those and issues the minimal number of fences. This returns one `KokkosComm::Req` for each async operation in the lambda.

We can still have a lower-level search-and-replace style API for interop as well. It can be implemented in terms of this plan, or more directly.
KokkosComm::wait, wait_any, wait_all
Free-standing functions to wait on `KokkosComm::Req`.
Replaces #64 and #32