
Coalesced Buffer Communication #1192

Open · wants to merge 117 commits into base: develop

Conversation

lroberts36 (Collaborator) commented Oct 17, 2024

PR Summary

Coalesced buffer communication; see the included docs for a description. This came out of the combined-buffer communication work we did for the TACC Hackathon.

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
    • Short summary of API changes at the top of the PR (optionally with an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

lroberts36 (Collaborator, Author) commented Oct 19, 2024

TODO:

  • Write code for sending message structure info on remesh and (sort of) test
  • Migrate combined buffers to CommBuffers
  • Point BndId object to associated combined buffers
  • Add combined buffer sends
  • Add combined buffer receives
  • Add combined buffer packing kernel (involves getting BndId arrays on device)
  • Add combined buffer unpacking kernel (involves getting BndId arrays on device)
  • Stop sending single buffers
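Schematically, the send half of these items amounts to concatenating every variable-boundary buffer bound for a given neighbor rank into one contiguous message, so a single MPI_Isend replaces many small ones. A minimal plain-MPI sketch (hypothetical names and layout, not this PR's actual API):

#include <mpi.h>
#include <vector>

// Hypothetical sketch of a coalesced send: concatenate all per-boundary
// buffers headed to one rank and post a single send. The coalesced storage
// must stay alive until the matching MPI_Wait completes.
void CoalescedSend(const std::vector<std::vector<double>> &bufs,
                   std::vector<double> &coalesced, int dest_rank, int tag,
                   MPI_Comm comm, MPI_Request *req) {
  coalesced.clear();
  for (const auto &b : bufs)
    coalesced.insert(coalesced.end(), b.begin(), b.end());
  MPI_Isend(coalesced.data(), static_cast<int>(coalesced.size()), MPI_DOUBLE,
            dest_rank, tag, comm, req);
}

// The receiver sizes its buffer from message-structure metadata exchanged at
// remesh time (the first TODO item) and unpacks by offset.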

brryan (Collaborator) left a comment

These small comments are probably just a distraction right now, but I figured I would record them. Nothing stands out to me as an issue right now.

Review threads on: src/bvals/comms/bnd_info.cpp, src/bvals/comms/bnd_info.hpp, src/bvals/comms/combined_buffers.cpp, src/bvals/comms/combined_buffers.hpp
brryan (Collaborator) left a comment

LG(reat)TM! I only had some small queries; I didn't detect any issues in the logic.

Review threads on: doc/sphinx/src/boundary_communication.rst, example/fine_advection/advection_driver.cpp, src/basic_types.hpp, src/bvals/comms/boundary_communication.cpp, src/bvals/comms/coalesced_buffers.cpp
Comment on lines +91 to +92
do_coalesced_comms{
    pin->GetOrAddBoolean("parthenon/mesh", "do_coalesced_comms", true)} {
Collaborator:

I think defaulting to true is reasonable; we already have some evidence that your solution produces improved performance, at least outside of AthenaPK-style workflows.
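For reference, the flag quoted above would be toggled from an input deck. A minimal sketch (the block and parameter names come from the GetOrAddBoolean call above; the rest of the deck is omitted):

<parthenon/mesh>
do_coalesced_comms = true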

Review thread on: src/utils/communication_buffer.hpp
BenWibking (Collaborator) commented Nov 25, 2024

The macOS CI failed with this error:

Start 38: Swarm memory management
38/70 Test #38: Swarm memory management ....................................................***Failed    0.03 sec
Filters: Swarm memory management
*** The MPI_Comm_dup() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[Mac-1732552932821.local:12435] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

The macOS runner uses OpenMPI, so I'm not sure why it would fail here but work for the Linux MPI CI...?

Edit: nvm, it fails here too: https://github.com/parthenon-hpc-lab/parthenon/actions/runs/12014756878/job/33491277357?pr=1192
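For context on the failure mode: the MPI standard forbids communicator operations before MPI_Init, and MPI_Initialized is one of the few calls that is legal beforehand. A minimal sketch of the usual guard (hypothetical helper, not code from this PR or from Parthenon):

#include <mpi.h>

// Hypothetical guard: only duplicate a communicator once MPI is initialized,
// avoiding the pre-MPI_INIT abort seen in the CI log above.
MPI_Comm SafeCommDup(MPI_Comm base) {
  int initialized = 0;
  MPI_Initialized(&initialized);  // legal to call before MPI_Init
  if (!initialized) return MPI_COMM_NULL;
  MPI_Comm dup;
  MPI_Comm_dup(base, &dup);
  return dup;
}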

lroberts36 (Collaborator, Author) commented:

> Edit: nvm, it fails here too: https://github.com/parthenon-hpc-lab/parthenon/actions/runs/12014756878/job/33491277357?pr=1192

@BenWibking: Yeah, I think this is a problem with develop, not this particular PR. It looks like all other PRs are currently failing this too...

brryan (Collaborator) commented Nov 25, 2024

> Edit: nvm, it fails here too: https://github.com/parthenon-hpc-lab/parthenon/actions/runs/12014756878/job/33491277357?pr=1192
>
> @BenWibking: Yeah, I think this is a problem with develop, not this particular PR. It looks like all other PRs are currently failing this too...

I'll take a look at this; I don't understand how Swarm CI issues keep getting into develop.

Yurlungur (Collaborator) left a comment

Really impressive that such a big feature is encapsulated in only a 1200-line diff. Nice work. Really excited to try this out. Testing it in riot on a few cores on my laptop now.

Comment on lines +506 to +510
.. code::

   parthenon/mesh/do_coalesced_comms = true

currently by default this is set to ``true``.
Collaborator:

If we think this works for all downstreams, including kharma, artemis, and riot, I am in favor of the default being true. If there's some doubt, we should maybe change the default to false.

lroberts36 (Collaborator, Author) replied:

Yeah, I tend to lean toward defaulting to false until there is more downstream testing. To make sure it passes the regression tests, though, it needs to be set to true for now (or we would have to change all the parameter input). There is some discussion of this above, where @brryan suggested we keep true.

Collaborator replied:

I am fine with it being default true. But I would also be fine modifying all the tests to set it to true manually.

Collaborator replied:

I'm in principle also happy with default true (assuming that all downstream codes work/perform as expected, as others already noted).

Review thread on: doc/sphinx/src/boundary_communication.rst
Comment on lines 566 to 568
- Currently, there is a ``Compare`` method in ``CoalescedBuffer`` that is just for
debugging. It should compare the received coalesced messages to the variable-boundary buffer
messages, but using it requires some hacks in the code to send both types of buffers.
Collaborator:

What are the hacks? Might be worth saying what to do?

lroberts36 (Collaborator, Author) replied:

I guess I removed CoalescedBuffer::Compare at some point, so this note isn't very useful anymore (and the hacks are a bit hard to describe quickly here). As a result, I just removed this point from the doc.

Review threads on: doc/sphinx/src/boundary_communication.rst, example/fine_advection/advection_driver.cpp, src/mesh/mesh.hpp
Comment on lines 39 to 48
struct uid_set_hash {
  std::size_t operator()(const std::set<Uid_t> &in) const {
    std::size_t lhs{0};
    for (const auto &uid : in) {
      std::size_t rhs = std::hash<Uid_t>()(uid);
      // boost::hash_combine-style mixing step (0x9e3779b9 = 32-bit golden ratio)
      lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
    }
    return lhs;
  }
};
Collaborator:

This is like the third version of a hash we've implemented, lol. Is there any way to share some code between hashers?

lroberts36 (Collaborator, Author) replied:

Yeah, maybe "unify hash functions" can be added as a good first-PR issue.
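For what it's worth, a shared helper in the spirit of boost::hash_combine could back all of the hashers; a minimal sketch (names hypothetical, not code from this PR):

#include <cstddef>
#include <functional>
#include <set>

// Hypothetical shared mixing step, reusable by every hasher.
template <class T>
void hash_combine(std::size_t &seed, const T &v) {
  seed ^= std::hash<T>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// The set hasher quoted above then reduces to a loop over hash_combine.
template <class T>
struct set_hash {
  std::size_t operator()(const std::set<T> &in) const {
    std::size_t seed{0};
    for (const auto &v : in) hash_combine(seed, v);
    return seed;
  }
};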

Review threads on: src/bvals/comms/coalesced_buffers.hpp, src/bvals/comms/coalesced_buffers.cpp
Collaborator:

I have to admit I did not understand everything going on in this file.

Collaborator replied:

I do appreciate all the comments in this file (in combination with the doc above), but I also have to second that I'll trust the regression (and downstream) testing that the logic in here works as planned.

Yurlungur (Collaborator) commented:

Building on clang I get these warnings, which would be nice to remove by being careful about print statements:

[ 55%] Building CXX object parthenon/src/CMakeFiles/parthenon.dir/bvals/bvals.cpp.o
/home/jonahm/programming/riot/external/parthenon/src/bvals/comms/bnd_id.cpp: In member function ‘void parthenon::BndId::PrintInfo(const string&)’:
/home/jonahm/programming/riot/external/parthenon/src/bvals/comms/bnd_id.cpp:63:12: warning: format ‘%i’ expects argument of type ‘int’, but argument 8 has type ‘size_t’ {aka ‘long unsigned int’} [-Wformat=]
   63 |          "%i, buffer size = %i, buf_allocated = %i) [rank = %i]\n",
      |           ~^
      |            |
      |            int
      |           %li
   64 |          start.c_str(), Variable<Real>::GetLabel(var_id()).c_str(), send_gid(),
   65 |          recv_gid(), start_idx(), size(), coalesced_buf.size(), buf.size(), buf_allocated,
      |                                           ~~~~~~~~~~~~~~~~~~~~
      |                                                             |
      |                                                             size_t {aka long unsigned int}
/home/jonahm/programming/riot/external/parthenon/src/bvals/comms/bnd_id.cpp:63:30: warning: format ‘%i’ expects argument of type ‘int’, but argument 9 has type ‘size_t’ {aka ‘long unsigned int’} [-Wformat=]
   63 |          "%i, buffer size = %i, buf_allocated = %i) [rank = %i]\n",
      |                             ~^
      |                              |
      |                              int
      |                             %li
   64 |          start.c_str(), Variable<Real>::GetLabel(var_id()).c_str(), send_gid(),
   65 |          recv_gid(), start_idx(), size(), coalesced_buf.size(), buf.size(), buf_allocated,
      |                                                                 ~~~~~~~~~~
      |                                                                         |
      |                                                                         size_t {aka long unsigned int}

Should be easy enough to just do what the compiler says and change %i to %ld.
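A minimal sketch of the fix (illustrative function, not the PR's actual PrintInfo; %zu is the portable size_t conversion, and a static_cast<long> matches the compiler's %ld suggestion):

#include <cstddef>
#include <cstdio>

// Hypothetical excerpt: size_t arguments need a matching format specifier.
void PrintBufSizes(std::size_t coalesced_size, std::size_t buf_size) {
  // Portable: %zu is the dedicated conversion for size_t.
  std::printf("coalesced buf size = %zu, buffer size = %zu\n",
              coalesced_size, buf_size);
  // Or follow the compiler's suggestion with an explicit cast:
  std::printf("coalesced buf size = %ld, buffer size = %ld\n",
              static_cast<long>(coalesced_size), static_cast<long>(buf_size));
}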

Yurlungur (Collaborator) commented:

Naively running this in riot on the triple problem, coalesced comms hangs forever with sparsity on. Thoughts, @lroberts36?

If that's not an easy fix, I'm OK merging this, but we need to change the default to false.

Yurlungur (Collaborator) commented:

> Naively running this in riot on the triple problem, coalesced comms hangs forever with sparsity on. Thoughts, @lroberts36?
>
> If that's not an easy fix, I'm OK merging this, but we need to change the default to false.

Looks like it still hangs in riot even if I turn off sparsity, suggesting it may have to do with how riot uses mesh data?

lroberts36 (Collaborator, Author) commented:

@Yurlungur: Hm, let me take a look at it to see if there is some easy fix.

pgrete (Collaborator) left a comment

Overall this looks good, and I think I conceptually followed what's going on.
I have some additional comments (and the view-of-views allocation might need to be addressed).
I also plan to do some downstream testing this week (ideally tomorrow, knowing your schedule) and then approve if things work as expected.

Review thread on: doc/sphinx/src/boundary_communication.rst

same_to_same = pmb->gid == nb.gid && nb.offsets.IsCell();
lcoord_trans = nb.lcoord_trans;
if (!allocated) return;
Collaborator:

Why can/should we go past this point now?
Was this related to the bug with the buffer size being 0 on first pass?

lroberts36 (Collaborator, Author) replied:

Previously, I just bailed here because there was no point in doing the extra index-range calculations. As you note, this is what caused the buffer-size-zero-on-first-pass bug. Removing it shouldn't impact any behavior in pre-existing code, and I doubt it had any noticeable performance impact.

Review threads on: src/bvals/comms/boundary_communication.cpp, src/bvals/comms/coalesced_buffers.hpp, src/bvals/comms/coalesced_buffers.cpp
auto &bids = GetBndIdsOnDevice(vars, &comb_size);
Kokkos::parallel_for(
    PARTHENON_AUTO_LABEL,
    Kokkos::TeamPolicy<>(parthenon::DevExecSpace(), bids.size(), Kokkos::AUTO),
Collaborator:

bids is potentially a low number (lower than the number of compute units per device), isn't it?
Just so that we keep this in mind with respect to the kernels' efficiency.

lroberts36 (Collaborator, Author) replied:

Yes, this could definitely be somewhere we don't produce enough teams.
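To illustrate the concern: with one team per entry of bids, a small bids.size() can leave most of the device idle. One alternative (an illustrative sketch with hypothetical names, not this PR's kernel) is to flatten the work into a single index range over all elements of all buffers:

#include <Kokkos_Core.hpp>

// Hypothetical sketch: pack n_bufs buffers of elems_per_buf entries each with
// one flat range, so occupancy does not depend on the number of buffers.
void PackFlat(Kokkos::View<double **> bufs, Kokkos::View<const double **> src,
              int n_bufs, int elems_per_buf) {
  Kokkos::parallel_for(
      "pack_flat", Kokkos::RangePolicy<>(0, n_bufs * elems_per_buf),
      KOKKOS_LAMBDA(const int idx) {
        const int b = idx / elems_per_buf;  // which buffer
        const int i = idx % elems_per_buf;  // element within that buffer
        bufs(b, i) = src(b, i);
      });
}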

Review thread on: src/bvals/comms/coalesced_buffers.cpp
// Unpack into per combined buffer information
int idx{nglobal};

for (int p = 0; p < npartitions; ++p) {
Collaborator:

Not related to this piece of code, but a more general question/comment: do we still support arbitrary partitions at runtime (i.e., anything other than the "all blocks" partition and the default split based on pack_size)?

lroberts36 (Collaborator, Author) replied:

This PR only allows for coalesced communication on the default partition (i.e., the one determined by pack_size). That is because there is a single coalesced message (plus a message for sparse info) for each MeshData. The contents of these messages need to be determined and sent to neighbor ranks during remeshing, and we only do this for the default partitions. I am guessing we could add communication information for other partitions at the same time if we really wanted to, but there are probably some subtle gotchas in there.

Of course, I don't think this prevents you from setting up a different partitioning that you don't communicate on. Also, I think the issues with this PR we are seeing in Riot are related to an issue with the "all blocks" partition (i.e., it doesn't get a partition id assigned). It is probably necessary to generalize partition ids to include the all-blocks partition (as well as other possible partitions, maybe).
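As a rough mental model of that constraint (hypothetical types, not this PR's actual data structures): the coalesced state is effectively keyed by (neighbor rank, default-partition id), so a partition that is unknown at remeshing time, like the "all blocks" one, simply has no entry:

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical sketch: one coalesced message per (neighbor rank, partition).
struct CoalescedMessage {
  std::vector<double> payload;            // packed boundary-buffer data
  std::vector<std::int32_t> sparse_info;  // allocation flags sent alongside
};

using RankPartition = std::pair<int, int>;  // (neighbor rank, partition id)
std::map<RankPartition, CoalescedMessage> coalesced_messages;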


pgrete (Collaborator) commented Nov 28, 2024

I now got the chance to do some downstream testing.
It seems to work without issues (something funny is going on with the CI machine, but I expect this to be orthogonal to this PR).

There also looks to be a small performance penalty from the coalesced buffers, even with many blocks per device.
I tried a 2-node run with 8 GPUs total and 512 blocks per GPU, and the coalesced version ran about 10% slower (without any more detailed profiling).
I could also imagine getting different numbers when using a (significantly) larger number of nodes.
For now I'll probably keep the AthenaPK default on the legacy buffer filling until I get the chance to do some more detailed profiling.

Yurlungur (Collaborator) commented:

> I now got the chance to do some downstream testing. It seems to work without issues (something funny is going on with the CI machine, but I expect this to be orthogonal to this PR).
>
> There also looks to be a small performance penalty from the coalesced buffers, even with many blocks per device. I tried a 2-node run with 8 GPUs total and 512 blocks per GPU, and the coalesced version ran about 10% slower (without any more detailed profiling). I could also imagine getting different numbers when using a (significantly) larger number of nodes. For now I'll probably keep the AthenaPK default on the legacy buffer filling until I get the chance to do some more detailed profiling.

Based on this, and based on the fact that I can't get riot to cycle with coalesced comms on even after modifying it to look more like the examples, I think we should change the default to disabling coalesced comms.
