
Coalesced Buffer Communication #1192

Open · wants to merge 117 commits into base: develop

Conversation

lroberts36 (Collaborator) commented Oct 17, 2024

PR Summary

Coalesced buffer communication; see the included docs for a description. This came out of the combined-buffer communication work we did for the TACC Hackathon.

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
    • Short summary of API changes at the top of the PR (optionally with an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

lroberts36 (Collaborator, Author) commented Oct 19, 2024

TODO:

  • Write code for sending message structure info on remesh and (sort of) test
  • Migrate combined buffers to CommBuffers
  • Point BndId object to associated combined buffers
  • Add combined buffer sends
  • Add combined buffer receives
  • Add combined buffer packing kernel (involves getting BndId arrays on device)
  • Add combined buffer unpacking kernel (involves getting BndId arrays on device)
  • Stop sending single buffers
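Schematically, the send half of these items amounts to concatenating every variable-boundary buffer bound for a given neighbor rank into one contiguous message, so a single MPI_Isend replaces many small ones. A minimal plain-MPI sketch (hypothetical names and layout, not this PR's actual API):

#include <mpi.h>
#include <vector>

// Hypothetical sketch of a coalesced send: concatenate all per-boundary
// buffers headed to one rank and post a single send. The coalesced storage
// must stay alive until the matching MPI_Wait completes.
void CoalescedSend(const std::vector<std::vector<double>> &bufs,
                   std::vector<double> &coalesced, int dest_rank, int tag,
                   MPI_Comm comm, MPI_Request *req) {
  coalesced.clear();
  for (const auto &b : bufs)
    coalesced.insert(coalesced.end(), b.begin(), b.end());
  MPI_Isend(coalesced.data(), static_cast<int>(coalesced.size()), MPI_DOUBLE,
            dest_rank, tag, comm, req);
}

// The receiver sizes its buffer from message-structure metadata exchanged at
// remesh time (the first TODO item) and unpacks by offset.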

brryan (Collaborator) left a comment

These small comments are probably just a distraction right now, but I figured I would record them. Nothing stands out to me as an issue right now.

Review threads on: src/bvals/comms/bnd_info.cpp, src/bvals/comms/bnd_info.hpp, src/bvals/comms/combined_buffers.cpp, src/bvals/comms/combined_buffers.hpp
brryan (Collaborator) left a comment

LG(reat)TM! I only had some small queries; I didn't detect any issues in the logic.

Review threads on: doc/sphinx/src/boundary_communication.rst, example/fine_advection/advection_driver.cpp, src/basic_types.hpp, src/bvals/comms/boundary_communication.cpp, src/bvals/comms/coalesced_buffers.cpp
Comment on lines +91 to +92
do_coalesced_comms{
    pin->GetOrAddBoolean("parthenon/mesh", "do_coalesced_comms", true)} {
Collaborator:

I think defaulting to true is reasonable; we already have some evidence that your solution produces improved performance, at least outside of AthenaPK-style workflows.
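For reference, the flag quoted above would be toggled from an input deck. A minimal sketch (the block and parameter names come from the GetOrAddBoolean call above; the rest of the deck is omitted):

<parthenon/mesh>
do_coalesced_comms = true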

Review thread on: src/utils/communication_buffer.hpp
BenWibking (Collaborator) commented Nov 25, 2024

The macOS CI failed with this error:

Start 38: Swarm memory management
38/70 Test #38: Swarm memory management ....................................................***Failed    0.03 sec
Filters: Swarm memory management
*** The MPI_Comm_dup() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[Mac-1732552932821.local:12435] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

The macOS runner uses OpenMPI, so I'm not sure why it would fail here but work for the Linux MPI CI...?

Edit: nvm, it fails here too: https://github.com/parthenon-hpc-lab/parthenon/actions/runs/12014756878/job/33491277357?pr=1192
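For context on the failure mode: the MPI standard forbids communicator operations before MPI_Init, and MPI_Initialized is one of the few calls that is legal beforehand. A minimal sketch of the usual guard (hypothetical helper, not code from this PR or from Parthenon):

#include <mpi.h>

// Hypothetical guard: only duplicate a communicator once MPI is initialized,
// avoiding the pre-MPI_INIT abort seen in the CI log above.
MPI_Comm SafeCommDup(MPI_Comm base) {
  int initialized = 0;
  MPI_Initialized(&initialized);  // legal to call before MPI_Init
  if (!initialized) return MPI_COMM_NULL;
  MPI_Comm dup;
  MPI_Comm_dup(base, &dup);
  return dup;
}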

lroberts36 (Collaborator, Author) commented:

> Edit: nvm, it fails here too: https://github.com/parthenon-hpc-lab/parthenon/actions/runs/12014756878/job/33491277357?pr=1192

@BenWibking: Yeah, I think this is a problem with develop, not this particular PR. It looks like all other PRs are currently failing this too...

brryan (Collaborator) commented Nov 25, 2024

> Edit: nvm, it fails here too: https://github.com/parthenon-hpc-lab/parthenon/actions/runs/12014756878/job/33491277357?pr=1192
>
> @BenWibking: Yeah, I think this is a problem with develop, not this particular PR. It looks like all other PRs are currently failing this too...

I'll take a look at this; I don't understand how Swarm CI issues keep getting into develop.

Yurlungur (Collaborator) left a comment

Really impressive that such a big feature is encapsulated in only a 1200-line diff. Nice work. Really excited to try this out. Testing it in riot on a few cores on my laptop now.

Comment on lines +506 to +510
.. code::

   parthenon/mesh/do_coalesced_comms = true

currently by default this is set to ``true``.
Collaborator:

If we think this works for all downstreams, including kharma, artemis, and riot, I am in favor of the default being true. If there's some doubt, we should maybe change the default to false.

lroberts36 (Collaborator, Author) replied:

Yeah, I tend to lean toward defaulting to false until there is more downstream testing. To make sure it passes the regression tests, though, it needs to be set to true for now (or we would have to change all the parameter input). There is some discussion of this above, where @brryan suggested we keep true.

Collaborator replied:

I am fine with it being default true. But I would also be fine modifying all the tests to set it to true manually.

Collaborator replied:

I'm in principle also happy with default true (assuming that all downstream codes work/perform as expected, as others already noted).

Review thread on: doc/sphinx/src/boundary_communication.rst
Comment on lines 566 to 568
- Currently, there is a ``Compare`` method in ``CoalescedBuffer`` that is just for
debugging. It should compare the received coalesced messages to the variable-boundary buffer
messages, but using it requires some hacks in the code to send both types of buffers.
Collaborator:

What are the hacks? Might be worth saying what to do?

lroberts36 (Collaborator, Author) replied:

I guess I removed CoalescedBuffer::Compare at some point, so this note isn't very useful anymore (and the hacks are a bit hard to describe quickly here). As a result, I just removed this point from the doc.

Review threads on: doc/sphinx/src/boundary_communication.rst, example/fine_advection/advection_driver.cpp, src/mesh/mesh.hpp
Comment on lines 39 to 48
struct uid_set_hash {
  std::size_t operator()(const std::set<Uid_t> &in) const {
    std::size_t lhs{0};
    for (const auto &uid : in) {
      std::size_t rhs = std::hash<Uid_t>()(uid);
      // boost::hash_combine-style mixing step (0x9e3779b9 = 32-bit golden ratio)
      lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
    }
    return lhs;
  }
};
Collaborator:

This is like the third version of a hash we've implemented, lol. Is there any way to share some code between hashers?

lroberts36 (Collaborator, Author) replied:

Yeah, maybe "unify hash functions" can be added as a good first-PR issue.
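For what it's worth, a shared helper in the spirit of boost::hash_combine could back all of the hashers; a minimal sketch (names hypothetical, not code from this PR):

#include <cstddef>
#include <functional>
#include <set>

// Hypothetical shared mixing step, reusable by every hasher.
template <class T>
void hash_combine(std::size_t &seed, const T &v) {
  seed ^= std::hash<T>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// The set hasher quoted above then reduces to a loop over hash_combine.
template <class T>
struct set_hash {
  std::size_t operator()(const std::set<T> &in) const {
    std::size_t seed{0};
    for (const auto &v : in) hash_combine(seed, v);
    return seed;
  }
};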

Review threads on: src/bvals/comms/coalesced_buffers.hpp, src/bvals/comms/coalesced_buffers.cpp
Collaborator:

I have to admit I did not understand everything going on in this file.

Collaborator replied:

I do appreciate all the comments in this file (in combination with the doc above), but I also have to second that I'll trust the regression (and downstream) testing that the logic in here works as planned.

Yurlungur (Collaborator) commented:

Building on clang I get these warnings, which would be nice to remove by being careful about print statements:

[ 55%] Building CXX object parthenon/src/CMakeFiles/parthenon.dir/bvals/bvals.cpp.o
/home/jonahm/programming/riot/external/parthenon/src/bvals/comms/bnd_id.cpp: In member function ‘void parthenon::BndId::PrintInfo(const string&)’:
/home/jonahm/programming/riot/external/parthenon/src/bvals/comms/bnd_id.cpp:63:12: warning: format ‘%i’ expects argument of type ‘int’, but argument 8 has type ‘size_t’ {aka ‘long unsigned int’} [-Wformat=]
   63 |          "%i, buffer size = %i, buf_allocated = %i) [rank = %i]\n",
      |           ~^
      |            |
      |            int
      |           %li
   64 |          start.c_str(), Variable<Real>::GetLabel(var_id()).c_str(), send_gid(),
   65 |          recv_gid(), start_idx(), size(), coalesced_buf.size(), buf.size(), buf_allocated,
      |                                           ~~~~~~~~~~~~~~~~~~~~
      |                                                             |
      |                                                             size_t {aka long unsigned int}
/home/jonahm/programming/riot/external/parthenon/src/bvals/comms/bnd_id.cpp:63:30: warning: format ‘%i’ expects argument of type ‘int’, but argument 9 has type ‘size_t’ {aka ‘long unsigned int’} [-Wformat=]
   63 |          "%i, buffer size = %i, buf_allocated = %i) [rank = %i]\n",
      |                             ~^
      |                              |
      |                              int
      |                             %li
   64 |          start.c_str(), Variable<Real>::GetLabel(var_id()).c_str(), send_gid(),
   65 |          recv_gid(), start_idx(), size(), coalesced_buf.size(), buf.size(), buf_allocated,
      |                                                                 ~~~~~~~~~~
      |                                                                         |
      |                                                                         size_t {aka long unsigned int}

Should be easy enough to just do what the compiler says and change %i to %ld.
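A minimal sketch of the fix (illustrative function, not the PR's actual PrintInfo; %zu is the portable size_t conversion, and a static_cast<long> matches the compiler's %ld suggestion):

#include <cstddef>
#include <cstdio>

// Hypothetical excerpt: size_t arguments need a matching format specifier.
void PrintBufSizes(std::size_t coalesced_size, std::size_t buf_size) {
  // Portable: %zu is the dedicated conversion for size_t.
  std::printf("coalesced buf size = %zu, buffer size = %zu\n",
              coalesced_size, buf_size);
  // Or follow the compiler's suggestion with an explicit cast:
  std::printf("coalesced buf size = %ld, buffer size = %ld\n",
              static_cast<long>(coalesced_size), static_cast<long>(buf_size));
}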

Yurlungur (Collaborator) commented:

Naively running this in riot on the triple problem, coalesced comms hangs forever with sparsity on. Thoughts, @lroberts36?

If that's not an easy fix, I'm OK merging this, but we need to change the default to false.

Yurlungur (Collaborator) commented:

> Naively running this in riot on the triple problem, coalesced comms hangs forever with sparsity on. Thoughts, @lroberts36?
>
> If that's not an easy fix, I'm OK merging this, but we need to change the default to false.

Looks like it still hangs in riot even if I turn off sparsity, suggesting it may have to do with how riot uses mesh data?

lroberts36 (Collaborator, Author) commented:

@Yurlungur: Hm, let me take a look at it to see if there is some easy fix.

pgrete (Collaborator) left a comment

Overall this looks good, and I think I conceptually followed what's going on.
I have some additional comments (and the view-of-views allocation might need to be addressed).
I also plan to do some downstream testing this week (ideally tomorrow, knowing your schedule) and then approve if things work as expected.

Review thread on: doc/sphinx/src/boundary_communication.rst

same_to_same = pmb->gid == nb.gid && nb.offsets.IsCell();
lcoord_trans = nb.lcoord_trans;
if (!allocated) return;
Collaborator:

Why can/should we go past this point now?
Was this related to the bug with the buffer size being 0 on first pass?

lroberts36 (Collaborator, Author) replied:

Previously, I just bailed here because there was no point in doing the extra index-range calculations. As you note, this is what caused the buffer-size-zero-on-first-pass bug. Removing it shouldn't impact any behavior in pre-existing code, and I doubt it had any noticeable performance impact.

Review threads on: src/bvals/comms/boundary_communication.cpp, src/bvals/comms/coalesced_buffers.hpp, src/bvals/comms/coalesced_buffers.cpp
auto &bids = GetBndIdsOnDevice(vars, &comb_size);
Kokkos::parallel_for(
    PARTHENON_AUTO_LABEL,
    Kokkos::TeamPolicy<>(parthenon::DevExecSpace(), bids.size(), Kokkos::AUTO),
Collaborator:

bids is potentially a low number (lower than the number of compute units per device), isn't it?
Just so that we keep this in mind with respect to the kernels' efficiency.

lroberts36 (Collaborator, Author) replied:

Yes, this could definitely be somewhere we don't produce enough teams.
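To illustrate the concern: with one team per entry of bids, a small bids.size() can leave most of the device idle. One alternative (an illustrative sketch with hypothetical names, not this PR's kernel) is to flatten the work into a single index range over all elements of all buffers:

#include <Kokkos_Core.hpp>

// Hypothetical sketch: pack n_bufs buffers of elems_per_buf entries each with
// one flat range, so occupancy does not depend on the number of buffers.
void PackFlat(Kokkos::View<double **> bufs, Kokkos::View<const double **> src,
              int n_bufs, int elems_per_buf) {
  Kokkos::parallel_for(
      "pack_flat", Kokkos::RangePolicy<>(0, n_bufs * elems_per_buf),
      KOKKOS_LAMBDA(const int idx) {
        const int b = idx / elems_per_buf;  // which buffer
        const int i = idx % elems_per_buf;  // element within that buffer
        bufs(b, i) = src(b, i);
      });
}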

Review thread on: src/bvals/comms/coalesced_buffers.cpp
// Unpack into per combined buffer information
int idx{nglobal};

for (int p = 0; p < npartitions; ++p) {
Collaborator:

Not related to this piece of code, but a more general question/comment: do we still support arbitrary partitions at runtime (i.e., anything other than the "all blocks" partition and the default split based on pack_size)?

lroberts36 (Collaborator, Author) replied:

This PR only allows for coalesced communication on the default partition (i.e., the one determined by pack_size). That is because there is a single coalesced message (plus a message for sparse info) for each MeshData. The contents of these messages need to be determined and sent to neighbor ranks during remeshing, and we only do this for the default partitions. I am guessing we could add communication information for other partitions at the same time if we really wanted to, but there are probably some subtle gotchas in there.

Of course, I don't think this prevents you from setting up a different partitioning that you don't communicate on. Also, I think the issues with this PR we are seeing in Riot are related to an issue with the "all blocks" partition (i.e., it doesn't get a partition id assigned). It is probably necessary to generalize partition ids to include the all-blocks partition (as well as other possible partitions, maybe).
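As a rough mental model of that constraint (hypothetical types, not this PR's actual data structures): the coalesced state is effectively keyed by (neighbor rank, default-partition id), so a partition that is unknown at remeshing time, like the "all blocks" one, simply has no entry:

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical sketch: one coalesced message per (neighbor rank, partition).
struct CoalescedMessage {
  std::vector<double> payload;            // packed boundary-buffer data
  std::vector<std::int32_t> sparse_info;  // allocation flags sent alongside
};

using RankPartition = std::pair<int, int>;  // (neighbor rank, partition id)
std::map<RankPartition, CoalescedMessage> coalesced_messages;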


pgrete (Collaborator) commented Nov 28, 2024

I now got the chance to do some downstream testing.
It seems to work without issues (something funny is going on with the CI machine, but I expect this to be orthogonal to this PR).

There also looks to be a small performance penalty from the coalesced buffers, even with many blocks per device.
I tried a 2-node run with 8 GPUs total and 512 blocks per GPU, and the coalesced version ran about 10% slower (without any more detailed profiling).
I could also imagine getting different numbers when using a (significantly) larger number of nodes.
For now I'll probably keep the AthenaPK default on the legacy buffer filling until I get the chance to do some more detailed profiling.

Yurlungur (Collaborator) commented:

> I now got the chance to do some downstream testing. It seems to work without issues (something funny is going on with the CI machine, but I expect this to be orthogonal to this PR).
>
> There also looks to be a small performance penalty from the coalesced buffers, even with many blocks per device. I tried a 2-node run with 8 GPUs total and 512 blocks per GPU, and the coalesced version ran about 10% slower (without any more detailed profiling). I could also imagine getting different numbers when using a (significantly) larger number of nodes. For now I'll probably keep the AthenaPK default on the legacy buffer filling until I get the chance to do some more detailed profiling.

Based on this, and based on the fact that I can't get riot to cycle with coalesced comms on even after modifying it to look more like the examples, I think we should change the default to disabling coalesced comms.
