WIP: Fix MPI Communicator destruction (again) #945

Open
wants to merge 1 commit into base: develop
Conversation

bprather
Collaborator

Remember #841? It's back!

A refresher: in CI tests, there was a semi-common (30-50%?) race condition between Parthenon's use of MPI and HDF5's use of MPI while writing the last outputs and cleaning up. Parthenon would MPI_Comm_free a communicator, and HDF5 would attempt an MPI_Comm_dup at the same time, which is apparently an issue for some MPI implementations.

I found (without explanation or extensive testing) that I could make that race condition less frequent by switching the destructor of Reduction objects to use MPI_Comm_disconnect instead of MPI_Comm_free. The difference is synchrony: MPI_Comm_free is asynchronous, and sets the current reference to MPI_COMM_NULL without actually destroying the communicator until every process has done the same. It "disconnects" from the communicator, if you will. MPI_Comm_disconnect, on the other hand, waits until it can confirm that the communicator has been destroyed. That is, it "frees" it (the MPI standard is so straightforward!).
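As a minimal sketch of the two options (the class and member names here are illustrative, not Parthenon's actual Reduction API), the destructor swap looks roughly like this:

```cpp
#include <mpi.h>

// Hypothetical stand-in for a reducer that owns its own communicator.
struct ReductionLike {
  MPI_Comm comm = MPI_COMM_NULL;

  explicit ReductionLike(MPI_Comm parent) {
    // Each reducer gets a private communicator duplicated from the parent.
    MPI_Comm_dup(parent, &comm);
  }

  ~ReductionLike() {
    if (comm == MPI_COMM_NULL) return;
    // Option A (what this PR reverts to): mark the communicator for deallocation
    // and return immediately; MPI destroys it once every rank has freed its handle.
    MPI_Comm_free(&comm);
    // Option B (the #841 workaround): block until outstanding communication on the
    // communicator completes, then deallocate.
    // MPI_Comm_disconnect(&comm);
  }
};
```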

In any case, after reorganizing communicators in KHARMA, I've found that when freeing a lot of communicators together, MPI_Comm_disconnect is very likely to just... hang forever. This happens in, for example, the destructor of Mesh, which calls the destructor of Packages_t, which calls every package's destructor, which, in codes that follow our best practices, frees every communicator at once. So, I imagine it will become a problem for other folks as well, manifesting as a run which finishes, prints statistics, and then just... doesn't return. This will be a problem in batch jobs, where a job that would otherwise return will instead keep spending core-hours without printing anything suspicious.
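To illustrate the teardown path described above (with hypothetical class names, not Parthenon's real Mesh/Packages_t types), this is roughly where the burst of frees happens:

```cpp
#include <mpi.h>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical per-package state holding several reducer communicators.
struct PackageState {
  std::vector<MPI_Comm> comms;

  ~PackageState() {
    for (auto &c : comms) {
      // With MPI_Comm_disconnect here instead, freeing many communicators
      // back-to-back is where the indefinite hang was observed.
      if (c != MPI_COMM_NULL) MPI_Comm_free(&c);
    }
  }
};

// Hypothetical mesh-like owner: destroying it destroys every package's state,
// so all communicators get released in one burst at the end of the run.
struct MeshLike {
  std::map<std::string, std::unique_ptr<PackageState>> packages;
};
```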

This PR just reverts to using MPI_Comm_free in the destructor, which fixes the hang in KHARMA and allows all tests to pass on my machine, both without any warnings, double-frees, etc. If it fails CI, we can talk about other sorts of hacks for keeping HDF5 out of our way while we destroy the Mesh.


@Yurlungur
Collaborator

I think the restriction here, if using MPI_Comm_free, is that reducers cannot be stored in the driver; they have to be held in objects that have the same lifespan as (or shorter than) the mesh?

@bprather
Collaborator Author

bprather commented Sep 29, 2023

Not quite. free and disconnect nominally do exactly the same thing: destroy the communicator. The only difference is synchronicity. Since both the Driver and Mesh are torn down before MPI_Finalize is called, there's not really a danger of lingering communicators here, so long as one scopes them in something.
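A minimal sketch of that lifetime argument (ScopedComm is a hypothetical RAII wrapper, not something in Parthenon): as long as the owning object goes out of scope before MPI_Finalize, the free has a chance to complete collectively.

```cpp
#include <mpi.h>

// Hypothetical RAII wrapper: the communicator lives exactly as long as its owner.
struct ScopedComm {
  MPI_Comm comm = MPI_COMM_NULL;
  explicit ScopedComm(MPI_Comm parent) { MPI_Comm_dup(parent, &comm); }
  ~ScopedComm() {
    if (comm != MPI_COMM_NULL) MPI_Comm_free(&comm);
  }
};

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  {
    // Stand-in for a reducer held by a package (or anything scoped inside the Mesh).
    ScopedComm reducer(MPI_COMM_WORLD);
    // ... use reducer.comm for reductions during the run ...
  }  // destroyed here, well before MPI_Finalize
  MPI_Finalize();
  return 0;
}
```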

In a world without HDF5 and with perfect thread safety, we'd call free whenever we wanted as we tear down the Mesh object -- or we'd call disconnect and wait a little longer. But in our situation, free caused issues with HDF5, crashing a call to MPI_Comm_dup somewhere inside HDF5, at least in the CI runs. However, disconnect is causing me much worse problems than free did, hence the proposed switch back.

Until anyone else runs into this issue, or I run into the old race condition, I'm fine letting this PR sit here as WIP (it's not like it's going to fall out of date!). If it is going to be upstreamed, I should dig into reproducing the race condition we saw in #841 and verify that newer Parthenon/HDF5/production simulations won't hit it.
