Skip to content

Conversation

@roystgnr
Copy link
Member

#4237 caused us to segfault (at least on some systems) whenever we die from a thrown exception. Cleaning up our new fancy thread-safe shims from our terminate handler in addition to from our LibMeshInit destructor fixes that for me.

I don't know of any compilers that unwind before terminate(), but the
standard says it's allowed and we don't want to hang if it happens.
Comment on lines 844 to 846
if (std::uncaught_exceptions())
#endif
this->comm().barrier();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a potential parallel hang? Generally speaking we can't count on all ranks uniformly having/not-having uncaught exceptions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering about this too, but we already have to handle exceptions thrown on subsets of processors by rethrowing them on all procs, so I don't think this introduces a new hang...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me distinguish between "not-yet-caught" (someone's LibMeshInit is inside a try block and the catch block will handle this exception) and "terminating" (we're unwinding before the terminate handler) exceptions.

If everyone has an exception at the same time, we're fine; nobody calls the barrier().

If nobody has exceptions, we're fine; everybody completes the barrier().

If any rank has a terminating exception, we're fine. Some ranks might think they're exiting cleanly and call barrier(), but then the terminating rank(s) get to MPI_Abort(), and the MPI stack brings down everybody.

If a rank has a not-yet-caught exception, but other ranks still think they're fine to keep doing libMesh communication, then in the old code they were going to have a parallel hang, with the exception-thrower's barrier() conflicting with another rank's comm.whatever(foo), but now there's at least some possibility of an application's clever catch() handler untangling such a mess.

If a rank has a not-yet-caught exception, but other ranks think they're completely done with libMesh communication and destruct LibMeshInit, the catch handler now needs to be changed, e.g. to have its own barrier() as part of its untangling process. So, technically this is an API-changing diff, but honestly if anyone has written code that usefully catches an exception that unwinds LibMeshInit, they can probably also handle the change. ;-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me distinguish between "not-yet-caught" (someone's LibMeshInit is inside a try block and the catch block will handle this exception) and "terminating" (we're unwinding before the terminate handler) exceptions.

Reading https://en.cppreference.com/w/cpp/error/uncaught_exception.html I don't see any difference between "not-yet-caught" and "terminating". It seems like all that matters is whether we're stack unwinding, which would be true in both cases.

If everyone has an exception at the same time, we're fine; nobody calls the barrier().

If nobody has exceptions, we're fine; everybody completes the barrier().

This seems exactly backwards to me? Shouldn't it be "if everyone is stack unwinding, then everyone completes the barrier" and "if no-one is stack unwinding, then no-one calls/goes-through the barrier"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like all that matters is whether we're stack unwinding, which would be true in both cases.

It's not. If we're not-yet-caught, then we're stack unwinding. But if we're terminating, then the compiler is allowed to unwind the stack before calling the terminate handler but is also allowed to just call the terminate handler without unwinding anything.

To uncaught_exceptions(), all that matters is whether we're stack unwinding or not, but to our attempt to avoid a parallel hang we want to consider the differing cases.

This seems exactly backwards to me? Shouldn't it be "if everyone is stack unwinding, then everyone completes the barrier" and "if no-one is stack unwinding, then no-one calls/goes-through the barrier"?

If everyone is stack unwinding ... damn it. I didn't type the ! before std::uncaught_exceptions(), did I? I swear it's there in the imaginary code in my head. My comments here are describing what the code should be doing; yours are describing what it actually is doing.

I'll fix it now. Thank you!!

@moosebuild
Copy link

moosebuild commented Dec 10, 2025

Job Coverage, step Generate coverage on cd248fb wanted to post the following:

Coverage

069b1b #4351 cd248f
Total Total +/- New
Rate 65.26% 65.27% +0.00% 58.82%
Hits 77390 77396 +6 10
Misses 41189 41188 -1 7

Diff coverage report

Full coverage report

Warnings

  • New new line coverage rate 58.82% is less than the suggested 90.0%

This comment will be updated on new commits.

I was feeling proud that my lengthy comment clearly explained what the
code here is intended to do; it would be even better if the code itself
also matched what the code is intended to do.
@lindsayad lindsayad merged commit acf6265 into libMesh:devel Dec 11, 2025
21 checks passed
roystgnr added a commit to roystgnr/TIMPI that referenced this pull request Dec 11, 2025
@roystgnr roystgnr deleted the fix_terminate_segfault branch December 11, 2025 23:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants