Fix segfaults after terminate() #4351

roystgnr · 2025-12-10T16:20:06Z

#4237 caused us to segfault (at least on some systems) whenever we die from a thrown exception. Cleaning up our new fancy thread-safe shims from our terminate handler in addition to from our LibMeshInit destructor fixes that for me.

I don't know of any compilers that unwind before terminate(), but the standard says it's allowed and we don't want to hang if it happens.
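
As a rough illustration of the idea in the description above (undoing a stream shim from both a destructor and a terminate handler), here is a minimal sketch. It is not the libMesh implementation: `InitGuard`, `install_stream_shim()`, `shim_buffer`, and the body of `cleanup_stream_buffers()` are assumptions made up for this example, and a real handler in an MPI code would abort through MPI rather than `std::abort()`.

```cpp
#include <cassert>
#include <cstdlib>
#include <exception>
#include <iostream>
#include <sstream>
#include <streambuf>

namespace sketch
{
// Hypothetical stand-in for a stream shim: a buffer temporarily installed
// into std::cout, plus the buffer that was there before.
std::stringbuf shim_buffer;
std::streambuf * original_cout_buffer = nullptr;

// Swap the shim into std::cout, remembering what to restore later.
void install_stream_shim()
{
  if (!original_cout_buffer)
    original_cout_buffer = std::cout.rdbuf(&shim_buffer);
}

// Restore the original buffer. Safe to call more than once, so it can be
// reached from both a destructor and a terminate handler.
void cleanup_stream_buffers()
{
  if (original_cout_buffer)
    {
      std::cout.rdbuf(original_cout_buffer);
      original_cout_buffer = nullptr;
    }
}

// Terminate handler for this sketch: undo the shim first, then abort.
[[noreturn]] void terminate_handler()
{
  cleanup_stream_buffers();
  std::abort();
}

// RAII object playing the role of LibMeshInit in this sketch.
struct InitGuard
{
  InitGuard()
  {
    install_stream_shim();
    old_handler = std::set_terminate(terminate_handler);
  }

  ~InitGuard()
  {
    std::set_terminate(old_handler);
    cleanup_stream_buffers();
  }

  std::terminate_handler old_handler;
};
}
```

Because the cleanup is idempotent, it does not matter whether the implementation unwinds the stack (running the destructor) before calling the handler or calls the handler directly; either path leaves `std::cout` pointing at its original buffer.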

lindsayad · 2025-12-10T16:43:47Z

src/base/libmesh.C

+  if (std::uncaught_exceptions())
+#endif
+    this->comm().barrier();


This seems like a potential parallel hang? Generally speaking, we can't count on all ranks uniformly having/not-having uncaught exceptions.

jwpeterson · Dec 10, 2025

I was wondering about this too, but we already have to handle exceptions thrown on subsets of processors by rethrowing them on all procs, so I don't think this introduces a new hang...

roystgnr · Dec 10, 2025

Let me distinguish between "not-yet-caught" (someone's LibMeshInit is inside a try block and the catch block will handle this exception) and "terminating" (we're unwinding before the terminate handler) exceptions.

If everyone has an exception at the same time, we're fine; nobody calls the barrier().

If nobody has exceptions, we're fine; everybody completes the barrier().

If any rank has a terminating exception, we're fine. Some ranks might think they're exiting cleanly and call barrier(), but then the terminating rank(s) get to MPI_Abort(), and the MPI stack brings down everybody.

If a rank has a not-yet-caught exception, but other ranks still think they're fine to keep doing libMesh communication, then the old code was headed for a parallel hang, with the exception-thrower's barrier() conflicting with another rank's comm.whatever(foo). Now there's at least some possibility of an application's clever catch() handler untangling such a mess.

If a rank has a not-yet-caught exception, but other ranks think they're completely done with libMesh communication and destruct LibMeshInit, the catch handler now needs to be changed, e.g. to have its own barrier() as part of its untangling process. So, technically this is an API-changing diff, but honestly if anyone has written code that usefully catches an exception that unwinds LibMeshInit, they can probably also handle the change. ;-)
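
The intended behavior in the cases above (barrier on a clean exit, no barrier while an exception is propagating) can be sketched without MPI. This is only an illustration under stated assumptions: `FakeComm` is a hypothetical stand-in that counts calls instead of communicating, `InitGuard` stands in for LibMeshInit, and `clean_exit()` and `throwing_exit()` are helper names invented for the example. It requires C++17 for `std::uncaught_exceptions()`.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Hypothetical stand-in for a parallel communicator; it only counts calls.
struct FakeComm
{
  int barrier_calls = 0;
  void barrier() { ++barrier_calls; }
};

// Plays the role of LibMeshInit in this sketch.
class InitGuard
{
public:
  explicit InitGuard(FakeComm & comm) : _comm(comm) {}

  ~InitGuard()
  {
    // Synchronize only on a clean exit. While an exception is propagating
    // through this destructor, skip the collective call.
    if (!std::uncaught_exceptions())
      _comm.barrier();
  }

private:
  FakeComm & _comm;
};

// A scope that exits normally: the destructor reaches the barrier.
void clean_exit(FakeComm & comm)
{
  InitGuard guard(comm);
}

// A scope that exits via a "not-yet-caught" exception: the destructor runs
// during unwinding and skips the barrier.
void throwing_exit(FakeComm & comm)
{
  try
    {
      InitGuard guard(comm);
      throw std::runtime_error("simulated failure on this rank");
    }
  catch (const std::runtime_error &)
    {
      // An application's handler would do its own untangling here, which
      // per the last case above may now need its own barrier().
    }
}
```

Calling `clean_exit(comm)` increments the counter once, while `throwing_exit(comm)` leaves it unchanged.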

lindsayad · Dec 10, 2025

> Let me distinguish between "not-yet-caught" (someone's LibMeshInit is inside a try block and the catch block will handle this exception) and "terminating" (we're unwinding before the terminate handler) exceptions.

Reading https://en.cppreference.com/w/cpp/error/uncaught_exception.html, I don't see any difference between "not-yet-caught" and "terminating". It seems like all that matters is whether we're stack unwinding, which would be true in both cases.

> If everyone has an exception at the same time, we're fine; nobody calls the barrier().
>
> If nobody has exceptions, we're fine; everybody completes the barrier().

This seems exactly backwards to me? Shouldn't it be "if everyone is stack unwinding, then everyone completes the barrier" and "if no-one is stack unwinding, then no-one calls/goes-through the barrier"?

roystgnr · Dec 11, 2025

> It seems like all that matters is whether we're stack unwinding, which would be true in both cases.

It's not. If we're not-yet-caught, then we're stack unwinding. But if we're terminating, then the compiler is allowed to unwind the stack before calling the terminate handler but is also allowed to just call the terminate handler without unwinding anything.

To uncaught_exceptions(), all that matters is whether we're stack unwinding or not, but to our attempt to avoid a parallel hang we want to consider the differing cases.
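
To make the "not-yet-caught" case concrete, the following sketch records what `std::uncaught_exceptions()` reports before a throw, inside a destructor that runs during unwinding, and inside the matching handler. `Probe`, `UnwindObservations`, and `observe_not_yet_caught()` are hypothetical names for this example. The "terminating" case is deliberately not exercised: if the implementation calls the terminate handler without unwinding, no destructor runs at all, so there is nothing to observe there.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// What std::uncaught_exceptions() returned at three points in time.
struct UnwindObservations
{
  int before_throw = -1;
  int in_destructor = -1;
  int in_handler = -1;
};

// Records the uncaught-exception count when it is destroyed.
struct Probe
{
  explicit Probe(int & slot) : _slot(slot) {}
  ~Probe() { _slot = std::uncaught_exceptions(); }
  int & _slot;
};

UnwindObservations observe_not_yet_caught()
{
  UnwindObservations seen;
  try
    {
      Probe probe(seen.in_destructor);
      seen.before_throw = std::uncaught_exceptions();
      throw std::runtime_error("not yet caught");
    }
  catch (const std::runtime_error &)
    {
      // The exception counts as caught once its handler has been entered.
      seen.in_handler = std::uncaught_exceptions();
    }
  return seen;
}
```

The count is nonzero only while the stack is actually being unwound, which is the window in which a destructor such as ~LibMeshInit would see it.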

> This seems exactly backwards to me? Shouldn't it be "if everyone is stack unwinding, then everyone completes the barrier" and "if no-one is stack unwinding, then no-one calls/goes-through the barrier"?

If everyone is stack unwinding ... damn it. I didn't type the ! before std::uncaught_exceptions(), did I? I swear it's there in the imaginary code in my head. My comments here are describing what the code should be doing; yours are describing what it actually is doing.

I'll fix it now. Thank you!!

moosebuild · 2025-12-10T21:02:48Z

Job Coverage, step Generate coverage on cd248fb wanted to post the following:

Coverage

| | 069b1b Total | #4351 cd248f Total | +/- | New |
|---|---|---|---|---|
| Rate | 65.26% | 65.27% | +0.00% | 58.82% |
| Hits | 77390 | 77396 | +6 | 10 |
| Misses | 41189 | 41188 | -1 | 7 |

Diff coverage report

Full coverage report

Warnings

New line coverage rate 58.82% is less than the suggested 90.0%

This comment will be updated on new commits.

roystgnr added 4 commits December 10, 2025 10:14

Comment closing brace

550d198

Factor out cleanup_stream_buffers()

2ec2155

cleanup_stream_buffers() in terminate()

5957c54

Check for exception unwinding in ~LibMeshInit

1f3cf47

I don't know of any compilers that unwind before terminate(), but the standard says it's allowed and we don't want to hang if it happens.

lindsayad reviewed Dec 10, 2025

jwpeterson approved these changes Dec 10, 2025

Call barrier() iff we're *not* stack unwinding

cd248fb

I was feeling proud that my lengthy comment clearly explained what the code here is intended to do; it would be even better if the code itself also matched what the code is intended to do.

lindsayad approved these changes Dec 11, 2025

lindsayad merged commit acf6265 into libMesh:devel Dec 11, 2025
21 checks passed

roystgnr added a commit to roystgnr/TIMPI that referenced this pull request Dec 11, 2025

Don't barrier() when we're unwinding an exception

8c9c5e1

See libMesh/libmesh#4351 for more context.

roystgnr mentioned this pull request Dec 11, 2025

Don't barrier() when we're unwinding an exception libMesh/TIMPI#158

Merged

roystgnr deleted the fix_terminate_segfault branch December 11, 2025 23:10
