-
Notifications
You must be signed in to change notification settings - Fork 300
Fix segfaults after terminate() #4351
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I don't know of any compilers that unwind before terminate(), but the standard says it's allowed and we don't want to hang if it happens.
src/base/libmesh.C
Outdated
| if (std::uncaught_exceptions()) | ||
| #endif | ||
| this->comm().barrier(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a potential parallel hang? Generally speaking we can't count on all ranks uniformly having/not-having uncaught exceptions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering about this too, but we already have to handle exceptions thrown on subsets of processors by rethrowing them on all procs, so I don't think this introduces a new hang...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me distinguish between "not-yet-caught" (someone's LibMeshInit is inside a try block and the catch block will handle this exception) and "terminating" (we're unwinding before the terminate handler) exceptions.
If everyone has an exception at the same time, we're fine; nobody calls the barrier().
If nobody has exceptions, we're fine; everybody completes the barrier().
If any rank has a terminating exception, we're fine. Some ranks might think they're exiting cleanly and call barrier(), but then the terminating rank(s) get to MPI_Abort(), and the MPI stack brings down everybody.
If a rank has a not-yet-caught exception, but other ranks still think they're fine to keep doing libMesh communication, then in the old code they were going to have a parallel hang, with the exception-thrower's barrier() conflicting with another rank's comm.whatever(foo), but now there's at least some possibility of an application's clever catch() handler untangling such a mess.
If a rank has a not-yet-caught exception, but other ranks think they're completely done with libMesh communication and destruct LibMeshInit, the catch handler now needs to be changed, e.g. to have its own barrier() as part of its untangling process. So, technically this is an API-changing diff, but honestly if anyone has written code that usefully catches an exception that unwinds LibMeshInit, they can probably also handle the change. ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me distinguish between "not-yet-caught" (someone's LibMeshInit is inside a try block and the catch block will handle this exception) and "terminating" (we're unwinding before the terminate handler) exceptions.
Reading https://en.cppreference.com/w/cpp/error/uncaught_exception.html I don't see any difference between "not-yet-caught" and "terminating". It seems like all that matters is whether we're stack unwinding, which would be true in both cases.
If everyone has an exception at the same time, we're fine; nobody calls the
barrier().If nobody has exceptions, we're fine; everybody completes the
barrier().
This seems exactly backwards to me? Shouldn't it be "if everyone is stack unwinding, then everyone completes the barrier" and "if no-one is stack unwinding, then no-one calls/goes-through the barrier"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like all that matters is whether we're stack unwinding, which would be true in both cases.
It's not. If we're not-yet-caught, then we're stack unwinding. But if we're terminating, then the compiler is allowed to unwind the stack before calling the terminate handler but is also allowed to just call the terminate handler without unwinding anything.
To uncaught_exceptions(), all that matters is whether we're stack unwinding or not, but to our attempt to avoid a parallel hang we want to consider the differing cases.
This seems exactly backwards to me? Shouldn't it be "if everyone is stack unwinding, then everyone completes the barrier" and "if no-one is stack unwinding, then no-one calls/goes-through the barrier"?
If everyone is stack unwinding ... damn it. I didn't type the ! before std::uncaught_exceptions(), did I? I swear it's there in the imaginary code in my head. My comments here are describing what the code should be doing; yours are describing what it actually is doing.
I'll fix it now. Thank you!!
|
Job Coverage, step Generate coverage on cd248fb wanted to post the following: Coverage
Warnings
This comment will be updated on new commits. |
||||||||||||||||||||||||||
I was feeling proud that my lengthy comment clearly explained what the code here is intended to do; it would be even better if the code itself also matched what the code is intended to do.
See libMesh/libmesh#4351 for more context.
#4237 caused us to segfault (at least on some systems) whenever we die from a thrown exception. Cleaning up our new fancy thread-safe shims from our terminate handler in addition to from our LibMeshInit destructor fixes that for me.