-
Notifications
You must be signed in to change notification settings - Fork 28
refactor: EventLoop locking cleanups + client disconnect exception #160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Because this change removes --- a/src/ipc/capnp/protocol.cpp
+++ b/src/ipc/capnp/protocol.cpp
@@ -41,10 +41,7 @@ class CapnpProtocol : public Protocol
public:
~CapnpProtocol() noexcept(true)
{
- if (m_loop) {
- std::unique_lock<std::mutex> lock(m_loop->m_mutex);
- m_loop->removeClient(lock);
- }
+ m_loop_ref.reset();
if (m_loop_thread.joinable()) m_loop_thread.join();
assert(!m_loop);
};
@@ -83,10 +80,7 @@ public:
m_loop_thread = std::thread([&] {
util::ThreadRename("capnp-loop");
m_loop.emplace(exe_name, &IpcLogFn, &m_context);
- {
- std::unique_lock<std::mutex> lock(m_loop->m_mutex);
- m_loop->addClient(lock);
- }
+ m_loop_ref.emplace(*m_loop);
promise.set_value();
m_loop->loop();
m_loop.reset();
@@ -96,6 +90,7 @@ public:
Context m_context;
std::thread m_loop_thread;
std::optional<mp::EventLoop> m_loop;
+ std::optional<mp::EventLoopRef> m_loop_ref;
};
} // namespace |
Updated ce4814f -> b47ea9f ( |
Use EventLoopRef to avoid reference counting bugs and be more exception safe and deal with removal of addClient/removeClient methods in bitcoin-core/libmultiprocess#160 A test update is also required due to bitcoin-core/libmultiprocess#160 to deal with changed reference count semantics. In IpcPipeTest(), it is is now necessary to destroy the client Proxy object instead of just the client Connection object to decrease the event loop reference count and allow the loop to exit so the test does not hang on shutdown.
This fixes an error reported by Antoine Poinsot <[email protected]> in bitcoin-core/libmultiprocess#123 that does not happen in master, but does happen with bitcoin#10102 applied, where if the child bitcoin-wallet process is killed (either by an external signal or by Ctrl-C as reported in the issue) the bitcoin-node process will not shutdown cleanly after that because chain client flush() calls will fail. This change fixes the problem by handling ipc::Exception errors thrown during the flush() calls, and it relies on the fixes to disconnect detection implemented in bitcoin-core/libmultiprocess#160 to work effectively.
Rebased b47ea9f -> f15ef6c ( |
Use EventLoopRef to avoid reference counting bugs and be more exception safe and deal with removal of addClient/removeClient methods in bitcoin-core/libmultiprocess#160 A test update is also required due to bitcoin-core/libmultiprocess#160 to deal with changed reference count semantics. In IpcPipeTest(), it is is now necessary to destroy the client Proxy object instead of just the client Connection object to decrease the event loop reference count and allow the loop to exit so the test does not hang on shutdown.
This fixes an error reported by Antoine Poinsot <[email protected]> in bitcoin-core/libmultiprocess#123 that does not happen in master, but does happen with bitcoin#10102 applied, where if the child bitcoin-wallet process is killed (either by an external signal or by Ctrl-C as reported in the issue) the bitcoin-node process will not shutdown cleanly after that because chain client flush() calls will fail. This change fixes the problem by handling ipc::Exception errors thrown during the flush() calls, and it relies on the fixes to disconnect detection implemented in bitcoin-core/libmultiprocess#160 to work effectively.
This fixes an error reported by Antoine Poinsot <[email protected]> in bitcoin-core/libmultiprocess#123 that does not happen in master, but does happen with bitcoin#10102 applied, where if the child bitcoin-wallet process is killed (either by an external signal or by Ctrl-C as reported in the issue) the bitcoin-node process will not shutdown cleanly after that because chain client flush() calls will fail. This change fixes the problem by handling ipc::Exception errors thrown during the flush() calls, and it relies on the fixes to disconnect detection implemented in bitcoin-core/libmultiprocess#160 to work effectively.
This fixes an error reported by Antoine Poinsot <[email protected]> in bitcoin-core/libmultiprocess#123 that does not happen in master, but does happen with bitcoin#10102 applied, where if the child bitcoin-wallet process is killed (either by an external signal or by Ctrl-C as reported in the issue) the bitcoin-node process will not shutdown cleanly after that because chain client flush() calls will fail. This change fixes the problem by handling ipc::Exception errors thrown during the flush() calls, and it relies on the fixes to disconnect detection implemented in bitcoin-core/libmultiprocess#160 to work effectively.
Rebased f15ef6c -> 2214358 ( |
Rebased 2214358 -> 6715a96 ( |
Use EventLoopRef to avoid reference counting bugs and be more exception safe and deal with removal of addClient/removeClient methods in bitcoin-core/libmultiprocess#160 A test update is also required due to bitcoin-core/libmultiprocess#160 to deal with changed reference count semantics. In IpcPipeTest(), it is is now necessary to destroy the client Proxy object instead of just the client Connection object to decrease the event loop reference count and allow the loop to exit so the test does not hang on shutdown.
This fixes an error reported by Antoine Poinsot <[email protected]> in bitcoin-core/libmultiprocess#123 that does not happen in master, but does happen with bitcoin#10102 applied, where if the child bitcoin-wallet process is killed (either by an external signal or by Ctrl-C as reported in the issue) the bitcoin-node process will not shutdown cleanly after that because chain client flush() calls will fail. This change fixes the problem by handling ipc::Exception errors thrown during the flush() calls, and it relies on the fixes to disconnect detection implemented in bitcoin-core/libmultiprocess#160 to work effectively.
Use EventLoopRef to avoid reference counting bugs and be more exception safe and deal with removal of addClient/removeClient methods in bitcoin-core/libmultiprocess#160 A test update is also required due to bitcoin-core/libmultiprocess#160 to deal with changed reference count semantics. In IpcPipeTest(), it is is now necessary to destroy the client Proxy object instead of just the client Connection object to decrease the event loop reference count and allow the loop to exit so the test does not hang on shutdown.
This fixes an error reported by Antoine Poinsot <[email protected]> in bitcoin-core/libmultiprocess#123 that does not happen in master, but does happen with bitcoin#10102 applied, where if the child bitcoin-wallet process is killed (either by an external signal or by Ctrl-C as reported in the issue) the bitcoin-node process will not shutdown cleanly after that because chain client flush() calls will fail. This change fixes the problem by handling ipc::Exception errors thrown during the flush() calls, and it relies on the fixes to disconnect detection implemented in bitcoin-core/libmultiprocess#160 to work effectively.
This fixes an error reported by Antoine Poinsot <[email protected]> in bitcoin-core/libmultiprocess#123 that does not happen in master, but does happen with bitcoin#10102 applied, where if the child bitcoin-wallet process is killed (either by an external signal or by Ctrl-C as reported in the issue) the bitcoin-node process will not shutdown cleanly after that because chain client flush() calls will fail. This change fixes the problem by handling ipc::Exception errors thrown during the flush() calls, and it relies on the fixes to disconnect detection implemented in bitcoin-core/libmultiprocess#160 to work effectively.
This fixes an error reported by Antoine Poinsot <[email protected]> in bitcoin-core/libmultiprocess#123 that does not happen in master, but does happen with bitcoin#10102 applied, where if the child bitcoin-wallet process is killed (either by an external signal or by Ctrl-C as reported in the issue) the bitcoin-node process will not shutdown cleanly after that because chain client flush() calls will fail. This change fixes the problem by handling ipc::Exception errors thrown during the flush() calls, and it relies on the fixes to disconnect detection implemented in bitcoin-core/libmultiprocess#160 to work effectively.
Rewrote the PR description since this is now a dependency of bitcoin/bitcoin#32345 and needed to fix bugs #123 and #176 |
Antoine Poinsot <[email protected]> reported a bug with details in bitcoin-core#182 (comment) where an IPC client rapidly connecting and disconnecting to the server in a loop could cause problems. One problem this uncovered was hangs on shutdown fixed by the previous commit. Another problem it uncovered is that if a disconnect happened while the IPC server was executing an IPC call, and the onDisconnect() handling code ran before the IPC call finished, memory corruption would occur usually leading to a hang or crash. This happend because the ~ProxyServerBase() destructor was written with the assumption that it would always be called either from the ~Connection() destructor, or before that point, so its m_context.connection pointer would always be valid. Typically this was the case, but not if the connection was closed during an active IPC call. A unit test to reproduce this error is added in the subsequent commit. The fix for the bug here is actually a code simplification. Previously ~ProxyServerBase() destructor would append a std::function callback destroying the ProxyServerBase::m_impl object to m_context.connection->m_async_cleanup_fns, so the object destructor could be called asynchronously without blocking the event loop. Now it just appends the same callback function to m_context.loop->m_async_fns without going through the Connection object. The new code is an improvement over the previous code in two other ways: - It is a code simplification since previous code was needlessly complicated. - In the case where there is no disconnect, and the remote ProxyClient is destroyed, and no "destroy" method is declared, this change causes the ProxyServer::m_impl object to be freed shortly after the client is freed, instead of being delayed until the connection is closed. (If a "destroy" method is declared, this change has no effect because in that case destroying the ProxyClient calls destroy and destroys ProxyServer::m_impl synchronously.) The previous code was trying to make the ProxyServer cleanup with Connection::m_async_cleanup_fns mirror the ProxyClient cleanup with Connection::m_sync_cleanup_fns, but it was just fragile and unncecessarily complicated. Some comments about ProxyClient cleanup code are updated for clarity here but no changes are made.
This change adds two unit tests to make sure shutdown happens cleanly if the the client disconnects while the server is processing an IPC call. The first test checks a synchronous case which has always worked, where the disconnect starts but is not processed by the event loop until after the IPC call completes. The second test checks an asynchronous case which was fixed by the previous commit, where the disconnect is processed before the IPC call generates a response. This test would usually crash or hang prior to the fix in the previous commit. It would crash reliably with address sanitizer.
Suggested by Sjors Provoost <[email protected]> bitcoin-core#160 (comment)
Suggested by Sjors Provoost <[email protected]> bitcoin-core#160 (comment)
Rebased 0b45cb5 -> 4c33734 ( |
Use EventLoopRef to avoid reference counting bugs and be more exception safe and deal with removal of addClient/removeClient methods in bitcoin-core/libmultiprocess#160 A test update is also required due to bitcoin-core/libmultiprocess#160 to deal with changed reference count semantics. In IpcPipeTest(), it is is now necessary to destroy the client Proxy object instead of just the client Connection object to decrease the event loop reference count and allow the loop to exit so the test does not hang on shutdown.
This fixes an error reported by Antoine Poinsot <[email protected]> in bitcoin-core/libmultiprocess#123 that does not happen in master, but does happen with bitcoin#10102 applied, where if the child bitcoin-wallet process is killed (either by an external signal or by Ctrl-C as reported in the issue) the bitcoin-node process will not shutdown cleanly after that because chain client flush() calls will fail. This change fixes the problem by handling ipc::Exception errors thrown during the flush() calls, and it relies on the fixes to disconnect detection implemented in bitcoin-core/libmultiprocess#160 to work effectively.
Use EventLoopRef to avoid reference counting bugs and be more exception safe and deal with removal of addClient/removeClient methods in bitcoin-core/libmultiprocess#160 A test update is also required due to bitcoin-core/libmultiprocess#160 to deal with changed reference count semantics. In IpcPipeTest(), it is is now necessary to destroy the client Proxy object instead of just the client Connection object to decrease the event loop reference count and allow the loop to exit so the test does not hang on shutdown.
This fixes an error reported by Antoine Poinsot <[email protected]> in bitcoin-core/libmultiprocess#123 that does not happen in master, but does happen with bitcoin#10102 applied, where if the child bitcoin-wallet process is killed (either by an external signal or by Ctrl-C as reported in the issue) the bitcoin-node process will not shutdown cleanly after that because chain client stop() calls will fail. This change fixes the problem by handling ipc::Exception errors thrown during the stop() calls, and it relies on the fixes to disconnect detection implemented in bitcoin-core/libmultiprocess#160 to work effectively.
doc: Document ProxyClientBase destroy_connection option Suggested by Sjors Provoost <[email protected]> bitcoin-core#160 (comment)
doc: Document ProxyClientBase destroy_connection option Suggested by Sjors Provoost <[email protected]> bitcoin-core#160 (comment)
Use EventLoopRef to avoid reference counting bugs and be more exception safe and deal with removal of addClient/removeClient methods in bitcoin-core/libmultiprocess#160 A test update is also required due to bitcoin-core/libmultiprocess#160 to deal with changed reference count semantics. In IpcPipeTest(), it is is now necessary to destroy the client Proxy object instead of just the client Connection object to decrease the event loop reference count and allow the loop to exit so the test does not hang on shutdown.
This fixes an error reported by Antoine Poinsot <[email protected]> in bitcoin-core/libmultiprocess#123 that does not happen in master, but does happen with bitcoin#10102 applied, where if the child bitcoin-wallet process is killed (either by an external signal or by Ctrl-C as reported in the issue) the bitcoin-node process will not shutdown cleanly after that because chain client stop() calls will fail. This change fixes the problem by handling ipc::Exception errors thrown during the stop() calls, and it relies on the fixes to disconnect detection implemented in bitcoin-core/libmultiprocess#160 to work effectively.
Use EventLoopRef to avoid reference counting bugs and be more exception safe and deal with removal of addClient/removeClient methods in bitcoin-core/libmultiprocess#160 A test update is also required due to bitcoin-core/libmultiprocess#160 to deal with changed reference count semantics. In IpcPipeTest(), it is is now necessary to destroy the client Proxy object instead of just the client Connection object to decrease the event loop reference count and allow the loop to exit so the test does not hang on shutdown.
This fixes an error reported by Antoine Poinsot <[email protected]> in bitcoin-core/libmultiprocess#123 that does not happen in master, but does happen with bitcoin#10102 applied, where if the child bitcoin-wallet process is killed (either by an external signal or by Ctrl-C as reported in the issue) the bitcoin-node process will not shutdown cleanly after that because chain client stop() calls will fail. This change fixes the problem by handling ipc::Exception errors thrown during the stop() calls, and it relies on the fixes to disconnect detection implemented in bitcoin-core/libmultiprocess#160 to work effectively.
Use EventLoopRef to avoid reference counting bugs and be more exception safe and deal with removal of addClient/removeClient methods in bitcoin-core/libmultiprocess#160 A test update is also required due to bitcoin-core/libmultiprocess#160 to deal with changed reference count semantics. In IpcPipeTest(), it is is now necessary to destroy the client Proxy object instead of just the client Connection object to decrease the event loop reference count and allow the loop to exit so the test does not hang on shutdown.
This fixes an error reported by Antoine Poinsot <[email protected]> in bitcoin-core/libmultiprocess#123 that does not happen in master, but does happen with bitcoin#10102 applied, where if the child bitcoin-wallet process is killed (either by an external signal or by Ctrl-C as reported in the issue) the bitcoin-node process will not shutdown cleanly after that because chain client stop() calls will fail. This change fixes the problem by handling ipc::Exception errors thrown during the stop() calls, and it relies on the fixes to disconnect detection implemented in bitcoin-core/libmultiprocess#160 to work effectively.
Use EventLoopRef to avoid reference counting bugs and be more exception safe and deal with removal of addClient/removeClient methods in bitcoin-core/libmultiprocess#160 A test update is also required due to bitcoin-core/libmultiprocess#160 to deal with changed reference count semantics. In IpcPipeTest(), it is is now necessary to destroy the client Proxy object instead of just the client Connection object to decrease the event loop reference count and allow the loop to exit so the test does not hang on shutdown.
This fixes an error reported by Antoine Poinsot <[email protected]> in bitcoin-core/libmultiprocess#123 that does not happen in master, but does happen with bitcoin#10102 applied, where if the child bitcoin-wallet process is killed (either by an external signal or by Ctrl-C as reported in the issue) the bitcoin-node process will not shutdown cleanly after that because chain client stop() calls will fail. This change fixes the problem by handling ipc::Exception errors thrown during the stop() calls, and it relies on the fixes to disconnect detection implemented in bitcoin-core/libmultiprocess#160 to work effectively.
{ | ||
public: | ||
explicit EventLoopRef(EventLoop& loop, std::unique_lock<std::mutex>* lock = nullptr); | ||
EventLoopRef(EventLoopRef&& other) noexcept : m_loop(other.m_loop) { other.m_loop = nullptr; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell it is correct to not copy over the lock from the moved object here as long as the moved object has no lock. Should that be asserted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: #160 (comment)
In commit "proxy-io.h: Add EventLoopRef RAII class handle addClient/removeClient refcounting" (9aaeec3)
As far as I can tell it is correct to not copy over the lock from the moved object here as long as the moved object has no lock. Should that be asserted?
It's kind of a corner case, and I think either behavior could be useful here. The point of this constructor is to be able to efficiently move a reference count from one owner to another would pointlessly incrementing & decrement the count when creating one reference and dropping the other.
I feel like it probably would not be useful to copy the m_lock
pointer when doing that because practical effect of copying would be to make the new EventLoopRef object destructor skip locking when it is destroyed (it would assert m_lock is locked instead).
On the other hand the default move-constructor would copy the m_lock value so maybe it is less surprising to follow the default behavior. And if the caller really wanted m_lock
to be null after moving they could set it to null themselves.
So this is a good catch and I think I'd now lean towards adding m_lock{other.m_lock}
but I don't think the difference should matter in practice.
More broadly I was also thinking about whether I could split EventLoopRef into a superclass without an m_lock
member and a subclass with one to be able to get rid of the relock
option added in ea38392 which requires turning off thread safety analysis in the reset function. At the moment this class seems good at preventing reference counting bugs because it guarantees reference counts will always be decremented if they are incremented, but it doesn't have very easy to understand locking semantics so that's probably an area for improvement.
while (!done()) { | ||
if (!m_async_fns.empty()) { | ||
while (m_async_fns) { | ||
if (!m_async_fns->empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a chance of encountering a race condition here between checking m_async_fns
and calling ->empty()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: #160 (comment)
In commit "Prevent EventLoop async cleanup thread early exit during shutdown" (ea38392)
Is there a chance of encountering a race condition here between checking
m_async_fns
and calling->empty()
?
There shouldn't be just because m_async_fns is guarded by m_mutex which is locked above.
@@ -215,5 +217,69 @@ KJ_TEST("Calling IPC method after server connection is closed") | |||
KJ_EXPECT(disconnected); | |||
} | |||
|
|||
KJ_TEST("Calling IPC method and disconnecting during the call") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous test exercises a server disconnect. Would it be useful to either add another case, or expand it to test the server going away during a client call? I tried just setting
setup.server->m_impl->m_fn = setup.server_disconnect;
but that crashes, which I guess is kind of expected, since we're tearing down objects that are used in the server thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: #160 (comment)
In commit "test: Test disconnects during IPC calls" (84cf56a)
The previous test exercises a server disconnect. Would it be useful to either add another case, or expand it to test the server going away during a client call?
That's a good idea. It should be legitimate for an IPC call on the server to close the connection which triggered the call, or for some other thread on the server to close the connection while a slow IPC call is executing. So it would make sense to add tests covering these cases and I'm a little surprised the change you tested doesn't work.
I should look into what the cause of the problem is but I suspect it is something simple like an unchecked null pointer or uncaught exception. I am beginning to learn that there are many different ways connections can be disconnected and interrupted and in capnproto each way tends to lead to a different piece of code throwing an exception. #183 is another case found today. It seems like I should probably add broader exception handlers instead of letting errors bubble up so far.
For what it's worth I think the code here could be merged. Further improving exception handling does not all have to happen in here. Getting the EventLoop locking cleanups in is a clear improvement on the baseline in my eyes. |
re: #160 (comment)
Yes that makes a lot of sense. There are more improvements to be made but I don't think it's good for this PR to keep growing forever just because the next set of improvements depend on current ones. So I'll plan to not make any more changes here and merge in the next day or so, but I would appreciate if you or Sjors or other reviewers could look at this a little more and add acks if appropriate before then. As mentioned #172 (comment) I would really like to get away from the practice of merging PRs here without current acks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 84cf56a
This PR was originally written as a code cleanup to follow up on review comments in earlier PRs. Several other commits were added afterward building on the earlier changes, and necessary for the bugfixes in bitcoin/bitcoin#32345, which resolves issues #123, #176 and #182.
Summary of changes:
bitcoin/bitcoin#32345 commit "ipc: Handle bitcoin-wallet disconnection" to fix #123.