Skip to content

refactor: EventLoop locking cleanups + client disconnect exception #160

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Jun 19, 2025

Conversation

ryanofsky
Copy link
Collaborator

@ryanofsky ryanofsky commented Feb 10, 2025

This PR was originally written as a code cleanup to follow up on review comments in earlier PRs. Several other commits were added afterward building on the earlier changes, and necessary for the bugfixes in bitcoin/bitcoin#32345, which resolves issues #123, #176 and #182.

Summary of changes:

  • The commit "Improve IPC client disconnected exceptions" is the only external change, and lets clients detect when they are trying to call a remote method after a connection has been closed. This change is used by
    bitcoin/bitcoin#32345 commit "ipc: Handle bitcoin-wallet disconnection" to fix #123.
  • The commits "Prevent EventLoop async cleanup thread early exit during shutdown" and "Prevent IPC server crash if disconnected during IPC call" fix issues reported in #182.
  • New tests are added in other commits to verify these fixes, covering different kinds of disconnections.
  • The change to detect disconnects also relies on an earlier "Add ProxyContext EventLoop* member" commit.
  • An "Add EventLoopRef RAII class" commit simplifies a recent bugfix in #159 and also relies on the earlier "EventLoop* member" commit.
  • The "Remove DestructorCatcher and AsyncCallable" commit follows up on #144 (comment).
  • The "Add clang thread safety annotations" commit follows up on #129 (comment).

@ryanofsky
Copy link
Collaborator Author

Because this change removes EventLoop addClient and removeClient methods it will require an update to Bitcoin core if it is merged.

--- a/src/ipc/capnp/protocol.cpp
+++ b/src/ipc/capnp/protocol.cpp
@@ -41,10 +41,7 @@ class CapnpProtocol : public Protocol
 public:
     ~CapnpProtocol() noexcept(true)
     {
-        if (m_loop) {
-            std::unique_lock<std::mutex> lock(m_loop->m_mutex);
-            m_loop->removeClient(lock);
-        }
+        m_loop_ref.reset();
         if (m_loop_thread.joinable()) m_loop_thread.join();
         assert(!m_loop);
     };
@@ -83,10 +80,7 @@ public:
         m_loop_thread = std::thread([&] {
             util::ThreadRename("capnp-loop");
             m_loop.emplace(exe_name, &IpcLogFn, &m_context);
-            {
-                std::unique_lock<std::mutex> lock(m_loop->m_mutex);
-                m_loop->addClient(lock);
-            }
+            m_loop_ref.emplace(*m_loop);
             promise.set_value();
             m_loop->loop();
             m_loop.reset();
@@ -96,6 +90,7 @@ public:
     Context m_context;
     std::thread m_loop_thread;
     std::optional<mp::EventLoop> m_loop;
+    std::optional<mp::EventLoopRef> m_loop_ref;
 };
 } // namespace

@ryanofsky ryanofsky changed the title refactor: EventLoop locking cleanups refactor: EventLoop locking cleanups and client Disconnect errors Apr 24, 2025
@ryanofsky ryanofsky changed the title refactor: EventLoop locking cleanups and client Disconnect errors refactor: EventLoop locking cleanups + client disconnect exceptions Apr 24, 2025
@ryanofsky ryanofsky changed the title refactor: EventLoop locking cleanups + client disconnect exceptions refactor: EventLoop locking cleanups + client disconnect exception Apr 24, 2025
@ryanofsky
Copy link
Collaborator Author

ryanofsky commented Apr 24, 2025

Updated ce4814f -> b47ea9f (pr/eventlock.1 -> pr/eventlock.2, compare), making test and comment fixes, splitting commits, and adding connection disconnects tests and exceptions to help resolve #123

ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Apr 24, 2025
Use EventLoopRef to avoid reference counting bugs and be more exception safe
and deal with removal of addClient/removeClient methods in
bitcoin-core/libmultiprocess#160

A test update is also required due to
bitcoin-core/libmultiprocess#160 to deal with changed
reference count semantics. In IpcPipeTest(), it is is now necessary to destroy
the client Proxy object instead of just the client Connection object to
decrease the event loop reference count and allow the loop to exit so the test
does not hang on shutdown.
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Apr 24, 2025
This fixes an error reported by Antoine Poinsot <[email protected]> in
bitcoin-core/libmultiprocess#123 that does not happen
in master, but does happen with bitcoin#10102
applied, where if the child bitcoin-wallet process is killed (either by an
external signal or by Ctrl-C as reported in the issue) the bitcoin-node process
will not shutdown cleanly after that because chain client flush()
calls will fail.

This change fixes the problem by handling ipc::Exception errors thrown during
the flush() calls, and it relies on the fixes to disconnect detection
implemented in bitcoin-core/libmultiprocess#160 to work
effectively.
@ryanofsky
Copy link
Collaborator Author

Rebased b47ea9f -> f15ef6c (pr/eventlock.2 -> pr/eventlock.3, compare) so this can be merged cleanly in bitcoin PR bitcoin/bitcoin#32345. Also dropped DisconnectError exception type so original ipc::Exception can be used instead.

ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Apr 25, 2025
Use EventLoopRef to avoid reference counting bugs and be more exception safe
and deal with removal of addClient/removeClient methods in
bitcoin-core/libmultiprocess#160

A test update is also required due to
bitcoin-core/libmultiprocess#160 to deal with changed
reference count semantics. In IpcPipeTest(), it is is now necessary to destroy
the client Proxy object instead of just the client Connection object to
decrease the event loop reference count and allow the loop to exit so the test
does not hang on shutdown.
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Apr 25, 2025
This fixes an error reported by Antoine Poinsot <[email protected]> in
bitcoin-core/libmultiprocess#123 that does not happen
in master, but does happen with bitcoin#10102
applied, where if the child bitcoin-wallet process is killed (either by an
external signal or by Ctrl-C as reported in the issue) the bitcoin-node process
will not shutdown cleanly after that because chain client flush()
calls will fail.

This change fixes the problem by handling ipc::Exception errors thrown during
the flush() calls, and it relies on the fixes to disconnect detection
implemented in bitcoin-core/libmultiprocess#160 to work
effectively.
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Apr 28, 2025
This fixes an error reported by Antoine Poinsot <[email protected]> in
bitcoin-core/libmultiprocess#123 that does not happen
in master, but does happen with bitcoin#10102
applied, where if the child bitcoin-wallet process is killed (either by an
external signal or by Ctrl-C as reported in the issue) the bitcoin-node process
will not shutdown cleanly after that because chain client flush()
calls will fail.

This change fixes the problem by handling ipc::Exception errors thrown during
the flush() calls, and it relies on the fixes to disconnect detection
implemented in bitcoin-core/libmultiprocess#160 to work
effectively.
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Apr 28, 2025
This fixes an error reported by Antoine Poinsot <[email protected]> in
bitcoin-core/libmultiprocess#123 that does not happen
in master, but does happen with bitcoin#10102
applied, where if the child bitcoin-wallet process is killed (either by an
external signal or by Ctrl-C as reported in the issue) the bitcoin-node process
will not shutdown cleanly after that because chain client flush()
calls will fail.

This change fixes the problem by handling ipc::Exception errors thrown during
the flush() calls, and it relies on the fixes to disconnect detection
implemented in bitcoin-core/libmultiprocess#160 to work
effectively.
@ryanofsky
Copy link
Collaborator Author

Rebased f15ef6c -> 2214358 (pr/eventlock.3 -> pr/eventlock.4, compare) to fix conflict with #165

@ryanofsky
Copy link
Collaborator Author

Rebased 2214358 -> 6715a96 (pr/eventlock.4 -> pr/eventlock.5, compare) to fix conflicts with #172

ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jun 5, 2025
Use EventLoopRef to avoid reference counting bugs and be more exception safe
and deal with removal of addClient/removeClient methods in
bitcoin-core/libmultiprocess#160

A test update is also required due to
bitcoin-core/libmultiprocess#160 to deal with changed
reference count semantics. In IpcPipeTest(), it is is now necessary to destroy
the client Proxy object instead of just the client Connection object to
decrease the event loop reference count and allow the loop to exit so the test
does not hang on shutdown.
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jun 5, 2025
This fixes an error reported by Antoine Poinsot <[email protected]> in
bitcoin-core/libmultiprocess#123 that does not happen
in master, but does happen with bitcoin#10102
applied, where if the child bitcoin-wallet process is killed (either by an
external signal or by Ctrl-C as reported in the issue) the bitcoin-node process
will not shutdown cleanly after that because chain client flush()
calls will fail.

This change fixes the problem by handling ipc::Exception errors thrown during
the flush() calls, and it relies on the fixes to disconnect detection
implemented in bitcoin-core/libmultiprocess#160 to work
effectively.
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jun 5, 2025
Use EventLoopRef to avoid reference counting bugs and be more exception safe
and deal with removal of addClient/removeClient methods in
bitcoin-core/libmultiprocess#160

A test update is also required due to
bitcoin-core/libmultiprocess#160 to deal with changed
reference count semantics. In IpcPipeTest(), it is is now necessary to destroy
the client Proxy object instead of just the client Connection object to
decrease the event loop reference count and allow the loop to exit so the test
does not hang on shutdown.
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jun 5, 2025
This fixes an error reported by Antoine Poinsot <[email protected]> in
bitcoin-core/libmultiprocess#123 that does not happen
in master, but does happen with bitcoin#10102
applied, where if the child bitcoin-wallet process is killed (either by an
external signal or by Ctrl-C as reported in the issue) the bitcoin-node process
will not shutdown cleanly after that because chain client flush()
calls will fail.

This change fixes the problem by handling ipc::Exception errors thrown during
the flush() calls, and it relies on the fixes to disconnect detection
implemented in bitcoin-core/libmultiprocess#160 to work
effectively.
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jun 5, 2025
This fixes an error reported by Antoine Poinsot <[email protected]> in
bitcoin-core/libmultiprocess#123 that does not happen
in master, but does happen with bitcoin#10102
applied, where if the child bitcoin-wallet process is killed (either by an
external signal or by Ctrl-C as reported in the issue) the bitcoin-node process
will not shutdown cleanly after that because chain client flush()
calls will fail.

This change fixes the problem by handling ipc::Exception errors thrown during
the flush() calls, and it relies on the fixes to disconnect detection
implemented in bitcoin-core/libmultiprocess#160 to work
effectively.
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jun 6, 2025
This fixes an error reported by Antoine Poinsot <[email protected]> in
bitcoin-core/libmultiprocess#123 that does not happen
in master, but does happen with bitcoin#10102
applied, where if the child bitcoin-wallet process is killed (either by an
external signal or by Ctrl-C as reported in the issue) the bitcoin-node process
will not shutdown cleanly after that because chain client flush()
calls will fail.

This change fixes the problem by handling ipc::Exception errors thrown during
the flush() calls, and it relies on the fixes to disconnect detection
implemented in bitcoin-core/libmultiprocess#160 to work
effectively.
@ryanofsky
Copy link
Collaborator Author

Rewrote the PR description since this is now a dependency of bitcoin/bitcoin#32345 and needed to fix bugs #123 and #176

Antoine Poinsot <[email protected]> reported a bug with details in
bitcoin-core#182 (comment)
where an IPC client rapidly connecting and disconnecting to the server in a
loop could cause problems.

One problem this uncovered was hangs on shutdown fixed by the previous commit.

Another problem it uncovered is that if a disconnect happened while the IPC
server was executing an IPC call, and the onDisconnect() handling code ran
before the IPC call finished, memory corruption would occur usually leading to
a hang or crash.

This happend because the ~ProxyServerBase() destructor was written with the
assumption that it would always be called either from the ~Connection()
destructor, or before that point, so its m_context.connection pointer would
always be valid. Typically this was the case, but not if the connection was
closed during an active IPC call.

A unit test to reproduce this error is added in the subsequent commit.

The fix for the bug here is actually a code simplification.  Previously
~ProxyServerBase() destructor would append a std::function callback destroying
the ProxyServerBase::m_impl object to
m_context.connection->m_async_cleanup_fns, so the object destructor could be
called asynchronously without blocking the event loop. Now it just appends the
same callback function to m_context.loop->m_async_fns without going through the
Connection object.

The new code is an improvement over the previous code in two other ways:

- It is a code simplification since previous code was needlessly complicated.

- In the case where there is no disconnect, and the remote ProxyClient is
  destroyed, and no "destroy" method is declared, this change causes the
  ProxyServer::m_impl object to be freed shortly after the client is freed,
  instead of being delayed until the connection is closed. (If a "destroy"
  method is declared, this change has no effect because in that case destroying
  the ProxyClient calls destroy and destroys ProxyServer::m_impl synchronously.)

The previous code was trying to make the ProxyServer cleanup with
Connection::m_async_cleanup_fns mirror the ProxyClient cleanup with
Connection::m_sync_cleanup_fns, but it was just fragile and unncecessarily
complicated.

Some comments about ProxyClient cleanup code are updated for clarity here but
no changes are made.
This change adds two unit tests to make sure shutdown happens cleanly if the
the client disconnects while the server is processing an IPC call.

The first test checks a synchronous case which has always worked, where the
disconnect starts but is not processed by the event loop until after the IPC
call completes.

The second test checks an asynchronous case which was fixed by the previous
commit, where the disconnect is processed before the IPC call generates a
response. This test would usually crash or hang prior to the fix in the
previous commit. It would crash reliably with address sanitizer.
ryanofsky added a commit to ryanofsky/libmultiprocess that referenced this pull request Jun 13, 2025
ryanofsky added a commit to ryanofsky/libmultiprocess that referenced this pull request Jun 13, 2025
@ryanofsky
Copy link
Collaborator Author

ryanofsky commented Jun 13, 2025

Rebased 0b45cb5 -> 4c33734 (pr/eventlock.6 -> pr/eventlock.7, compare) due to conflict with #181, also adding 3 commits to fix #182, and making minor cleanups to earlier commits (mainly comments)
Updated 4c33734 -> 2dcfb46 (pr/eventlock.7 -> pr/eventlock.8, compare) to fix spurious thread safety warning https://cirrus-ci.com/task/4956347122319360 and spelling errors
Updated 2dcfb46 -> 84cf56a (pr/eventlock.8 -> pr/eventlock.9, compare) to fix spurious thread safety warning https://cirrus-ci.com/task/5066891695226880

ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jun 13, 2025
Use EventLoopRef to avoid reference counting bugs and be more exception safe
and deal with removal of addClient/removeClient methods in
bitcoin-core/libmultiprocess#160

A test update is also required due to
bitcoin-core/libmultiprocess#160 to deal with changed
reference count semantics. In IpcPipeTest(), it is is now necessary to destroy
the client Proxy object instead of just the client Connection object to
decrease the event loop reference count and allow the loop to exit so the test
does not hang on shutdown.
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jun 13, 2025
This fixes an error reported by Antoine Poinsot <[email protected]> in
bitcoin-core/libmultiprocess#123 that does not happen
in master, but does happen with bitcoin#10102
applied, where if the child bitcoin-wallet process is killed (either by an
external signal or by Ctrl-C as reported in the issue) the bitcoin-node process
will not shutdown cleanly after that because chain client flush()
calls will fail.

This change fixes the problem by handling ipc::Exception errors thrown during
the flush() calls, and it relies on the fixes to disconnect detection
implemented in bitcoin-core/libmultiprocess#160 to work
effectively.
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jun 13, 2025
Use EventLoopRef to avoid reference counting bugs and be more exception safe
and deal with removal of addClient/removeClient methods in
bitcoin-core/libmultiprocess#160

A test update is also required due to
bitcoin-core/libmultiprocess#160 to deal with changed
reference count semantics. In IpcPipeTest(), it is is now necessary to destroy
the client Proxy object instead of just the client Connection object to
decrease the event loop reference count and allow the loop to exit so the test
does not hang on shutdown.
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jun 13, 2025
This fixes an error reported by Antoine Poinsot <[email protected]> in
bitcoin-core/libmultiprocess#123 that does not happen
in master, but does happen with bitcoin#10102
applied, where if the child bitcoin-wallet process is killed (either by an
external signal or by Ctrl-C as reported in the issue) the bitcoin-node process
will not shutdown cleanly after that because chain client stop()
calls will fail.

This change fixes the problem by handling ipc::Exception errors thrown during
the stop() calls, and it relies on the fixes to disconnect detection
implemented in bitcoin-core/libmultiprocess#160 to work
effectively.
ryanofsky added a commit to ryanofsky/libmultiprocess that referenced this pull request Jun 13, 2025
doc: Document ProxyClientBase destroy_connection option

Suggested by Sjors Provoost <[email protected]>
bitcoin-core#160 (comment)
ryanofsky added a commit to ryanofsky/libmultiprocess that referenced this pull request Jun 13, 2025
doc: Document ProxyClientBase destroy_connection option

Suggested by Sjors Provoost <[email protected]>
bitcoin-core#160 (comment)
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jun 16, 2025
Use EventLoopRef to avoid reference counting bugs and be more exception safe
and deal with removal of addClient/removeClient methods in
bitcoin-core/libmultiprocess#160

A test update is also required due to
bitcoin-core/libmultiprocess#160 to deal with changed
reference count semantics. In IpcPipeTest(), it is is now necessary to destroy
the client Proxy object instead of just the client Connection object to
decrease the event loop reference count and allow the loop to exit so the test
does not hang on shutdown.
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jun 16, 2025
This fixes an error reported by Antoine Poinsot <[email protected]> in
bitcoin-core/libmultiprocess#123 that does not happen
in master, but does happen with bitcoin#10102
applied, where if the child bitcoin-wallet process is killed (either by an
external signal or by Ctrl-C as reported in the issue) the bitcoin-node process
will not shutdown cleanly after that because chain client stop()
calls will fail.

This change fixes the problem by handling ipc::Exception errors thrown during
the stop() calls, and it relies on the fixes to disconnect detection
implemented in bitcoin-core/libmultiprocess#160 to work
effectively.
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jun 16, 2025
Use EventLoopRef to avoid reference counting bugs and be more exception safe
and deal with removal of addClient/removeClient methods in
bitcoin-core/libmultiprocess#160

A test update is also required due to
bitcoin-core/libmultiprocess#160 to deal with changed
reference count semantics. In IpcPipeTest(), it is is now necessary to destroy
the client Proxy object instead of just the client Connection object to
decrease the event loop reference count and allow the loop to exit so the test
does not hang on shutdown.
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jun 16, 2025
This fixes an error reported by Antoine Poinsot <[email protected]> in
bitcoin-core/libmultiprocess#123 that does not happen
in master, but does happen with bitcoin#10102
applied, where if the child bitcoin-wallet process is killed (either by an
external signal or by Ctrl-C as reported in the issue) the bitcoin-node process
will not shutdown cleanly after that because chain client stop()
calls will fail.

This change fixes the problem by handling ipc::Exception errors thrown during
the stop() calls, and it relies on the fixes to disconnect detection
implemented in bitcoin-core/libmultiprocess#160 to work
effectively.
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jun 16, 2025
Use EventLoopRef to avoid reference counting bugs and be more exception safe
and deal with removal of addClient/removeClient methods in
bitcoin-core/libmultiprocess#160

A test update is also required due to
bitcoin-core/libmultiprocess#160 to deal with changed
reference count semantics. In IpcPipeTest(), it is is now necessary to destroy
the client Proxy object instead of just the client Connection object to
decrease the event loop reference count and allow the loop to exit so the test
does not hang on shutdown.
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jun 16, 2025
This fixes an error reported by Antoine Poinsot <[email protected]> in
bitcoin-core/libmultiprocess#123 that does not happen
in master, but does happen with bitcoin#10102
applied, where if the child bitcoin-wallet process is killed (either by an
external signal or by Ctrl-C as reported in the issue) the bitcoin-node process
will not shutdown cleanly after that because chain client stop()
calls will fail.

This change fixes the problem by handling ipc::Exception errors thrown during
the stop() calls, and it relies on the fixes to disconnect detection
implemented in bitcoin-core/libmultiprocess#160 to work
effectively.
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jun 16, 2025
Use EventLoopRef to avoid reference counting bugs and be more exception safe
and deal with removal of addClient/removeClient methods in
bitcoin-core/libmultiprocess#160

A test update is also required due to
bitcoin-core/libmultiprocess#160 to deal with changed
reference count semantics. In IpcPipeTest(), it is is now necessary to destroy
the client Proxy object instead of just the client Connection object to
decrease the event loop reference count and allow the loop to exit so the test
does not hang on shutdown.
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jun 16, 2025
This fixes an error reported by Antoine Poinsot <[email protected]> in
bitcoin-core/libmultiprocess#123 that does not happen
in master, but does happen with bitcoin#10102
applied, where if the child bitcoin-wallet process is killed (either by an
external signal or by Ctrl-C as reported in the issue) the bitcoin-node process
will not shutdown cleanly after that because chain client stop()
calls will fail.

This change fixes the problem by handling ipc::Exception errors thrown during
the stop() calls, and it relies on the fixes to disconnect detection
implemented in bitcoin-core/libmultiprocess#160 to work
effectively.
{
public:
explicit EventLoopRef(EventLoop& loop, std::unique_lock<std::mutex>* lock = nullptr);
EventLoopRef(EventLoopRef&& other) noexcept : m_loop(other.m_loop) { other.m_loop = nullptr; }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell it is correct to not copy over the lock from the moved object here as long as the moved object has no lock. Should that be asserted?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: #160 (comment)

In commit "proxy-io.h: Add EventLoopRef RAII class handle addClient/removeClient refcounting" (9aaeec3)

As far as I can tell it is correct to not copy over the lock from the moved object here as long as the moved object has no lock. Should that be asserted?

It's kind of a corner case, and I think either behavior could be useful here. The point of this constructor is to be able to efficiently move a reference count from one owner to another would pointlessly incrementing & decrement the count when creating one reference and dropping the other.

I feel like it probably would not be useful to copy the m_lock pointer when doing that because practical effect of copying would be to make the new EventLoopRef object destructor skip locking when it is destroyed (it would assert m_lock is locked instead).

On the other hand the default move-constructor would copy the m_lock value so maybe it is less surprising to follow the default behavior. And if the caller really wanted m_lock to be null after moving they could set it to null themselves.

So this is a good catch and I think I'd now lean towards adding m_lock{other.m_lock} but I don't think the difference should matter in practice.

More broadly I was also thinking about whether I could split EventLoopRef into a superclass without an m_lock member and a subclass with one to be able to get rid of the relock option added in ea38392 which requires turning off thread safety analysis in the reset function. At the moment this class seems good at preventing reference counting bugs because it guarantees reference counts will always be decremented if they are incremented, but it doesn't have very easy to understand locking semantics so that's probably an area for improvement.

while (!done()) {
if (!m_async_fns.empty()) {
while (m_async_fns) {
if (!m_async_fns->empty()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a chance of encountering a race condition here between checking m_async_fns and calling ->empty()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: #160 (comment)

In commit "Prevent EventLoop async cleanup thread early exit during shutdown" (ea38392)

Is there a chance of encountering a race condition here between checking m_async_fns and calling ->empty()?

There shouldn't be just because m_async_fns is guarded by m_mutex which is locked above.

@@ -215,5 +217,69 @@ KJ_TEST("Calling IPC method after server connection is closed")
KJ_EXPECT(disconnected);
}

KJ_TEST("Calling IPC method and disconnecting during the call")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous test exercises a server disconnect. Would it be useful to either add another case, or expand it to test the server going away during a client call? I tried just setting

    setup.server->m_impl->m_fn = setup.server_disconnect;

but that crashes, which I guess is kind of expected, since we're tearing down objects that are used in the server thread.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: #160 (comment)

In commit "test: Test disconnects during IPC calls" (84cf56a)

The previous test exercises a server disconnect. Would it be useful to either add another case, or expand it to test the server going away during a client call?

That's a good idea. It should be legitimate for an IPC call on the server to close the connection which triggered the call, or for some other thread on the server to close the connection while a slow IPC call is executing. So it would make sense to add tests covering these cases and I'm a little surprised the change you tested doesn't work.

I should look into what the cause of the problem is but I suspect it is something simple like an unchecked null pointer or uncaught exception. I am beginning to learn that there are many different ways connections can be disconnected and interrupted and in capnproto each way tends to lead to a different piece of code throwing an exception. #183 is another case found today. It seems like I should probably add broader exception handlers instead of letting errors bubble up so far.

@TheCharlatan
Copy link
Collaborator

For what it's worth I think the code here could be merged. Further improving exception handling does not all have to happen in here. Getting the EventLoop locking cleanups in is a clear improvement on the baseline in my eyes.

@ryanofsky
Copy link
Collaborator Author

ryanofsky commented Jun 17, 2025

re: #160 (comment)

For what it's worth I think the code here could be merged. Further improving exception handling does not all have to happen in here. Getting the EventLoop locking cleanups in is a clear improvement on the baseline in my eyes.

Yes that makes a lot of sense. There are more improvements to be made but I don't think it's good for this PR to keep growing forever just because the next set of improvements depend on current ones.

So I'll plan to not make any more changes here and merge in the next day or so, but I would appreciate if you or Sjors or other reviewers could look at this a little more and add acks if appropriate before then. As mentioned #172 (comment) I would really like to get away from the practice of merging PRs here without current acks.

Copy link
Collaborator

@TheCharlatan TheCharlatan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 84cf56a

@ryanofsky ryanofsky merged commit 258a617 into bitcoin-core:master Jun 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bitcoin-node segfaults when interrupted
4 participants