fix(p2p_network/sync_handlers): sync handlers wait for DB op to finish causing p2p server's swarm to stall #2594
Kudos for the 🕵️ work
But: if
isn't it generally unsafe to use
Maybe we should raise it upstream, as an error? Either they should deny that this happens, or at least document it with a bigger warning...
LGTM
…h causing p2p server's swarm to stall
This causes the libp2p swarm to stall given enough streams are utilized per connection during sync:
- a sync request event is handled
- sync handler waits till DB finishes, keeping event channel full
- backpressure through the event related channel is exerted on the main loop
- polling swarm for newer events cannot proceed
- swarms' internal event queues for each connection fill up
- swarm becomes unresponsive
I removed the solution that only mitigated the problem (i.e. inflating buffers in the swarm) and added a proper fix. The description of the PR is also updated.
LGTM, thanks for finding the real cause!
Fixes: #2351
Problem
When a client makes N concurrent sync requests to pathfinder over the same sync protocol:
- N>3: pathfinder responds but pending connections from other peers are held up
- N>7: pathfinder initially responds but then just stops responding to the first client and does not react to other peers trying to connect
The mechanism causing this issue
It turns out I was wrong the first time I approached this issue and wrongly blamed SelectAll in swarm's connection pool for the stalling. The fact that SelectAll stalls in that connection pool is only a symptom of processing slowing down elsewhere. This is what actually happens:
- p2p_stream emits an InboundRequest event, which is caught in the main loop and then re-emitted outside the main loop as Inbound*SyncRequest through this channel
- the Inbound*SyncRequest event is taken from the channel here and a proper sync handler is called
swarm.next()
Config changes
The default value of max_concurrent_streams is back to 100.
Tests performed
Snapshot: sepolia
Number of clients: 10