Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(p2p_network/sync_handlers): sync handlers wait for DB op to finish causing p2p server's swarm to stall #2594

Merged
merged 5 commits into from
Feb 19, 2025

Conversation

CHr15F0x
Copy link
Member

@CHr15F0x CHr15F0x commented Feb 14, 2025

Fixes: #2351

Problem

When a client makes N concurrent sync requests to pathfinder over the same sync protocol:

  • N>3 pathfinder responds but pending connections from other peers are held up
  • N>7 pathfinder initially responds but then just stops responding to the first client and does not react to other peers trying to connect

The mechanism causing this issue

It turns out I was wrong the first time I approached this issue and wrongly accused SelectAll in swarm's connection pool for the stalling. The fact that SelectAll stalls in that connection pool is only a symptom of processing slowing down elsewhere, this is what actually happens:

Config changes

The default value of max_concurrent_streams is back to 100.

Tests performed

Snapshot: sepolia
Number of clients: 10

Streams per client Blocks per request Block range
100 1 0-100
100 10 0-1k
100 100 0-10k
200 100 0-20k
500 1 0-500
500 10 0-5k

@CHr15F0x CHr15F0x force-pushed the chris/p2p_stream_hangs2 branch 4 times, most recently from e40443d to 195335d Compare February 17, 2025 07:51
@CHr15F0x CHr15F0x marked this pull request as ready for review February 17, 2025 09:16
@CHr15F0x CHr15F0x requested a review from a team as a code owner February 17, 2025 09:16
Copy link
Contributor

@t00ts t00ts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kudos for the 🕵️ work

@CHr15F0x CHr15F0x force-pushed the chris/p2p_stream_hangs2 branch from dd6f35a to 0352a2b Compare February 17, 2025 16:04
@kkovaacs
Copy link
Contributor

But: if

  • futures::mpsc only notifies when successfully adding a value to the channel
  • and futures::SelectAll only polls futures that are notified

isn't it generally unsafe to use SelectAll with futures::mpsc? The rust-libp2p comment seems to assume that even if the mpsc channel is full things will degrade normally (it explicitly mentions using this mechanism as a way of applying back-pressure):

/// When the buffer is full, the background tasks of all connections will stall.
/// In this way, the consumers of network events exert back-pressure on
/// the network connection I/O.

@vbar
Copy link
Contributor

vbar commented Feb 18, 2025

isn't it generally unsafe to use SelectAll with futures::mpsc? The rust-libp2p comment seems to assume that even if the mpsc channel is full things will degrade normally (it explicitly mentions using this mechanism as a way of applying back-pressure):

Maybe we should raise it upstream, as an error? Either they should deny that happens, or at least document it with a bigger warning...

Copy link
Contributor

@kkovaacs kkovaacs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@CHr15F0x CHr15F0x marked this pull request as draft February 18, 2025 16:15
@CHr15F0x CHr15F0x changed the title fix(p2p): insufficient size of per connection event buffer causes p2p server's swarm to stall fix(p2p_network/sync_handlers): sync handlers wait for DB op to finish causing p2p server's swarm to stall Feb 19, 2025
@CHr15F0x CHr15F0x force-pushed the chris/p2p_stream_hangs2 branch 2 times, most recently from c1411fb to 94940a6 Compare February 19, 2025 11:27
@CHr15F0x CHr15F0x marked this pull request as ready for review February 19, 2025 11:32
…h causing p2p server's swarm to stall

This causes the libp2p swarm to stall given enough streams are utilized per connection during sync:
- a sync request event is handled
- sync handler waits till DB finishes, keeping event channel full
- backpressure through the event related channel is exerted on the main loop
- polling swarm for newer events cannot proceed
- swarms' internal event queues for each connection fill up
- swarm becomes unresponsive
@CHr15F0x CHr15F0x force-pushed the chris/p2p_stream_hangs2 branch from 94940a6 to b089f6e Compare February 19, 2025 11:32
@CHr15F0x
Copy link
Member Author

I removed the solution that mitigated the problem (ie. inflating buffers in the swarm) and added a proper fix. The description of the PR is also updated.

@CHr15F0x CHr15F0x requested review from vbar, t00ts and kkovaacs February 19, 2025 12:05
Copy link
Contributor

@kkovaacs kkovaacs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for finding the real cause!

@CHr15F0x CHr15F0x merged commit b320ce6 into main Feb 19, 2025
8 checks passed
@CHr15F0x CHr15F0x deleted the chris/p2p_stream_hangs2 branch February 19, 2025 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Investigate libp2p swarm halting after many concurrent streams are open in p2p_stream from a single peer.
4 participants