fix(p2p_network/sync_handlers): sync handlers wait for DB op to finish causing p2p server's swarm to stall #2594

CHr15F0x · 2025-02-14T11:47:08Z

Problem

When a client makes N concurrent sync requests to pathfinder over the same sync protocol:

N > 3: pathfinder responds, but pending connections from other peers are held up
N > 7: pathfinder initially responds, but then stops responding to the first client and does not react to other peers trying to connect

The mechanism causing this issue

It turns out I was wrong the first time I approached this issue and wrongly blamed SelectAll in the swarm's connection pool for the stalling. The fact that SelectAll stalls in that connection pool is only a symptom of processing slowing down elsewhere. This is what actually happens:

a sync request is received, p2p_stream emits an InboundRequest event, which is caught in the main loop and then re-emitted outside the main loop as Inbound*SyncRequest through this channel
this Inbound*SyncRequest event is taken from the channel here and the appropriate sync handler is called
the sync handler waits until the DB operation finishes, so the event channel quickly becomes full
backpressure through the event channel is exerted on the main loop: as the channel fills up, subsequent attempts to forward an event from the main loop end up waiting on a full channel here
polling the swarm for newer events cannot proceed; we hang on swarm.next()
the swarm's internal event queues for each connection fill up, and SelectAll cannot move forward because it is not being polled fast enough
the swarm becomes unresponsive to external events
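The chain of steps above can be reproduced in miniature with a bounded channel from the standard library. The sketch below is not pathfinder's code: `SyncRequest`, `forwarded_before_stall`, and `offloaded_handles_all` are hypothetical names, plain threads stand in for async tasks, and a `Barrier` stands in for slow DB operations. It assumes the fix amounts to the handler no longer waiting for the DB operation before taking the next event from the channel.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::sync::{Arc, Barrier};
use std::thread;

/// Hypothetical stand-in for an `Inbound*SyncRequest` event.
struct SyncRequest(u32);

/// Models the stall: the handler is stuck waiting on the DB, so nothing
/// drains the event channel. Returns how many events the main loop could
/// forward before the next forward would have to wait.
fn forwarded_before_stall(capacity: usize, requests: u32) -> u32 {
    // `_rx` stays alive but is never read, like a handler blocked on the DB.
    let (tx, _rx) = sync_channel::<SyncRequest>(capacity);
    let mut forwarded = 0;
    for id in 0..requests {
        match tx.try_send(SyncRequest(id)) {
            Ok(()) => forwarded += 1,
            // Channel is full: a blocking send here would stall the main loop.
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    forwarded
}

/// Models the assumed fix: the dispatcher hands each request to its own
/// worker and goes straight back to the channel. Returns the number of
/// requests handled.
fn offloaded_handles_all(capacity: usize, requests: u32) -> u32 {
    let (tx, rx) = sync_channel::<SyncRequest>(capacity);
    // Every simulated DB operation finishes only after all requests have
    // been forwarded, which is the worst case for inline handling.
    let gate = Arc::new(Barrier::new(requests as usize + 1));
    let worker_gate = Arc::clone(&gate);

    let dispatcher = thread::spawn(move || {
        let mut workers = Vec::new();
        for SyncRequest(_id) in rx {
            let gate = Arc::clone(&worker_gate);
            // The slow part runs elsewhere; the dispatcher does not wait for it.
            workers.push(thread::spawn(move || {
                gate.wait();
            }));
        }
        let mut handled = 0;
        for worker in workers {
            worker.join().unwrap();
            handled += 1;
        }
        handled
    });

    for id in 0..requests {
        // Blocks only while the channel is momentarily full, because the
        // dispatcher keeps draining it.
        tx.send(SyncRequest(id)).unwrap();
    }
    drop(tx); // closes the channel so the dispatcher loop ends
    gate.wait(); // releases all simulated DB operations
    dispatcher.join().unwrap()
}
```

With a capacity of 4 and 10 requests, `forwarded_before_stall(4, 10)` returns 4, while `offloaded_handles_all(4, 10)` returns 10 even though no simulated DB operation completes until every request has been forwarded.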

Config changes

The default value of max_concurrent_streams is back to 100.

Tests performed

Snapshot: sepolia
Number of clients: 10

Streams per client	Blocks per request	Block range
100	1	0-100
100	10	0-1k
100	100	0-10k
200	100	0-20k
500	1	0-500
500	10	0-5k
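The block ranges in the table are consistent with streams per client multiplied by blocks per request (for example, 200 * 100 = 20k). The sketch below shows one way such a request plan could be derived. How the stress test client actually splits the range is not stated here, so `plan_requests`, `blocks_covered`, and the contiguous, non-overlapping chunking are assumptions.

```rust
/// Hypothetical helper: splits `0..streams * blocks_per_request` into one
/// half-open block range `(start, end)` per stream.
fn plan_requests(streams: u64, blocks_per_request: u64) -> Vec<(u64, u64)> {
    (0..streams)
        .map(|i| {
            let start = i * blocks_per_request;
            (start, start + blocks_per_request)
        })
        .collect()
}

/// Total number of blocks covered by a plan.
fn blocks_covered(plan: &[(u64, u64)]) -> u64 {
    plan.iter().map(|(start, end)| end - start).sum()
}
```

For the fourth row of the table, `plan_requests(200, 100)` yields 200 ranges, the last one being `(19_900, 20_000)`, and `blocks_covered` reports 20,000 blocks.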

t00ts

Kudos for the 🕵️ work

kkovaacs · 2025-02-18T08:06:53Z

But: if

futures::mpsc only notifies when successfully adding a value to the channel
and futures::SelectAll only polls futures that are notified

isn't it generally unsafe to use SelectAll with futures::mpsc? The rust-libp2p comment seems to assume that even if the mpsc channel is full things will degrade normally (it explicitly mentions using this mechanism as a way of applying back-pressure):

/// When the buffer is full, the background tasks of all connections will stall.
/// In this way, the consumers of network events exert back-pressure on
/// the network connection I/O.

vbar · 2025-02-18T08:11:59Z

isn't it generally unsafe to use SelectAll with futures::mpsc? The rust-libp2p comment seems to assume that even if the mpsc channel is full things will degrade normally (it explicitly mentions using this mechanism as a way of applying back-pressure):

Maybe we should raise it upstream, as an error? Either they should deny that it happens, or at least document it with a bigger warning...

kkovaacs

LGTM

…h causing p2p server's swarm to stall

This causes the libp2p swarm to stall given enough streams are utilized per connection during sync:
- a sync request event is handled
- sync handler waits till DB finishes, keeping event channel full
- backpressure through the event related channel is exerted on the main loop
- polling swarm for newer events cannot proceed
- swarms' internal event queues for each connection fill up
- swarm becomes unresponsive

CHr15F0x · 2025-02-19T11:53:05Z

I removed the solution that only mitigated the problem (i.e. inflating buffers in the swarm) and added a proper fix. The description of the PR is also updated.

kkovaacs

LGTM, thanks for finding the real cause!

CHr15F0x force-pushed the chris/p2p_stream_hangs2 branch 4 times, most recently from e40443d to 195335d Compare February 17, 2025 07:51

CHr15F0x marked this pull request as ready for review February 17, 2025 09:16

CHr15F0x requested a review from a team as a code owner February 17, 2025 09:16

vbar approved these changes Feb 17, 2025

t00ts approved these changes Feb 17, 2025

CHr15F0x force-pushed the chris/p2p_stream_hangs2 branch from dd6f35a to 0352a2b Compare February 17, 2025 16:04

kkovaacs approved these changes Feb 18, 2025

CHr15F0x marked this pull request as draft February 18, 2025 16:15

CHr15F0x changed the title ~~fix(p2p): insufficient size of per connection event buffer causes p2p server's swarm to stall~~ fix(p2p_network/sync_handlers): sync handlers wait for DB op to finish causing p2p server's swarm to stall Feb 19, 2025

CHr15F0x force-pushed the chris/p2p_stream_hangs2 branch 2 times, most recently from c1411fb to 94940a6 Compare February 19, 2025 11:27

CHr15F0x marked this pull request as ready for review February 19, 2025 11:32

CHr15F0x added 5 commits February 19, 2025 12:32

chore: fixup error message and comment

fe62c5d

test: add a stress test sync client

9e7312a

feat: increase max_concurrent_streams to 100

91a75a6

feat(stress_test_sync_client): add blocks-per-request option

b089f6e

CHr15F0x force-pushed the chris/p2p_stream_hangs2 branch from 94940a6 to b089f6e Compare February 19, 2025 11:32

CHr15F0x requested review from vbar, t00ts and kkovaacs February 19, 2025 12:05

vbar approved these changes Feb 19, 2025

kkovaacs approved these changes Feb 19, 2025

CHr15F0x merged commit b320ce6 into main Feb 19, 2025
8 checks passed

CHr15F0x deleted the chris/p2p_stream_hangs2 branch February 19, 2025 19:23
