Skip to content

P2P: fix handshake race, harden conn pooling, and speed up cascade hot-path#148

Merged
mateeullahmalik merged 9 commits intomasterfrom
fixP2PConncurrencyIssues
Sep 2, 2025
Merged

P2P: fix handshake race, harden conn pooling, and speed up cascade hot-path#148
mateeullahmalik merged 9 commits intomasterfrom
fixP2PConncurrencyIssues

Conversation

@mateeullahmalik
Copy link
Copy Markdown
Collaborator

P2P: fix handshake race, harden conn pooling, sanitize addresses, and speed up cascade hot-path

Summary

This PR removes the intermittent timeouts and “remote identity mismatch” on healthy peers and makes the cascade hot-path (BatchFindNode + BatchStoreData) consistently fast under churn.

Themes

  • Correctness: per-connection credentials (no shared mutation), full-RPC serialization per connection, safe pool keys, address hygiene.
  • Performance: per-operation deadlines, no global/stale deadlines on pooled conns, higher safe concurrency, tighter timeouts.
  • Operability: fewer false ignorelists, cleaner routing, stable discovery/store behavior.

What changed

  • TLS creds (handshake)

    • Implemented LumeraTC.Clone() (deep copies ProtocolInfo, clears remoteIdentity).
    • NewSecureClientConn now clones creds per dial and sets remoteIdentity on the clone before ClientHandshake (no shared mutation/race).
  • Connection pooling / RPC pipeline

    • Pool key now uses base58(Receiver.ID)@ip:port (no raw-byte strings).
    • Dial occurs outside the pool lock with a double-check add to avoid blocking and races.
    • One in-flight RPC per connection: for pooled connWrapper, we lock across write+read (prevents response cross-talk).
    • Per-operation deadlines:
      • SetWriteDeadline(now+3s)Write
      • SetReadDeadline(now+timeout)decode
      • SetDeadline(time.Time{}) after success (clear for reuse)
    • On I/O error, evict/close the conn after unlocking (no deadlock).
  • Address hygiene

    • Reject 0.0.0.0, loopback, link-local, and (public overlay) RFC1918/private IPs on ingest (addKnownNodes).
    • Prevent publishing bind/private addresses in newMessage; prefer configured Advertise IP (never broadcast 0.0.0.0).
  • Hot-path tuning

    • Timeouts: BatchStoreData12s (from 60s), BatchFindNode12–15s.
    • Concurrency: batchFindNode up to 64 in-flight; batchStoreNetwork up to 16 in-flight.

Why this helps

  • Handshake stability: per-dial cloned creds remove the remote-identity race → handshake timeouts on healthy peers drop sharply.
  • RPC correctness: full-RPC lock eliminates interleaved responses → no random decode stalls.
  • Lower tail latency: fast write deadlines, right-sized read budgets, and cleared deadlines on reuse stop “mystery” timeouts.
  • Cleaner routing: garbage/self/private endpoints are filtered out before the hot path.
  • Throughput: dialing doesn’t hold the pool lock; higher safe concurrency keeps fan-outs fast.

Risks & mitigations

  • Clone correctness: ensure Clone() deep-copies ProtocolInfo and doesn’t share mutable state.
    • Mitigation: unit test for Clone(); code review focus here.
  • Higher concurrency: increased outbound fan-out may raise egress/FD usage.
    • Mitigation: capped concurrency (64/16), short timeouts; monitor system limits.
  • Private overlays: address filters may exclude RFC1918 if running in a private mesh.
    • Mitigation: gate via config; document behavior.
  • Advertise IP: require ops to set an externally reachable address.

Testing

  • Unit
    • Clone() deep-copy behavior; remoteIdentity not retained.
    • Remote-identity race: two concurrent dials with different IDs → no flake under -race.
    • Per-connection serialization: two goroutines to same peer succeed without cross-talk.
  • Integration
    • BatchFindNode fan-out with mixed healthy/offline peers converges within timeout window.
    • BatchStoreData over mixed targets completes in seconds; fewer ignorelist increments.
  • Manual
    • Verified no 0.0.0.0/loopback/private IPs in routing tables.
    • Verified pool evictions only on I/O errors and after unlock.

TL;DR: Eliminates the handshake race, hardens pooling & deadlines, sanitizes addresses, and dials in concurrency/timeouts. Expect a big drop in false timeouts and a seconds-level cascade hot path.

@mateeullahmalik mateeullahmalik force-pushed the fixP2PConncurrencyIssues branch from 98e1e20 to cb2ceab Compare September 1, 2025 15:16
j-rafique
j-rafique previously approved these changes Sep 1, 2025
@mateeullahmalik mateeullahmalik dismissed j-rafique’s stale review September 1, 2025 21:18

The merge-base changed after approval.

@mateeullahmalik mateeullahmalik force-pushed the fixP2PConncurrencyIssues branch from 978ab0a to 5e0df94 Compare September 1, 2025 21:18
@mateeullahmalik mateeullahmalik force-pushed the fixP2PConncurrencyIssues branch from afc1bb9 to 1970313 Compare September 2, 2025 07:36
j-rafique
j-rafique previously approved these changes Sep 2, 2025
@mateeullahmalik mateeullahmalik merged commit e123eb2 into master Sep 2, 2025
12 checks passed
@mateeullahmalik mateeullahmalik deleted the fixP2PConncurrencyIssues branch September 5, 2025 11:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants