P2P: Add first-class runtime metrics for DHT & Network #156
Merged: mateeullahmalik merged 1 commit into master on Sep 5, 2025
j-rafique approved these changes on Sep 5, 2025
P2P: First-class runtime metrics for DHT, HashTable, BanList & Network (handlers + server)
Summary
This PR lands a complete observability pass across the P2P stack. We now expose
lock-safe, low-overhead metrics for the DHT, HashTable, BanList, and Network
(handlers + server), all surfaced through DHT.Stats() (no API changes to the message wire format).
The goal: quickly answer “Are we healthy?” and pinpoint issues (bad peers, timeouts,
bucket starvation, uneven routing) without external instrumentation.
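To illustrate what "lock-safe, low-overhead" can look like in practice, here is a minimal sketch of a current/peak gauge built on sync/atomic (Go 1.19+). The type ConnGauge and its methods are hypothetical names for illustration; this is not the code added by this PR, only one common way to keep hot-path counters free of mutexes.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// ConnGauge is a hypothetical lock-free gauge tracking the current and
// peak number of active connections. Names are illustrative only.
type ConnGauge struct {
	current atomic.Int64
	peak    atomic.Int64
}

// Inc records a new connection and raises the peak if it was exceeded.
func (g *ConnGauge) Inc() {
	cur := g.current.Add(1)
	for {
		old := g.peak.Load()
		// Stop if the peak is already high enough, or if our CAS wins.
		if cur <= old || g.peak.CompareAndSwap(old, cur) {
			return
		}
	}
}

// Dec records a closed connection.
func (g *ConnGauge) Dec() { g.current.Add(-1) }

// Snapshot reads both values without taking a lock.
func (g *ConnGauge) Snapshot() (current, peak int64) {
	return g.current.Load(), g.peak.Load()
}

func main() {
	var g ConnGauge
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.Inc()
			g.Dec()
		}()
	}
	wg.Wait()
	cur, peak := g.Snapshot()
	fmt.Println("current:", cur, "peak:", peak)
}
```

The compare-and-swap loop only retries when another goroutine raised the peak concurrently, so the common path is a single atomic add plus a load.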
What’s New
1) DHT metrics (rolling / hot-path)
- IterateBatchStore: requests, successful, success_rate, time
- keys, required, found_local, found_network, duration, time
- hot_path_banned_skips: operations that skipped a peer due to a current ban
- hot_path_ban_increments: times we incremented a peer’s ban counter (non-local failures)
2) HashTable metrics (shape & pressure)
We add a snapshot of the routing table’s health. Key fields:
- total_nodes: nodes we currently track across all buckets
- buckets_occupied / buckets_total
- avg_bucket_load / max_bucket_load: load vs. K
- buckets_full: buckets at capacity K
- buckets_near_full: buckets at K-1
- buckets_due_refresh: buckets beyond defaultRefreshTime
- ignored_overlap: count of nodes in buckets that also exist in the BanList snapshot (signals route pollution)
This helps detect saturation (frequent replacement pressure), underfill (poor connectivity),
and staleness (too many buckets due for refresh).
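As a rough sketch of how such a snapshot could be derived, the following computes a subset of the fields above from a slice of buckets. The Bucket and TableStats types, the Snapshot function, and the choice to average load over occupied buckets only are assumptions made for illustration; the actual implementation may define these differently.

```go
package main

import (
	"fmt"
	"time"
)

// Bucket is a hypothetical stand-in for one routing-table bucket.
type Bucket struct {
	Nodes       int
	LastRefresh time.Time
}

// TableStats mirrors a subset of the snapshot fields (names assumed).
type TableStats struct {
	TotalNodes        int
	BucketsOccupied   int
	BucketsTotal      int
	AvgBucketLoad     float64 // assumption: mean fill ratio over occupied buckets
	MaxBucketLoad     float64 // highest fill ratio (nodes / K) of any bucket
	BucketsFull       int
	BucketsNearFull   int
	BucketsDueRefresh int
}

// Snapshot walks the buckets once and derives shape and pressure figures.
func Snapshot(buckets []Bucket, k int, refresh time.Duration, now time.Time) TableStats {
	s := TableStats{BucketsTotal: len(buckets)}
	if k <= 0 {
		return s
	}
	for _, b := range buckets {
		s.TotalNodes += b.Nodes
		if b.Nodes > 0 {
			s.BucketsOccupied++
		}
		switch {
		case b.Nodes >= k:
			s.BucketsFull++
		case k > 1 && b.Nodes == k-1:
			s.BucketsNearFull++
		}
		if load := float64(b.Nodes) / float64(k); load > s.MaxBucketLoad {
			s.MaxBucketLoad = load
		}
		if now.Sub(b.LastRefresh) > refresh {
			s.BucketsDueRefresh++
		}
	}
	if s.BucketsOccupied > 0 {
		s.AvgBucketLoad = float64(s.TotalNodes) / float64(s.BucketsOccupied*k)
	}
	return s
}

// demoTableStats builds a small fixed table so the output is deterministic.
func demoTableStats() TableStats {
	now := time.Date(2025, 9, 5, 12, 0, 0, 0, time.UTC)
	buckets := []Bucket{
		{Nodes: 20, LastRefresh: now.Add(-10 * time.Minute)}, // full
		{Nodes: 19, LastRefresh: now.Add(-2 * time.Hour)},    // near full, stale
		{Nodes: 0, LastRefresh: now.Add(-2 * time.Hour)},     // empty, stale
		{Nodes: 5, LastRefresh: now},
	}
	return Snapshot(buckets, 20, time.Hour, now)
}

func main() {
	fmt.Printf("%+v\n", demoTableStats())
}
```

In this toy table, one full and one near-full bucket out of four would hint at replacement pressure, while two buckets past the refresh interval would hint at staleness.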
3) BanList metrics (state & churn)
Alongside the existing snapshot of banned nodes (ID/IP/Port/Count/Age), we now include:
- banned_now: number of nodes over the ban threshold
- ban_threshold: configured threshold value
- avg_ban_age_seconds: mean age of banned entries
- ban_increments_total: cumulative increments (tie back to hot path)
- purge_runs_total: periodic purge cycles executed
- banned_topN: top offenders by count (trimmed)
This lets us spot noisy neighbors, churn, and whether the threshold is calibrated.
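A sketch of how the state-derived fields might be computed from a list of ban entries follows. BanEntry, BanStats, and Summarize are hypothetical names, and two details are assumptions rather than facts about this PR: that "over the threshold" means count > threshold, and that the average age and top-N list cover only currently banned entries.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// BanEntry is a hypothetical ban-list record.
type BanEntry struct {
	ID        string
	Count     int
	CreatedAt time.Time
}

// BanStats holds the state-derived fields (names assumed).
type BanStats struct {
	BannedNow        int
	BanThreshold     int
	AvgBanAgeSeconds float64
	TopN             []BanEntry
}

// Summarize derives banned_now, avg_ban_age_seconds and banned_topN.
func Summarize(entries []BanEntry, threshold, topN int, now time.Time) BanStats {
	s := BanStats{BanThreshold: threshold}
	var totalAge time.Duration
	for _, e := range entries {
		if e.Count > threshold { // assumption: strictly over the threshold
			s.BannedNow++
			totalAge += now.Sub(e.CreatedAt)
			s.TopN = append(s.TopN, e)
		}
	}
	if s.BannedNow > 0 {
		s.AvgBanAgeSeconds = totalAge.Seconds() / float64(s.BannedNow)
	}
	// Highest counts first; stable so ties keep their input order.
	sort.SliceStable(s.TopN, func(i, j int) bool {
		return s.TopN[i].Count > s.TopN[j].Count
	})
	if len(s.TopN) > topN {
		s.TopN = s.TopN[:topN]
	}
	return s
}

// demoBanStats uses fixed timestamps so the result is deterministic.
func demoBanStats() BanStats {
	now := time.Date(2025, 9, 5, 12, 0, 0, 0, time.UTC)
	entries := []BanEntry{
		{ID: "a", Count: 5, CreatedAt: now.Add(-60 * time.Second)},
		{ID: "b", Count: 2, CreatedAt: now.Add(-30 * time.Second)},
		{ID: "c", Count: 9, CreatedAt: now.Add(-120 * time.Second)},
		{ID: "d", Count: 4, CreatedAt: now.Add(-30 * time.Second)},
	}
	return Summarize(entries, 3, 2, now)
}

func main() {
	fmt.Printf("%+v\n", demoBanStats())
}
```

Comparing banned_now against the size of the top-N list and the average age gives a quick read on whether a few peers dominate the failures or the threshold is catching many peers at once.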
4) Network metrics — per-message handler
For each message type (e.g., Ping, FindNode, BatchGetValues, …) we record: total, success, failure, timeout.
Timeouts are derived from the read deadline policy (we now use a helper to tailor
read deadlines to the message type & overall budget), not from general runtime errors.
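The distinction between timeouts and other failures can be sketched as below. HandlerMetrics, HandlerCounters, and Record are hypothetical names, and treating timeout and failure as disjoint buckets is an assumption. The classification relies on the standard net.Error interface, whose Timeout() method reports deadline expiry on connection reads.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"sync"
)

// HandlerCounters holds the four per-message-type counters.
type HandlerCounters struct {
	Total, Success, Failure, Timeout uint64
}

// HandlerMetrics is a hypothetical registry keyed by message type.
type HandlerMetrics struct {
	mu     sync.Mutex
	byType map[string]*HandlerCounters
}

func NewHandlerMetrics() *HandlerMetrics {
	return &HandlerMetrics{byType: make(map[string]*HandlerCounters)}
}

// Record classifies the outcome of one handled message.
func (m *HandlerMetrics) Record(msgType string, err error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	c := m.byType[msgType]
	if c == nil {
		c = &HandlerCounters{}
		m.byType[msgType] = c
	}
	c.Total++
	var ne net.Error
	switch {
	case err == nil:
		c.Success++
	case errors.As(err, &ne) && ne.Timeout():
		c.Timeout++ // deadline expiry only, not generic errors
	default:
		c.Failure++
	}
}

// Snapshot returns a copy that is safe to read without the lock.
func (m *HandlerMetrics) Snapshot() map[string]HandlerCounters {
	m.mu.Lock()
	defer m.mu.Unlock()
	out := make(map[string]HandlerCounters, len(m.byType))
	for k, v := range m.byType {
		out[k] = *v
	}
	return out
}

// demoHandlerMetrics records one success, one timeout and one failure.
func demoHandlerMetrics() *HandlerMetrics {
	m := NewHandlerMetrics()
	m.Record("Ping", nil)
	m.Record("Ping", fmt.Errorf("read: %w", os.ErrDeadlineExceeded))
	m.Record("FindNode", errors.New("decode error"))
	return m
}

func main() {
	fmt.Printf("%+v\n", demoHandlerMetrics().Snapshot())
}
```

Because only errors reporting Timeout() land in the timeout bucket, a spike there points at deadline tuning or slow peers rather than malformed messages.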
5) Network server metrics — accept loop & I/O
We add server-side metrics that track connection lifecycle and I/O outcomes:
- accept_total
- accept_temp_errors: EAGAIN/ECONNRESET/EINTR/ETIMEDOUT, etc. (back-off path)
- accept_fatal_errors: unrecoverable errors that stop the server
- handshake_success
- handshake_failure
- active_conns_current, active_conns_peak
- conn_read_timeout: handleConn read timeouts (connection kept open)
- conn_read_failures: non-timeout decode/read errors (connection closed)
- conn_write_failures: write/flush errors (connection closed)
6) Stats surfacing (no API change)
All metrics are now exposed through DHT.Stats():