
P2P: Add first-class runtime metrics for DHT & Network#156

Merged
mateeullahmalik merged 1 commit into master from p2pMetrics on Sep 5, 2025
Conversation

@mateeullahmalik
Collaborator

P2P: First-class runtime metrics for DHT, HashTable, BanList & Network (handlers + server)

Summary

This PR lands a complete observability pass across the P2P stack. We now expose
lock-safe, low-overhead metrics for:

  • DHT hot paths (batch store/retrieve; ban/skip counters; rolling success history)
  • HashTable shape & health (bucket load, saturation, refresh pressure)
  • BanList (current banned set, age, churn, increments)
  • Network
    • Server/accept loop & per-connection I/O (handshake failures, read timeouts, write failures)
    • Per-message-type handler counters (total/success/failure/timeout)
  • Snapshots are returned via DHT.Stats() (no API changes to message wire format).

The goal: quickly answer "Are we healthy?" and pinpoint specific failure modes (bad peers,
timeouts, bucket starvation, uneven routing) without external instrumentation.


What’s New

1) DHT metrics (rolling / hot-path)

  • Rolling store success points for IterateBatchStore:
    • requests, successful, success_rate, time
  • Rolling batch-retrieve points:
    • keys, required, found_local, found_network, duration, time
  • Hot-path counters:
    • hot_path_banned_skips — operations that skipped a peer due to a current ban
    • hot_path_ban_increments — times we incremented a peer’s ban counter (non-local failures)

Implementation notes:

  • Rolling windows are bounded (e.g., 48 points) & updated once per batch (not per key).
  • Counters are atomics with zero extra allocations on the hot path.
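The two implementation notes above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the struct and method names (`dhtMetrics`, `StorePoint`, `RecordStoreBatch`) are hypothetical, while the 48-point bound, the once-per-batch update, and the atomic hot-path counters come from the description.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// StorePoint is one rolling data point, recorded once per IterateBatchStore
// batch (not per key). Field names mirror the keys in the PR description.
type StorePoint struct {
	Requests    int
	Successful  int
	SuccessRate float64
	Time        time.Time
}

const maxPoints = 48 // window bound mentioned above

// dhtMetrics keeps lock-free hot-path counters and a mutex-guarded,
// bounded rolling window for the per-batch points.
type dhtMetrics struct {
	bannedSkips   atomic.Int64 // hot_path_banned_skips
	banIncrements atomic.Int64 // hot_path_ban_increments

	mu     sync.Mutex
	window []StorePoint // trimmed to maxPoints
}

// RecordStoreBatch appends one point per batch and trims the window.
func (m *dhtMetrics) RecordStoreBatch(requests, successful int) {
	rate := 0.0
	if requests > 0 {
		rate = 100 * float64(successful) / float64(requests)
	}
	m.mu.Lock()
	defer m.mu.Unlock()
	m.window = append(m.window, StorePoint{requests, successful, rate, time.Now()})
	if len(m.window) > maxPoints {
		m.window = m.window[len(m.window)-maxPoints:]
	}
}

func main() {
	var m dhtMetrics
	m.bannedSkips.Add(1) // hot path: just an atomic add, no allocation
	for i := 0; i < 50; i++ { // more batches than the window holds
		m.RecordStoreBatch(100, 96)
	}
	fmt.Println(len(m.window), m.bannedSkips.Load()) // window stays bounded at 48
}
```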

2) HashTable metrics (shape & pressure)

We add a snapshot of the routing table’s health. Key fields:

  • total_nodes — nodes we currently track across all buckets
  • buckets_occupied / buckets_total
  • avg_bucket_load / max_bucket_load — load vs. K
  • buckets_full — buckets at capacity K
  • buckets_near_full — buckets at K-1
  • buckets_due_refresh — buckets beyond defaultRefreshTime
  • ignored_overlap — count of nodes in buckets that also exist in BanList snapshot (signals route pollution)

This helps detect saturation (frequent replacement pressure), underfill (poor connectivity),
and staleness (too many buckets due for refresh).
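A minimal sketch of how such a snapshot can be derived from per-bucket node counts. The field names follow the JSON keys above; `snapshotBuckets`, the averaging over occupied buckets, and the bucket capacity K passed as a parameter are assumptions of this sketch, not confirmed details of the PR.

```go
package main

import "fmt"

// HashTableStats mirrors the snapshot keys listed above.
type HashTableStats struct {
	TotalNodes      int
	BucketsOccupied int
	BucketsTotal    int
	AvgBucketLoad   float64 // mean load over occupied buckets (assumption)
	MaxBucketLoad   int
	BucketsFull     int // buckets at capacity K
	BucketsNearFull int // buckets at K-1
}

// snapshotBuckets derives the stats from per-bucket node counts;
// k is the Kademlia bucket capacity.
func snapshotBuckets(sizes []int, k int) HashTableStats {
	s := HashTableStats{BucketsTotal: len(sizes)}
	for _, n := range sizes {
		s.TotalNodes += n
		if n > 0 {
			s.BucketsOccupied++
		}
		if n > s.MaxBucketLoad {
			s.MaxBucketLoad = n
		}
		switch {
		case n >= k:
			s.BucketsFull++
		case n == k-1:
			s.BucketsNearFull++
		}
	}
	if s.BucketsOccupied > 0 {
		s.AvgBucketLoad = float64(s.TotalNodes) / float64(s.BucketsOccupied)
	}
	return s
}

func main() {
	// One full bucket, one near-full, two partially filled, one empty.
	stats := snapshotBuckets([]int{20, 19, 5, 0, 3}, 20)
	fmt.Printf("%+v\n", stats)
}
```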

3) BanList metrics (state & churn)

Alongside the existing snapshot of banned nodes (ID/IP/Port/Count/Age), we now include:

  • banned_now — number of nodes over the ban threshold
  • ban_threshold — configured threshold value
  • avg_ban_age_seconds — mean age of banned entries
  • ban_increments_total — cumulative increments (tie back to hot path)
  • purge_runs_total — periodic purge cycles executed
  • banned_topN — top offenders by count (trimmed)

This lets us spot noisy neighbors, churn, and whether the threshold is calibrated.

4) Network metrics — per-message handler

For each message type (e.g., Ping, FindNode, BatchGetValues, …) we record:

  • total, success, failure, timeout

Timeouts are derived from the read deadline policy (we now use a helper to tailor
read deadlines to the message type & overall budget), not from general runtime errors.

5) Network server metrics — accept loop & I/O

We add server-side metrics that track connection lifecycle and I/O outcomes:

  • Accept loop:
    • accept_total
    • accept_temp_errors — EAGAIN/ECONNRESET/EINTR/ETIMEDOUT, etc. (back-off path)
    • accept_fatal_errors — unrecoverable errors that stop the server
  • Handshake:
    • handshake_success
    • handshake_failure
  • Active connection gauge:
    • active_conns_current, active_conns_peak
  • Conn I/O:
    • conn_read_timeout — handleConn read timeouts (connection kept open)
    • conn_read_failures — non-timeout decode/read errors (connection closed)
    • conn_write_failures — write/flush errors (connection closed)

These are independent from the handler metrics and reflect transport reliability.

6) Stats surfacing (no API change)

All metrics are now exposed through DHT.Stats():

// dht.go: Stats(ctx)
{
  "self": {...},
  "peers_count": 713,
  "peers": [...],
  "hashtable": {
    "total_nodes": 713,
    "buckets_occupied": 116,
    "buckets_total": 160,
    "avg_bucket_load": 5.9,
    "max_bucket_load": 20,
    "buckets_full": 19,
    "buckets_near_full": 11,
    "buckets_due_refresh": 7,
    "ignored_overlap": 12
  },
  "banlist": {
    "banned_now": 31,
    "ban_threshold": 1,
    "avg_ban_age_seconds": 4893,
    "ban_increments_total": 67,
    "purge_runs_total": 24,
    "banned_topN": [
      {"id":"...","ip":"1.2.3.4","port":4445,"count":4,"age_seconds":7200},
      {"id":"...","ip":"5.6.7.8","port":4445,"count":3,"age_seconds":3600}
    ],
    "snapshot": [ ... full snapshot trimmed ... ]
  },
  "network": {
    "server": {
      "accept_total": 10123,
      "accept_temp_errors": 14,
      "accept_fatal_errors": 0,
      "handshake_success": 10085,
      "handshake_failure": 22,
      "active_conns_current": 133,
      "active_conns_peak": 201,
      "conn_read_timeout": 318,
      "conn_read_failures": 9,
      "conn_write_failures": 7
    },
    "handlers": {
      "Ping":            {"total": 2134, "success": 2134, "failure": 0, "timeout": 0},
      "FindNode":        {"total": 1588, "success": 1587, "failure": 1, "timeout": 0},
      "BatchFindNode":   {"total": 437,  "success": 432,  "failure": 3, "timeout": 2},
      "FindValue":       {"total": 812,  "success": 805,  "failure": 4, "timeout": 3},
      "BatchFindValues": {"total": 121,  "success": 118,  "failure": 1, "timeout": 2},
      "BatchGetValues":  {"total": 97,   "success": 90,   "failure": 2, "timeout": 5},
      "StoreData":       {"total": 420,  "success": 417,  "failure": 2, "timeout": 1},
      "BatchStoreData":  {"total": 83,   "success": 81,   "failure": 1, "timeout": 1},
      "Replicate":       {"total": 64,   "success": 63,   "failure": 0, "timeout": 1}
    }
  },
  "dht_metrics": {
    "store_success_recent": [
      {"time":"...Z","requests":108,"successful":104,"success_rate":96.30},
      {"time":"...Z","requests":97, "successful":95, "success_rate":97.94}
    ],
    "batch_retrieve_recent": [
      {"time":"...Z","keys":2500,"required":2500,"found_local":1732,"found_network":768,"duration":"1.83s"}
    ],
    "hot_path_banned_skips": 412,
    "hot_path_ban_increments": 67
  },
  "database": {...}
}

@mateeullahmalik mateeullahmalik merged commit 0aba363 into master Sep 5, 2025
7 checks passed
@mateeullahmalik mateeullahmalik deleted the p2pMetrics branch September 5, 2025 11:53