
P2P: Add first-class runtime metrics for DHT & Network#156

Merged
mateeullahmalik merged 1 commit into master from p2pMetrics on Sep 5, 2025
Conversation

@mateeullahmalik
Collaborator

P2P: First-class runtime metrics for DHT, HashTable, BanList & Network (handlers + server)

Summary

This PR lands a complete observability pass across the P2P stack. We now expose
lock-safe, low-overhead metrics for:

  • DHT hot paths (batch store/retrieve; ban/skip counters; rolling success history)
  • HashTable shape & health (bucket load, saturation, refresh pressure)
  • BanList (current banned set, age, churn, increments)
  • Network
    • Server/accept loop & per-connection I/O (handshake failures, read timeouts, write failures)
    • Per-message-type handler counters (total/success/failure/timeout)
  • Snapshots are returned via DHT.Stats() (no API changes to message wire format).

The goal: quickly answer "Are we healthy?" and pinpoint specific failure modes (bad peers,
timeouts, bucket starvation, uneven routing) without external instrumentation.


What’s New

1) DHT metrics (rolling / hot-path)

  • Rolling store success points for IterateBatchStore:
    • requests, successful, success_rate, time
  • Rolling batch-retrieve points:
    • keys, required, found_local, found_network, duration, time
  • Hot-path counters:
    • hot_path_banned_skips — operations that skipped a peer due to a current ban
    • hot_path_ban_increments — times we incremented a peer’s ban counter (non-local failures)

Implementation notes:

  • Rolling windows are bounded (e.g., 48 points) & updated once per batch (not per key).
  • Counters are atomics with zero extra allocations on the hot path.
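The two implementation notes above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the struct and method names (`dhtMetrics`, `StorePoint`, `RecordStoreBatch`) are hypothetical, while the 48-point bound, the once-per-batch update, and the atomic hot-path counters come from the description.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// StorePoint is one rolling data point, recorded once per IterateBatchStore
// batch (not per key). Field names mirror the keys in the PR description.
type StorePoint struct {
	Requests    int
	Successful  int
	SuccessRate float64
	Time        time.Time
}

const maxPoints = 48 // window bound mentioned above

// dhtMetrics keeps lock-free hot-path counters and a mutex-guarded,
// bounded rolling window for the per-batch points.
type dhtMetrics struct {
	bannedSkips   atomic.Int64 // hot_path_banned_skips
	banIncrements atomic.Int64 // hot_path_ban_increments

	mu     sync.Mutex
	window []StorePoint // trimmed to maxPoints
}

// RecordStoreBatch appends one point per batch and trims the window.
func (m *dhtMetrics) RecordStoreBatch(requests, successful int) {
	rate := 0.0
	if requests > 0 {
		rate = 100 * float64(successful) / float64(requests)
	}
	m.mu.Lock()
	defer m.mu.Unlock()
	m.window = append(m.window, StorePoint{requests, successful, rate, time.Now()})
	if len(m.window) > maxPoints {
		m.window = m.window[len(m.window)-maxPoints:]
	}
}

func main() {
	var m dhtMetrics
	m.bannedSkips.Add(1) // hot path: just an atomic add, no allocation
	for i := 0; i < 50; i++ { // more batches than the window holds
		m.RecordStoreBatch(100, 96)
	}
	fmt.Println(len(m.window), m.bannedSkips.Load()) // window stays bounded at 48
}
```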

2) HashTable metrics (shape & pressure)

We add a snapshot of the routing table’s health. Key fields:

  • total_nodes — nodes we currently track across all buckets
  • buckets_occupied / buckets_total
  • avg_bucket_load / max_bucket_load — load vs. K
  • buckets_full — buckets at capacity K
  • buckets_near_full — buckets at K-1
  • buckets_due_refresh — buckets beyond defaultRefreshTime
  • ignored_overlap — count of nodes in buckets that also exist in BanList snapshot (signals route pollution)

This helps detect saturation (frequent replacement pressure), underfill (poor connectivity),
and staleness (too many buckets due for refresh).
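A minimal sketch of how such a snapshot can be derived from per-bucket node counts. The field names follow the JSON keys above; `snapshotBuckets`, the averaging over occupied buckets, and the bucket capacity K passed as a parameter are assumptions of this sketch, not confirmed details of the PR.

```go
package main

import "fmt"

// HashTableStats mirrors the snapshot keys listed above.
type HashTableStats struct {
	TotalNodes      int
	BucketsOccupied int
	BucketsTotal    int
	AvgBucketLoad   float64 // mean load over occupied buckets (assumption)
	MaxBucketLoad   int
	BucketsFull     int // buckets at capacity K
	BucketsNearFull int // buckets at K-1
}

// snapshotBuckets derives the stats from per-bucket node counts;
// k is the Kademlia bucket capacity.
func snapshotBuckets(sizes []int, k int) HashTableStats {
	s := HashTableStats{BucketsTotal: len(sizes)}
	for _, n := range sizes {
		s.TotalNodes += n
		if n > 0 {
			s.BucketsOccupied++
		}
		if n > s.MaxBucketLoad {
			s.MaxBucketLoad = n
		}
		switch {
		case n >= k:
			s.BucketsFull++
		case n == k-1:
			s.BucketsNearFull++
		}
	}
	if s.BucketsOccupied > 0 {
		s.AvgBucketLoad = float64(s.TotalNodes) / float64(s.BucketsOccupied)
	}
	return s
}

func main() {
	// One full bucket, one near-full, two partially filled, one empty.
	stats := snapshotBuckets([]int{20, 19, 5, 0, 3}, 20)
	fmt.Printf("%+v\n", stats)
}
```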

3) BanList metrics (state & churn)

Alongside the existing snapshot of banned nodes (ID/IP/Port/Count/Age), we now include:

  • banned_now — number of nodes over the ban threshold
  • ban_threshold — configured threshold value
  • avg_ban_age_seconds — mean age of banned entries
  • ban_increments_total — cumulative increments (tie back to hot path)
  • purge_runs_total — periodic purge cycles executed
  • banned_topN — top offenders by count (trimmed)

This lets us spot noisy neighbors, churn, and whether the threshold is calibrated.

4) Network metrics — per-message handler

For each message type (e.g., Ping, FindNode, BatchGetValues, …) we record:

  • total, success, failure, timeout

Timeouts are derived from the read deadline policy (we now use a helper to tailor
read deadlines to the message type & overall budget), not from general runtime errors.

5) Network server metrics — accept loop & I/O

We add server-side metrics that track connection lifecycle and I/O outcomes:

  • Accept loop:
    • accept_total
    • accept_temp_errors — EAGAIN/ECONNRESET/EINTR/ETIMEDOUT, etc. (back-off path)
    • accept_fatal_errors — unrecoverable errors that stop the server
  • Handshake:
    • handshake_success
    • handshake_failure
  • Active connection gauge:
    • active_conns_current, active_conns_peak
  • Conn I/O:
    • conn_read_timeout — handleConn read timeouts (connection kept open)
    • conn_read_failures — non-timeout decode/read errors (connection closed)
    • conn_write_failures — write/flush errors (connection closed)

These are independent from the handler metrics and reflect transport reliability.

6) Stats surfacing (no API change)

All metrics are now exposed through DHT.Stats():

// dht.go: Stats(ctx)
{
  "self": {...},
  "peers_count": 713,
  "peers": [...],
  "hashtable": {
    "total_nodes": 713,
    "buckets_occupied": 116,
    "buckets_total": 160,
    "avg_bucket_load": 5.9,
    "max_bucket_load": 20,
    "buckets_full": 19,
    "buckets_near_full": 11,
    "buckets_due_refresh": 7,
    "ignored_overlap": 12
  },
  "banlist": {
    "banned_now": 31,
    "ban_threshold": 1,
    "avg_ban_age_seconds": 4893,
    "ban_increments_total": 67,
    "purge_runs_total": 24,
    "banned_topN": [
      {"id":"...","ip":"1.2.3.4","port":4445,"count":4,"age_seconds":7200},
      {"id":"...","ip":"5.6.7.8","port":4445,"count":3,"age_seconds":3600}
    ],
    "snapshot": [ ... full snapshot trimmed ... ]
  },
  "network": {
    "server": {
      "accept_total": 10123,
      "accept_temp_errors": 14,
      "accept_fatal_errors": 0,
      "handshake_success": 10085,
      "handshake_failure": 22,
      "active_conns_current": 133,
      "active_conns_peak": 201,
      "conn_read_timeout": 318,
      "conn_read_failures": 9,
      "conn_write_failures": 7
    },
    "handlers": {
      "Ping":            {"total": 2134, "success": 2134, "failure": 0, "timeout": 0},
      "FindNode":        {"total": 1588, "success": 1587, "failure": 1, "timeout": 0},
      "BatchFindNode":   {"total": 437,  "success": 432,  "failure": 3, "timeout": 2},
      "FindValue":       {"total": 812,  "success": 805,  "failure": 4, "timeout": 3},
      "BatchFindValues": {"total": 121,  "success": 118,  "failure": 1, "timeout": 2},
      "BatchGetValues":  {"total": 97,   "success": 90,   "failure": 2, "timeout": 5},
      "StoreData":       {"total": 420,  "success": 417,  "failure": 2, "timeout": 1},
      "BatchStoreData":  {"total": 83,   "success": 81,   "failure": 1, "timeout": 1},
      "Replicate":       {"total": 64,   "success": 63,   "failure": 0, "timeout": 1}
    }
  },
  "dht_metrics": {
    "store_success_recent": [
      {"time":"...Z","requests":108,"successful":104,"success_rate":96.30},
      {"time":"...Z","requests":97, "successful":95, "success_rate":97.94}
    ],
    "batch_retrieve_recent": [
      {"time":"...Z","keys":2500,"required":2500,"found_local":1732,"found_network":768,"duration":"1.83s"}
    ],
    "hot_path_banned_skips": 412,
    "hot_path_ban_increments": 67
  },
  "database": {...}
}

@mateeullahmalik mateeullahmalik merged commit 0aba363 into master Sep 5, 2025
7 checks passed
@mateeullahmalik mateeullahmalik deleted the p2pMetrics branch September 5, 2025 11:53