# P2P Metrics Capture — What Each Field Means and Where It’s Collected

This guide explains every field we emit in Cascade events, how it is measured, and exactly where it is captured in the code.

The design is intentionally minimal:
- Metrics are collected only for the first pass of Register (store) and for the active Download operation.
- P2P APIs return errors only; per‑RPC details are captured via a small metrics package (`pkg/p2pmetrics`).
- No aggregation; we only group raw RPC attempts by IP.

---

## Store (Register) Event

Event payload shape:

```json
{
  "store": {
    "duration_ms": 9876,
    "symbols_first_pass": 220,
    "symbols_total": 1200,
    "id_files_count": 14,
    "calls_by_ip": {
      "10.0.0.5": [
        {"ip": "10.0.0.5", "address": "A:4445", "keys": 100, "success": true, "duration_ms": 120},
        {"ip": "10.0.0.5", "address": "A:4445", "keys": 120, "success": false, "error": "timeout", "duration_ms": 300}
      ]
    }
  }
}
```
| 32 | + |
### Fields

- `store.duration_ms`
  - Meaning: End‑to‑end elapsed time of the first‑pass store phase (Register’s storage section only).
  - Where captured: `supernode/services/cascade/adaptors/p2p.go`
    - A `time.Now()` timestamp is taken just before the first‑pass store function, and the elapsed time is measured when it returns.

- `store.symbols_first_pass`
  - Meaning: Number of symbols sent during the Register first pass (across the combined first batch and any immediate first‑pass symbol batches).
  - Where captured: `supernode/services/cascade/adaptors/p2p.go` via `p2pmetrics.SetStoreSummary(...)` using the value returned by `storeCascadeSymbolsAndData`.

- `store.symbols_total`
  - Meaning: Total number of symbols available in the symbol directory (before sampling). Used to contextualize the first‑pass coverage.
  - Where captured: Computed in `storeCascadeSymbolsAndData` and included in `SetStoreSummary`.

- `store.id_files_count`
  - Meaning: Number of redundant metadata files (ID files) sent in the first combined batch.
  - Where captured: `len(req.IDFiles)` in `StoreArtefacts`, passed to `SetStoreSummary`.
| 51 | + |
- `store.calls_by_ip`
  - Meaning: All raw network store RPC attempts, grouped by node IP.
  - Each array entry is a single RPC attempt with:
    - `ip` — Node IP (falls back to `address` if missing).
    - `address` — Node address string, `IP:port`.
    - `keys` — Number of items in that RPC attempt (metadata plus first symbols for the first combined batch; symbols only for subsequent batches within the first pass).
    - `success` — True if the node acknowledged the store successfully.
    - `error` — Error string, if any; omitted on success.
    - `duration_ms` — RPC duration in milliseconds.
  - Where captured:
    - Emission point (P2P): `p2p/kademlia/dht.go::IterateBatchStore(...)`
      - After each node RPC returns, we call `p2pmetrics.RecordStore(taskID, Call{...})`.
      - `taskID` is read from the context via `p2pmetrics.TaskIDFromContext(ctx)`.
    - Grouping: `pkg/p2pmetrics/metrics.go`
      - `StartStoreCapture(taskID)` enables capture; `StopStoreCapture(taskID)` disables it.
      - Calls are grouped by `ip` (falling back to `address`) with no further aggregation.
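The grouping rule above can be sketched as follows. The `Call` struct fields here are assumptions inferred from the JSON keys in the payload, not the actual `pkg/p2pmetrics` definitions:

```go
package main

import "fmt"

// Call mirrors one per-RPC attempt entry from the event payload.
// Field names follow the JSON keys shown above (an assumption, not
// the real pkg/p2pmetrics type).
type Call struct {
	IP         string
	Address    string
	Keys       int
	Success    bool
	Error      string
	DurationMS int64
}

// groupByIP groups raw RPC attempts by node IP, falling back to the
// address when the IP is missing. No further aggregation is applied,
// matching the collector's behavior described in the text.
func groupByIP(calls []Call) map[string][]Call {
	grouped := make(map[string][]Call)
	for _, c := range calls {
		key := c.IP
		if key == "" {
			key = c.Address // fallback when IP is unavailable
		}
		grouped[key] = append(grouped[key], c)
	}
	return grouped
}

func main() {
	calls := []Call{
		{IP: "10.0.0.5", Address: "A:4445", Keys: 100, Success: true, DurationMS: 120},
		{IP: "10.0.0.5", Address: "A:4445", Keys: 120, Success: false, Error: "timeout", DurationMS: 300},
		{Address: "B:4445", Keys: 50, Success: true, DurationMS: 90}, // no IP: grouped under address
	}
	g := groupByIP(calls)
	fmt.Println(len(g["10.0.0.5"]), len(g["B:4445"]))
}
```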
| 68 | + |
### First‑Pass Success Threshold

- Internal enforcement only: if the DHT first‑pass success rate falls below 75%, `IterateBatchStore` returns an error.
- No success rate is emitted in events; only the error flow is affected.
- Code: `p2p/kademlia/dht.go::IterateBatchStore`.
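A minimal sketch of that internal check, assuming a simple succeeded/attempted count (the real `IterateBatchStore` computes its rate from its own bookkeeping; the function and constant names here are hypothetical):

```go
package main

import (
	"errors"
	"fmt"
)

// firstPassThreshold is the 75% success-rate floor described above.
// The constant name is an assumption for illustration.
const firstPassThreshold = 0.75

// checkFirstPass returns an error when fewer than 75% of first-pass
// store RPCs succeeded. Only the error flow is affected; no rate is
// emitted as a metric.
func checkFirstPass(succeeded, attempted int) error {
	if attempted == 0 {
		return errors.New("no store attempts made")
	}
	rate := float64(succeeded) / float64(attempted)
	if rate < firstPassThreshold {
		return fmt.Errorf("first-pass success rate %.2f below %.2f", rate, firstPassThreshold)
	}
	return nil
}

func main() {
	fmt.Println(checkFirstPass(8, 10) == nil)  // 80% passes
	fmt.Println(checkFirstPass(7, 10) != nil)  // 70% fails
}
```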
| 74 | + |
### Scope Limits

- The background worker (which continues storing remaining symbols) is NOT captured — we don’t set a metrics task ID on those paths.
| 78 | + |
---

## Download (Retrieve) Event

Event payload shape:

```json
{
  "retrieve": {
    "found_local": 42,
    "retrieve_ms": 2000,
    "decode_ms": 8000,
    "calls_by_ip": {
      "10.0.0.7": [
        {"ip": "10.0.0.7", "address": "B:4445", "keys": 13, "success": true, "duration_ms": 90}
      ]
    }
  }
}
```
| 99 | + |
### Fields

- `retrieve.found_local`
  - Meaning: Number of items retrieved from local storage before any network calls.
  - Where captured: `p2p/kademlia/dht.go::BatchRetrieve(...)`
    - After `fetchAndAddLocalKeys`, we call `p2pmetrics.ReportFoundLocal(taskID, int(foundLocalCount))`.
    - `taskID` is read from the context with `p2pmetrics.TaskIDFromContext(ctx)`.

- `retrieve.retrieve_ms`
  - Meaning: Time spent in the network batch‑retrieve.
  - Where captured: `supernode/services/cascade/download.go`
    - A timestamp is taken before `BatchRetrieve`; the elapsed time is measured after it returns.

- `retrieve.decode_ms`
  - Meaning: Time spent decoding symbols and reconstructing the file.
  - Where captured: `supernode/services/cascade/download.go`
    - A timestamp is taken before decode; the elapsed time is measured after it returns.

- `retrieve.calls_by_ip`
  - Meaning: All raw per‑RPC retrieve attempts, grouped by node IP.
  - Each array entry is a single RPC attempt with:
    - `ip`, `address` — Identifiers, as available.
    - `keys` — Number of symbols returned by that node in that call.
    - `success` — True if `keys > 0`.
    - `error` — Error string when the RPC failed; omitted otherwise.
    - `duration_ms` — RPC duration in milliseconds.
  - Where captured:
    - Emission point (P2P): `p2p/kademlia/dht.go::iterateBatchGetValues(...)`
      - Each node RPC records a `p2pmetrics.RecordRetrieve(taskID, Call{...})`.
      - `taskID` is extracted from the context using `p2pmetrics.TaskIDFromContext(ctx)`.
    - Grouping: `pkg/p2pmetrics/metrics.go` (same grouping/fallback as store).
| 131 | + |
### Scope Limits

- Metrics are captured only for the active Download call (the context is tagged in `download.go`).

---
| 137 | + |
## Context Tagging (Task ID)

- We use an explicit, metrics‑only context key defined in `pkg/p2pmetrics` to tag P2P calls with a task ID.
  - Setter: `p2pmetrics.WithTaskID(ctx, id)`.
  - Getter: `p2pmetrics.TaskIDFromContext(ctx)`.
- Where it is set:
  - Store (first pass): `supernode/services/cascade/adaptors/p2p.go` wraps `StoreBatch` calls.
  - Download: `supernode/services/cascade/download.go` wraps the `BatchRetrieve` call.
| 146 | + |
---

## Building and Emitting Events

- Store
  - `supernode/services/cascade/helper.go::emitArtefactsStored(...)`
    - Builds the `store` payload via `p2pmetrics.BuildStoreEventPayloadFromCollector(taskID)`.
    - Emits the event.

- Download
  - `supernode/services/cascade/download.go`
    - Builds the `retrieve` payload via `p2pmetrics.BuildDownloadEventPayloadFromCollector(actionID)`.
    - Emits the event.
| 160 | + |
---

## Quick File Map

- Capture + grouping: `pkg/p2pmetrics/metrics.go`
- Store adaptor: `supernode/services/cascade/adaptors/p2p.go`
- Store event: `supernode/services/cascade/helper.go`
- Download flow: `supernode/services/cascade/download.go`
- DHT store calls: `p2p/kademlia/dht.go::IterateBatchStore`
- DHT retrieve calls: `p2p/kademlia/dht.go::BatchRetrieve` and `iterateBatchGetValues`

---

## Notes

- No P2P stats/snapshots are used to build events.
- No aggregation is performed; we only group raw RPC attempts by IP.
- First‑pass success rate is enforced internally (75% threshold) but not emitted as a metric.