Skip to content

Commit 51317e6

Browse files
feat: Enhance call metrics and documentation
1 parent bab32bc commit 51317e6

File tree

22 files changed

+909
-491
lines changed

22 files changed

+909
-491
lines changed

docs/p2p-metrics-capture.md

Lines changed: 178 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,178 @@
1+
# P2P Metrics Capture — What Each Field Means and Where It’s Collected
2+
3+
This guide explains every field we emit in Cascade events, how it is measured, and exactly where it is captured in the code.
4+
5+
The design is minimal by intent:
6+
- Metrics are collected only for the first pass of Register (store) and for the active Download operation.
7+
- P2P APIs return errors only; per‑RPC details are captured via a small metrics package (`pkg/p2pmetrics`).
8+
- No aggregation; we only group raw RPC attempts by IP.
9+
10+
---
11+
12+
## Store (Register) Event
13+
14+
Event payload shape
15+
16+
```json
17+
{
18+
"store": {
19+
"duration_ms": 9876,
20+
"symbols_first_pass": 220,
21+
"symbols_total": 1200,
22+
"id_files_count": 14,
23+
"calls_by_ip": {
24+
"10.0.0.5": [
25+
{"ip": "10.0.0.5", "address": "A:4445", "keys": 100, "success": true, "duration_ms": 120},
26+
{"ip": "10.0.0.5", "address": "A:4445", "keys": 120, "success": false, "error": "timeout", "duration_ms": 300}
27+
]
28+
}
29+
}
30+
}
31+
```
32+
33+
### Fields
34+
35+
- `store.duration_ms`
36+
37+
- `store.symbols_first_pass`
38+
- Meaning: Number of symbols sent during the Register first pass (across the combined first batch and any immediate first‑pass symbol batches).
39+
- Where captured: `supernode/services/cascade/adaptors/p2p.go` via `p2pmetrics.SetStoreSummary(...)` using the value returned by `storeCascadeSymbolsAndData`.
40+
41+
- `store.symbols_total`
42+
- Meaning: Total symbols available in the symbol directory (before sampling). Used to contextualize the first‑pass coverage.
43+
- Where captured: Computed in `storeCascadeSymbolsAndData` and included in `SetStoreSummary`.
44+
45+
- `store.id_files_count`
46+
- Meaning: Number of redundant metadata files (ID files) sent in the first combined batch.
47+
- Where captured: `len(req.IDFiles)` in `StoreArtefacts`, passed to `SetStoreSummary`.
48+
- Meaning: End‑to‑end elapsed time of the first‑pass store phase (Register’s storage section only).
49+
- Where captured: `supernode/services/cascade/adaptors/p2p.go`
50+
- A `time.Now()` timestamp is taken just before the first‑pass store function and measured on return.
51+
52+
- `store.calls_by_ip`
53+
- Meaning: All raw network store RPC attempts grouped by the node IP.
54+
- Each array entry is a single RPC attempt with:
55+
- `ip` — Node IP (fallback to `address` if missing).
56+
- `address` — Node string `IP:port`.
57+
- `keys` — Number of items in that RPC attempt (metadata + first symbols for the first combined batch, symbols for subsequent batches within the first pass).
58+
- `success` — True if the node acknowledged the store successfully.
59+
- `error` — Any error string captured; omitted when success.
60+
- `duration_ms` — RPC duration in milliseconds.
61+
- Where captured:
62+
- Emission point (P2P): `p2p/kademlia/dht.go::IterateBatchStore(...)`
63+
- After each node RPC returns, we call `p2pmetrics.RecordStore(taskID, Call{...})`.
64+
- `taskID` is read from the context via `p2pmetrics.TaskIDFromContext(ctx)`.
65+
- Grouping: `pkg/p2pmetrics/metrics.go`
66+
- `StartStoreCapture(taskID)` enables capture; `StopStoreCapture(taskID)` disables it.
67+
- Calls are grouped by `ip` (fallback to `address`) without further aggregation.
68+
69+
### First‑Pass Success Threshold
70+
71+
- Internal enforcement only: if DHT first‑pass success rate is below 75%, `IterateBatchStore` returns an error.
72+
- No success rate is emitted in events; only error flow is affected.
73+
- Code: `p2p/kademlia/dht.go::IterateBatchStore`.
74+
75+
### Scope Limits
76+
77+
- Background worker (which continues storing remaining symbols) is NOT captured — we don’t set a metrics task ID on those paths.
78+
79+
---
80+
81+
## Download (Retrieve) Event
82+
83+
Event payload shape
84+
85+
```json
86+
{
87+
"retrieve": {
88+
"found_local": 42,
89+
"retrieve_ms": 2000,
90+
"decode_ms": 8000,
91+
"calls_by_ip": {
92+
"10.0.0.7": [
93+
{"ip": "10.0.0.7", "address": "B:4445", "keys": 13, "success": true, "duration_ms": 90}
94+
]
95+
}
96+
}
97+
}
98+
```
99+
100+
### Fields
101+
102+
- `retrieve.found_local`
103+
- Meaning: Number of items retrieved from local storage before any network calls.
104+
- Where captured: `p2p/kademlia/dht.go::BatchRetrieve(...)`
105+
- After `fetchAndAddLocalKeys`, we call `p2pmetrics.ReportFoundLocal(taskID, int(foundLocalCount))`.
106+
- `taskID` is read from context with `p2pmetrics.TaskIDFromContext(ctx)`.
107+
108+
- `retrieve.retrieve_ms`
109+
- Meaning: Time spent in network batch‑retrieve.
110+
- Where captured: `supernode/services/cascade/download.go`
111+
- Timestamp before `BatchRetrieve`, measured after it returns.
112+
113+
- `retrieve.decode_ms`
114+
- Meaning: Time spent decoding symbols and reconstructing the file.
115+
- Where captured: `supernode/services/cascade/download.go`
116+
- Timestamp before decode, measured after it returns.
117+
118+
- `retrieve.calls_by_ip`
119+
- Meaning: All raw per‑RPC retrieve attempts grouped by node IP.
120+
- Each array entry is a single RPC attempt with:
121+
- `ip`, `address` — Identifiers as available.
122+
- `keys` — Number of symbols returned by that node in that call.
123+
- `success` — True if `keys > 0`.
124+
- `error` — Error string when the RPC failed; omitted otherwise.
125+
- `duration_ms` — RPC duration in milliseconds.
126+
- Where captured:
127+
- Emission point (P2P): `p2p/kademlia/dht.go::iterateBatchGetValues(...)`
128+
- Each node RPC records a `p2pmetrics.RecordRetrieve(taskID, Call{...})`.
129+
- `taskID` is extracted from context using `p2pmetrics.TaskIDFromContext(ctx)`.
130+
- Grouping: `pkg/p2pmetrics/metrics.go` (same grouping/fallback as store).
131+
132+
### Scope Limits
133+
134+
- Metrics are captured only for the active Download call (context is tagged in `download.go`).
135+
136+
---
137+
138+
## Context Tagging (Task ID)
139+
140+
- We use an explicit, metrics‑only context key defined in `pkg/p2pmetrics` to tag P2P calls with a task ID.
141+
- Setters: `p2pmetrics.WithTaskID(ctx, id)`.
142+
- Getters: `p2pmetrics.TaskIDFromContext(ctx)`.
143+
- Where it is set:
144+
- Store (first pass): `supernode/services/cascade/adaptors/p2p.go` wraps `StoreBatch` calls.
145+
- Download: `supernode/services/cascade/download.go` wraps `BatchRetrieve` call.
146+
147+
---
148+
149+
## Building and Emitting Events
150+
151+
- Store
152+
- `supernode/services/cascade/helper.go::emitArtefactsStored(...)`
153+
- Builds `store` payload via `p2pmetrics.BuildStoreEventPayloadFromCollector(taskID)`.
154+
- Emits the event.
155+
156+
- Download
157+
- `supernode/services/cascade/download.go`
158+
- Builds `retrieve` payload via `p2pmetrics.BuildDownloadEventPayloadFromCollector(actionID)`.
159+
- Emits the event.
160+
161+
---
162+
163+
## Quick File Map
164+
165+
- Capture + grouping: `pkg/p2pmetrics/metrics.go`
166+
- Store adaptor: `supernode/services/cascade/adaptors/p2p.go`
167+
- Store event: `supernode/services/cascade/helper.go`
168+
- Download flow: `supernode/services/cascade/download.go`
169+
- DHT store calls: `p2p/kademlia/dht.go::IterateBatchStore`
170+
- DHT retrieve calls: `p2p/kademlia/dht.go::BatchRetrieve` and `iterateBatchGetValues`
171+
172+
---
173+
174+
## Notes
175+
176+
- No P2P stats/snapshots are used to build events.
177+
- No aggregation is performed; we only group raw RPC attempts by IP.
178+
- First‑pass success rate is enforced internally (75% threshold) but not emitted as a metric.

p2p/client.go

Lines changed: 2 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -22,17 +22,8 @@ type Client interface {
2222
// - the base58 encoded identifier will be returned
2323
Store(ctx context.Context, data []byte, typ int) (string, error)
2424

25-
// StoreBatch will store a batch of values with their Blake3 hash as the key.
26-
//
27-
// Semantics:
28-
// - Returns `successRatePct` as a percentage (0–100) computed as
29-
// successful node RPCs divided by total node RPCs attempted during the
30-
// network store phase for this batch.
31-
// - Returns `requests` as the total number of node RPCs attempted for this
32-
// batch (not the number of items in `values`).
33-
// - On error, `successRatePct` and `requests` may reflect partial progress;
34-
// prefer using them only when err == nil, or treat as best‑effort metrics.
35-
StoreBatch(ctx context.Context, values [][]byte, typ int, taskID string) (float64, int, error)
25+
// StoreBatch stores a batch of values; returns error only.
26+
StoreBatch(ctx context.Context, values [][]byte, typ int, taskID string) error
3627

3728
// Delete a key, value
3829
Delete(ctx context.Context, key string) error

0 commit comments

Comments
 (0)