
feat(grafana): refresh Walrus storage node dashboard #5

Open
i-love-to-code wants to merge 1 commit into bartosian:main from veera-dao:feat/update-storage-dashboard


@i-love-to-code
Contributor

This PR updates the Walrus storage node Grafana dashboard so that it matches the metrics the node currently exposes, and fixes a broken panel. It also includes an updated screenshot for the docs.

Fixes

  • HTTP Request Duration: Switched to http_server_request_duration_seconds_* (replacing the non-existent http_request_duration_seconds_*) and fixed the legend to use labels that actually exist on the metric (http_request_method, http_route, http_response_status_code), so the legend no longer shows “Name:”.
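
For reviewers who want to check the fixed panel against their own Prometheus, a query along these lines should return data. This is a minimal sketch, not the exact expression from the dashboard JSON: it assumes the metric is a histogram (so a `_bucket` series exists), and the 0.95 quantile is chosen only for illustration.

```promql
# Assumption: http_server_request_duration_seconds is a histogram,
# so the _bucket series and its `le` label are available.
# Assumption: 0.95 is an illustrative quantile; the panel may use another.
histogram_quantile(
  0.95,
  sum by (le, http_request_method, http_route, http_response_status_code) (
    rate(http_server_request_duration_seconds_bucket[$__rate_interval])
  )
)
```

A matching Grafana legend format built from the same labels would be something like `{{http_request_method}} {{http_route}} ({{http_response_status_code}})`. Because the labels are kept in the `sum by (...)` clause, each of them resolves to a value in the legend.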

New panels

  • Overview: Shards Owned, Runtime Lag, Checkpoint Parsing Errors.
  • Garbage collection: GC last completed epoch, expired blobs deleted, blob data deletion attempts (by status).
  • Recovery: Ongoing Blob Syncs, Recover Blob Progress, Sync Metadata Progress; Blob Recovery Backlog timeline set to full width.
  • Sui RPC & backlogs: Sui RPC call duration (time series), Sui RPC retries (Primary / Failover), Pending Metadata Cache, Pending Sliver Cache.
  • RocksDB: Background errors, Compaction pending, Block cache usage (%).

Descriptions and clarity

  • Runtime vs checkpoint lag: Panel descriptions now spell out that checkpoint lag comes from the checkpoint downloader (ingestion vs. full node), that runtime lag comes from runtime monitoring (health/slashing), and why the two can differ.
  • Checkpoint Lag Over Time: Description now states that it is the same metric as the Checkpoint Lag stat and explains how to interpret spikes.
  • Sui RPC Retries: Description explains Primary vs Failover; display names set to “Primary” and “Failover”; threshold-based coloring removed.
  • Latency Distribution: Title set to “Latency Distribution (95th percentile)”; legend shortened to span name only.
  • RocksDB Compaction Pending: Threshold-based coloring removed.
  • Block Read Rate / Get Read Bytes: Descriptions now note that both metrics are counters displayed as a rate in MiB/s, and that the “metric might not be a counter” warning can be ignored when the scrape exposes them as counters.
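
To illustrate the counter-to-MiB/s conversion mentioned for the Block Read Rate and Get Read Bytes panels, a rough sketch follows. The metric name `rocksdb_block_read_bytes` is hypothetical and stands in for whichever byte counter the node actually exports; the dashboard's own expression may differ.

```promql
# Hypothetical metric name: substitute the byte counter the node exports.
# rate() yields bytes per second; dividing by 1024 * 1024 gives MiB/s.
rate(rocksdb_block_read_bytes[$__rate_interval]) / (1024 * 1024)
```

The “metric might not be a counter” hint is usually based on the metric name lacking a conventional counter suffix such as `_total`, so it can appear even when the underlying series is a genuine counter.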

Align dashboard with current Prometheus metrics:
* Fix broken HTTP panel.
* Add GC, recovery, Sui RPC, and RocksDB panels.
* Improve descriptions and legends.