Phase 6: Monitoring & Observability #1099

@sob

Overview

Integrate monitoring for the DNS resolver system into the existing Prometheus/Grafana stack: metrics collection, dashboards, alert rules, and SLO tracking.

Parent Issue: #1093

Tasks

  • Create Grafana dashboard for DNS overview
  • Create DNS performance dashboard
  • Create upstream health dashboard
  • Set up PrometheusRule for alerts
  • Configure dashboard auto-provisioning
  • Create runbook documentation
  • Set up SLO tracking
  • Configure log aggregation

Metrics to Implement

Unbound Metrics

  • Query rate (queries/second)
  • Cache hit rate (percentage)
  • Query latency (histogram)
  • DNSSEC validation rate
  • Error rates by type
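
A possible starting point for the query rate, cache hit rate, and latency metrics above, written as Prometheus recording rules. This is only a sketch: the metric names (`unbound_queries_total`, `unbound_cache_hits_total`, `unbound_cache_misses_total`, `unbound_response_time_seconds_bucket`) assume an unbound_exporter-style exporter and should be checked against whatever exporter is actually deployed. The resource name, namespace, and recorded series names are hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dns-unbound-recording      # hypothetical name
  namespace: monitoring            # assumption: namespace of the existing stack
spec:
  groups:
    - name: dns.unbound.recording
      interval: 30s                # assumption: evaluation interval
      rules:
        # Query rate (queries/second), summed over all Unbound threads/pods
        - record: dns:unbound_queries:rate5m
          expr: sum(rate(unbound_queries_total[5m]))
        # Cache hit rate as a ratio in [0, 1]; multiply by 100 for a percentage
        - record: dns:unbound_cache_hit_ratio:rate5m
          expr: |
            sum(rate(unbound_cache_hits_total[5m]))
            /
            (
              sum(rate(unbound_cache_hits_total[5m]))
              + sum(rate(unbound_cache_misses_total[5m]))
            )
        # p95 query latency in seconds, derived from the latency histogram
        - record: dns:unbound_latency_seconds:p95_5m
          expr: |
            histogram_quantile(
              0.95,
              sum by (le) (rate(unbound_response_time_seconds_bucket[5m]))
            )
```

Recording these once keeps dashboard panels and alert expressions cheap and consistent, but the same expressions can also be used inline.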

DNSCrypt-proxy Metrics

  • Upstream resolver health
  • Relay connection status
  • DoH request rate
  • Certificate rotation events
  • Anonymization effectiveness

System Metrics

  • Pod CPU/memory usage
  • PVC usage (cache size)
  • Network throughput
  • Service availability

Grafana Dashboards

DNS Overview Dashboard

  • Total query rate
  • Cache hit rate gauge
  • Top queried domains
  • Query type distribution
  • Error rate trends

Performance Dashboard

  • Query latency percentiles
  • Cache performance metrics
  • Upstream response times
  • Resource utilization

Health Dashboard

  • Upstream availability
  • Relay connection status
  • Certificate expiry countdown
  • Pod health status

Alert Rules

Critical Alerts

  • DNS service down
  • Query failure rate >5%
  • Certificate expiring in <7 days
  • No healthy upstreams

Warning Alerts

  • Cache hit rate <60%
  • High query latency (p95 >100 ms)
  • PVC usage >80%
  • Upstream degraded
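
A rough sketch of how some of the thresholds above could look in `prometheusrule.yaml`. Metric names again assume an unbound_exporter-style exporter (`unbound_up`, `unbound_answer_rcodes_total`, and so on) plus the standard kubelet volume metrics. The `for:` durations, the `release` label, and the `dns` namespace are assumptions, not decisions. The certificate and upstream alerts are left out because their expressions depend on how DNSCrypt-proxy metrics end up being exported.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dns-alerts                     # hypothetical name
  namespace: monitoring                # assumption
  labels:
    release: kube-prometheus-stack     # assumption: must match the Prometheus ruleSelector
spec:
  groups:
    - name: dns.critical
      rules:
        - alert: DNSServiceDown
          # Fires when the exporter reports Unbound down or the series is missing
          expr: unbound_up == 0 or absent(unbound_up)
          for: 2m                      # assumption
          labels:
            severity: critical
        - alert: DNSQueryFailureRateHigh
          # SERVFAIL answers as a share of all queries, threshold 5%
          expr: |
            sum(rate(unbound_answer_rcodes_total{rcode="SERVFAIL"}[5m]))
            /
            sum(rate(unbound_queries_total[5m]))
            > 0.05
          for: 5m                      # assumption
          labels:
            severity: critical
    - name: dns.warning
      rules:
        - alert: DNSCacheHitRateLow
          # Cache hit rate below 60%
          expr: |
            sum(rate(unbound_cache_hits_total[5m]))
            /
            (
              sum(rate(unbound_cache_hits_total[5m]))
              + sum(rate(unbound_cache_misses_total[5m]))
            )
            < 0.60
          for: 15m                     # assumption
          labels:
            severity: warning
        - alert: DNSQueryLatencyHigh
          # p95 latency above 100 ms (histogram is in seconds)
          expr: |
            histogram_quantile(
              0.95,
              sum by (le) (rate(unbound_response_time_seconds_bucket[5m]))
            )
            > 0.1
          for: 10m                     # assumption
          labels:
            severity: warning
        - alert: DNSCachePVCUsageHigh
          # PVC usage above 80%; "dns" namespace is hypothetical
          expr: |
            kubelet_volume_stats_used_bytes{namespace="dns"}
            /
            kubelet_volume_stats_capacity_bytes{namespace="dns"}
            > 0.80
          for: 15m                     # assumption
          labels:
            severity: warning
```

Each alert would also want a `runbook_url` annotation once the runbook exists.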

Files to Create

shared/app/monitoring/
├── dashboards/
│   ├── dns-overview.json
│   ├── dns-performance.json
│   └── dns-health.json
├── prometheusrule.yaml
├── configmap-dashboards.yaml
└── slo.yaml
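
For `configmap-dashboards.yaml`, one option is the Grafana sidecar convention, where ConfigMaps carrying a marker label are picked up and loaded automatically. The sketch below assumes the existing stack runs that sidecar with the common `grafana_dashboard: "1"` label; the label, namespace, and ConfigMap name are assumptions to verify against the actual Grafana deployment. The embedded JSON is a placeholder, not the real dashboard.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-dashboards             # hypothetical name
  namespace: monitoring            # assumption
  labels:
    grafana_dashboard: "1"         # assumption: label the dashboard sidecar watches for
data:
  # One key per dashboard file; content would come from dashboards/dns-overview.json
  dns-overview.json: |
    {
      "uid": "dns-overview",
      "title": "DNS Overview",
      "tags": ["dns"],
      "panels": []
    }
```

The other two dashboards (`dns-performance.json`, `dns-health.json`) would be added as further keys, or the ConfigMap could be generated from the `dashboards/` directory instead of being maintained by hand.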

Acceptance Criteria

  • All dashboards loading in Grafana
  • Metrics being collected and stored
  • Alerts firing correctly in test scenarios
  • SLOs defined and tracked
  • Runbook accessible and complete
  • Historical data retention working
  • Dashboard variables functioning
