Overview
Integrate comprehensive monitoring for the DNS resolver system using the existing Prometheus/Grafana stack.
Parent Issue: #1093
Tasks
Metrics to Implement
Unbound Metrics
- Query rate (queries/second)
- Cache hit rate (percentage)
- Query latency (histogram)
- DNSSEC validation rate
- Error rates by type
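As a rough illustration of how the query rate and cache hit rate above could be derived, the sketch below parses `key=value` counter lines of the kind printed by `unbound-control stats_noreset`. The counter names (`total.num.queries`, `total.num.cachehits`, `total.num.cachemiss`) are assumed to match the deployed Unbound version, and the helper names are hypothetical; the actual exporter may expose these values differently.

```python
def parse_stats(text: str) -> dict[str, float]:
    """Parse 'key=value' lines into a dict, skipping malformed lines."""
    stats: dict[str, float] = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if not sep:
            continue  # not a key=value line
        try:
            stats[key] = float(value)
        except ValueError:
            continue  # non-numeric value, ignore
    return stats


def cache_hit_rate(stats: dict[str, float]) -> float:
    """Cache hit rate as a percentage of cache lookups (hits + misses)."""
    hits = stats.get("total.num.cachehits", 0.0)
    misses = stats.get("total.num.cachemiss", 0.0)
    total = hits + misses
    return 100.0 * hits / total if total else 0.0


def query_rate(prev: dict[str, float], curr: dict[str, float],
               interval_seconds: float) -> float:
    """Queries per second between two cumulative counter samples."""
    delta = curr.get("total.num.queries", 0.0) - prev.get("total.num.queries", 0.0)
    if interval_seconds <= 0 or delta < 0:
        return 0.0  # bad interval or counter reset; report no rate
    return delta / interval_seconds
```

In Prometheus itself the same quantities would normally come from `rate()` over the exported counters; the Python version only makes the arithmetic explicit.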
DNSCrypt-proxy Metrics
- Upstream resolver health
- Relay connection status
- DoH request rate
- Certificate rotation events
- Anonymization effectiveness
System Metrics
- Pod CPU/memory usage
- PVC usage (cache size)
- Network throughput
- Service availability
Grafana Dashboards
DNS Overview Dashboard
- Total query rate
- Cache hit rate gauge
- Top queried domains
- Query type distribution
- Error rate trends
Performance Dashboard
- Query latency percentiles
- Cache performance metrics
- Upstream response times
- Resource utilization
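The latency percentiles on this dashboard would be estimated from histogram buckets rather than from raw samples. The sketch below shows one way to do that estimate, using linear interpolation inside the bucket that contains the requested rank, which is broadly the idea behind PromQL's `histogram_quantile`. The function name and bucket layout are illustrative assumptions, not the exporter's actual schema.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` holds (upper_bound, cumulative_count) pairs; the last
    bucket is expected to have an upper bound of +Inf.
    """
    if not buckets or not 0.0 <= q <= 1.0:
        raise ValueError("need at least one bucket and 0 <= q <= 1")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in ordered:
        if count >= rank:
            if math.isinf(upper):
                return prev_bound  # cannot interpolate into +Inf
            if count == prev_count:
                return upper  # empty bucket, only reachable when rank is 0
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound
```

For example, with buckets at 10, 50, and 100 ms holding cumulative counts of 50, 90, and 98 out of 100 observations, the p95 estimate falls inside the 50 to 100 ms bucket. Bucket boundaries therefore bound the accuracy of any percentile panel, so they should be chosen around the 100 ms alert threshold.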
Health Dashboard
- Upstream availability
- Relay connection status
- Certificate expiry countdown
- Pod health status
Alert Rules
Critical Alerts
- DNS service down
- Query failure rate >5%
- Certificate expiring <7 days
- No healthy upstreams
Warning Alerts
- Cache hit rate <60%
- High query latency (>100ms p95)
- PVC usage >80%
- Upstream degraded
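The thresholds listed above can be expressed as a small evaluation function, which may be useful for unit-testing the alert logic before it is written as Prometheus rules. This is a minimal sketch: the `Snapshot` fields are hypothetical names, and "Upstream degraded" is assumed here to mean that some, but not all, upstreams are healthy, which the issue does not define.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One point-in-time view of the resolver (field names are illustrative)."""
    service_up: bool
    query_failure_pct: float
    cert_days_remaining: float
    healthy_upstreams: int
    total_upstreams: int
    cache_hit_pct: float
    latency_p95_ms: float
    pvc_usage_pct: float


def evaluate(s: Snapshot) -> dict[str, list[str]]:
    """Return the critical and warning alerts that would fire for a snapshot."""
    critical: list[str] = []
    warning: list[str] = []

    if not s.service_up:
        critical.append("DNS service down")
    if s.query_failure_pct > 5:
        critical.append("Query failure rate >5%")
    if s.cert_days_remaining < 7:
        critical.append("Certificate expiring <7 days")
    if s.healthy_upstreams == 0:
        critical.append("No healthy upstreams")

    if s.cache_hit_pct < 60:
        warning.append("Cache hit rate <60%")
    if s.latency_p95_ms > 100:
        warning.append("High query latency (>100ms p95)")
    if s.pvc_usage_pct > 80:
        warning.append("PVC usage >80%")
    # Assumption: "degraded" means a partial upstream outage.
    if 0 < s.healthy_upstreams < s.total_upstreams:
        warning.append("Upstream degraded")

    return {"critical": critical, "warning": warning}
```

Real alert rules would also need a `for:` duration so that brief spikes do not page anyone; the sketch deliberately ignores time.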
Files to Create
shared/app/monitoring/
├── dashboards/
│ ├── dns-overview.json
│ ├── dns-performance.json
│ └── dns-health.json
├── prometheusrule.yaml
├── configmap-dashboards.yaml
└── slo.yaml
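One possible way to produce the dashboards ConfigMap is to generate it from the dashboard JSON files instead of maintaining it by hand. The sketch below builds the manifest as a plain dictionary. The ConfigMap name, the namespace, and the `grafana_dashboard: "1"` label (a common convention for Grafana's dashboard sidecar) are assumptions that would need to match the existing stack's configuration.

```python
import json


def build_dashboard_configmap(
    dashboards: dict[str, dict],
    name: str = "dns-dashboards",      # hypothetical ConfigMap name
    namespace: str = "monitoring",     # hypothetical namespace
) -> dict:
    """Build a ConfigMap manifest embedding each dashboard as a JSON string."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Assumed label; the sidecar's configured label may differ.
            "labels": {"grafana_dashboard": "1"},
        },
        "data": {
            filename: json.dumps(body, indent=2)
            for filename, body in sorted(dashboards.items())
        },
    }


if __name__ == "__main__":
    manifest = build_dashboard_configmap(
        {"dns-overview.json": {"title": "DNS Overview", "panels": []}}
    )
    # JSON is also valid YAML, so this output can be saved and applied as-is.
    print(json.dumps(manifest, indent=2))
```

Generating the manifest keeps the files under dashboards/ as the single source of truth and avoids drift between them and configmap-dashboards.yaml.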
Acceptance Criteria