Observability
Version: 1.0 Last Updated: 2026-01-22
This guide covers monitoring, metrics, tracing, and debugging for Grimnir Radio in production environments.
Grimnir Radio provides comprehensive observability through:
- Prometheus Metrics - Real-time performance and health metrics
- OpenTelemetry Tracing - End-to-end request tracing
- Structured Logging - JSON logs with context and correlation IDs
- Event Bus - Real-time event streaming for monitoring
┌─────────────────────────────────────────────────────────────────┐
│                     Grimnir Radio Instances                     │
│  [API] → [Scheduler] → [Executor] → [Media Engine]              │
│    ↓          ↓            ↓              ↓                     │
│ Metrics    Metrics      Metrics        Metrics                  │
│ Traces     Traces       Traces         Traces                   │
│ Logs       Logs         Logs           Logs                     │
└────┬──────────┬────────────┬──────────────┬─────────────────────┘
     │          │            │              │
     ▼          ▼            ▼              ▼
┌─────────────────────────────────────────────────────────────────┐
│                       Observability Stack                       │
│                                                                 │
│  ┌─────────────┐  ┌──────────────┐  ┌────────────────┐          │
│  │ Prometheus  │  │ Jaeger/Tempo │  │    Loki/ELK    │          │
│  │  (Metrics)  │  │   (Traces)   │  │     (Logs)     │          │
│  └──────┬──────┘  └──────┬───────┘  └───────┬────────┘          │
│         │                │                  │                   │
│         └────────────────┼──────────────────┘                   │
│                          │                                      │
│                    ┌─────▼──────┐                               │
│                    │  Grafana   │                               │
│                    │ (Dashboard)│                               │
│                    └────────────┘                               │
└─────────────────────────────────────────────────────────────────┘
All Grimnir Radio instances expose Prometheus metrics at:
GET http://localhost:8080/metrics
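
To show what a scrape of this endpoint returns, here is a minimal sketch that parses Prometheus text exposition into samples. The `SAMPLE` text and its values are made up for illustration, and `parse_metrics` is a hypothetical helper, not part of Grimnir Radio. It ignores timestamps and does not unescape label values.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative payload only; real output comes from GET /metrics.
SAMPLE = """\
# HELP grimnir_scheduler_ticks_total Total scheduler ticks
# TYPE grimnir_scheduler_ticks_total counter
grimnir_scheduler_ticks_total 1284
grimnir_schedule_entries_total{station_id="station-1"} 48
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

Against a running instance, the same function could be fed the body fetched with `urllib.request.urlopen("http://localhost:8080/metrics")`.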
Scheduler metrics:

| Metric | Type | Labels | Description |
|---|---|---|---|
| `grimnir_schedule_build_duration_seconds` | Histogram | `station_id` | Time to generate schedule |
| `grimnir_schedule_entries_total` | Gauge | `station_id` | Number of schedule entries |
| `grimnir_smart_block_materialize_duration_seconds` | Histogram | `station_id`, `smart_block_id` | Smart block generation time |
| `grimnir_scheduler_ticks_total` | Counter | - | Total scheduler ticks |
| `grimnir_scheduler_errors_total` | Counter | `station_id`, `error_type` | Scheduler errors |

Example Query:
# Average schedule build time over 5 minutes
rate(grimnir_schedule_build_duration_seconds_sum[5m]) /
rate(grimnir_schedule_build_duration_seconds_count[5m])
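
The query above divides the growth of the histogram's `_sum` by the growth of its `_count`; the window length cancels out, leaving the mean build time. A small sketch of the same arithmetic, with hypothetical sample values:

```python
def average_duration(sum_start, sum_end, count_start, count_end):
    """Mean observation over a window, from histogram _sum/_count deltas."""
    observations = count_end - count_start
    if observations <= 0:
        return None  # no builds in the window (PromQL would yield NaN)
    return (sum_end - sum_start) / observations


# Hypothetical window: 30 builds that took 1.5 seconds in total.
print(average_duration(12.0, 13.5, 100, 130))
```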
Executor metrics:

| Metric | Type | Labels | Description |
|---|---|---|---|
| `grimnir_executor_state` | Gauge | `station_id`, `executor_id` | Current state (0-5) |
| `grimnir_playout_buffer_depth_samples` | Gauge | `station_id`, `mount_id` | Buffer depth in samples |
| `grimnir_playout_dropout_count_total` | Counter | `station_id`, `mount_id` | Underrun count |
| `grimnir_playout_cpu_usage_percent` | Gauge | `station_id`, `mount_id` | CPU usage |
| `grimnir_executor_state_transitions_total` | Counter | `station_id`, `from_state`, `to_state` | State changes |
| `grimnir_executor_priority_changes_total` | Counter | `station_id`, `from_priority`, `to_priority` | Priority changes |

Executor States:
- 0 = Idle
- 1 = Preloading
- 2 = Playing
- 3 = Fading
- 4 = Live
- 5 = Emergency
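
When consuming `grimnir_executor_state` in scripts, the numeric values listed above can be mapped back to names. A minimal sketch; `ExecutorState` and `describe_state` are hypothetical names used only for this example:

```python
from enum import IntEnum


class ExecutorState(IntEnum):
    """Numeric executor states as documented in the list above."""
    IDLE = 0
    PRELOADING = 1
    PLAYING = 2
    FADING = 3
    LIVE = 4
    EMERGENCY = 5


def describe_state(value):
    """Translate a gauge value into a readable state name."""
    try:
        return ExecutorState(int(value)).name.title()
    except ValueError:
        return f"Unknown ({value})"


print(describe_state(2.0))
print(describe_state(9))
```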
Example Query:
# Dropout rate per minute
rate(grimnir_playout_dropout_count_total[1m]) * 60
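
`rate()` returns a per-second value, which is why the query multiplies by 60. The sketch below reproduces the idea on raw counter samples, treating any decrease as a counter reset. It is a simplification: Prometheus additionally extrapolates to the window boundaries, and the sample values here are invented.

```python
def increase(samples):
    """Total counter increase across samples, treating a drop as a reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so count `cur` itself.
        total += (cur - prev) if cur >= prev else cur
    return total


def per_minute(samples, window_seconds):
    """Average increase per minute over the sampled window."""
    return increase(samples) * 60 / window_seconds


# Hypothetical dropout counter scraped over 60 seconds, with one restart.
print(per_minute([10, 12, 13, 1, 3], 60))
```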
Media engine metrics:

| Metric | Type | Labels | Description |
|---|---|---|---|
| `grimnir_media_engine_loudness_lufs` | Gauge | `station_id`, `mount_id` | Current LUFS level |
| `grimnir_media_engine_output_health` | Gauge | `station_id`, `mount_id`, `output_type` | Output status (0/1) |
| `grimnir_media_engine_connection_status` | Gauge | `executor_id` | gRPC connection status |
| `grimnir_media_engine_pipeline_restarts_total` | Counter | `station_id`, `mount_id`, `reason` | Pipeline restart count |
| `grimnir_media_engine_audio_level_left_db` | Gauge | `station_id`, `mount_id` | Left channel level (dB) |
| `grimnir_media_engine_audio_level_right_db` | Gauge | `station_id`, `mount_id` | Right channel level (dB) |
| `grimnir_media_engine_operations_total` | Counter | `station_id`, `mount_id`, `operation`, `status` | Operation counts |
| `grimnir_media_engine_operation_duration_seconds` | Histogram | `station_id`, `mount_id`, `operation` | Operation latency |
| `grimnir_media_engine_playback_state` | Gauge | `station_id`, `mount_id` | Playback state (0-6) |
| `grimnir_media_engine_active_pipelines` | Gauge | `station_id`, `mount_id` | Active pipeline count |

Example Query:
# Average loudness across all stations
avg(grimnir_media_engine_loudness_lufs)
API metrics:

| Metric | Type | Labels | Description |
|---|---|---|---|
| `grimnir_api_request_duration_seconds` | Histogram | `method`, `endpoint`, `status_code` | Request latency |
| `grimnir_api_requests_total` | Counter | `method`, `endpoint`, `status_code` | Request count |
| `grimnir_api_active_connections` | Gauge | - | Active HTTP connections |
| `grimnir_api_websocket_connections` | Gauge | - | Active WebSocket connections |

Example Query:
# 95th percentile API latency
histogram_quantile(0.95,
rate(grimnir_api_request_duration_seconds_bucket[5m]))
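
`histogram_quantile` does not see individual request times; it estimates the quantile by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified model of that estimate, not the Prometheus implementation. The bucket bounds and counts are hypothetical, and it assumes cumulative buckets with at least one finite bound.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside this bucket.
            in_bucket = count - prev_count
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency buckets in seconds: le=0.1, 0.5, 1.0, +Inf.
buckets = [(0.1, 60), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))  # roughly 0.81 seconds
```

Because of the interpolation, the result can be no more precise than the bucket boundaries allow, so thresholds such as 1s are best placed on a bucket bound.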
Live session and webstream metrics:

| Metric | Type | Labels | Description |
|---|---|---|---|
| `grimnir_live_sessions_active` | Gauge | `station_id` | Active DJ sessions |
| `grimnir_live_session_duration_seconds` | Histogram | `station_id`, `user_id` | Session duration |
| `grimnir_webstream_health_status` | Gauge | `webstream_id`, `station_id` | Health status (0-2) |
| `grimnir_webstream_failovers_total` | Counter | `webstream_id`, `station_id`, `from_url`, `to_url` | Failover count |
| `grimnir_webstream_health_checks_total` | Counter | `webstream_id`, `status` | Health check count |

Example Query:
# Webstream failover rate
rate(grimnir_webstream_failovers_total[5m])
Database metrics:

| Metric | Type | Labels | Description |
|---|---|---|---|
| `grimnir_database_query_duration_seconds` | Histogram | `operation`, `table` | Query latency |
| `grimnir_database_connections_active` | Gauge | - | Active connections |
| `grimnir_database_errors_total` | Counter | `operation`, `error_type` | Database errors |

Example Query:
# Slow database queries (>100ms)
histogram_quantile(0.95,
rate(grimnir_database_query_duration_seconds_bucket[5m])) > 0.1
Leader election metrics:

| Metric | Type | Labels | Description |
|---|---|---|---|
| `grimnir_leader_election_status` | Gauge | `instance_id` | Leadership status (0/1) |
| `grimnir_leader_election_changes_total` | Counter | `instance_id`, `event` | Leadership changes |

Example Query:
# Current leader
grimnir_leader_election_status == 1
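
Exactly one instance should report a status of 1 at any time. As a sketch of how a script might interpret the per-instance values, assuming it has already collected them into a dictionary (the instance names are hypothetical):

```python
def leader_health(status_by_instance):
    """Classify leadership from a mapping of instance_id -> status (0 or 1)."""
    leaders = sorted(i for i, s in status_by_instance.items() if s == 1)
    if len(leaders) == 1:
        return "ok", leaders
    if not leaders:
        return "no-leader", leaders
    return "multiple-leaders", leaders


print(leader_health({"grimnir-1": 1, "grimnir-2": 0, "grimnir-3": 0}))
print(leader_health({"grimnir-1": 1, "grimnir-2": 1}))
```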
Grimnir Radio uses OpenTelemetry for distributed tracing across all components.
Set these environment variables:
# Enable tracing
GRIMNIR_TRACING_ENABLED=true
# OTLP endpoint (Jaeger, Tempo, etc.)
GRIMNIR_OTLP_ENDPOINT=localhost:4317
# Sample rate (0.0 to 1.0); set only one value per environment
GRIMNIR_TRACING_SAMPLE_RATE=1.0 # 100% sampling for development
# GRIMNIR_TRACING_SAMPLE_RATE=0.1 # 10% sampling for production

Traces propagate through:
- HTTP API → Scheduler → Executor → Media Engine
- HTTP API → Database queries
- HTTP API → Event Bus events
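
To illustrate what a sample rate between 0.0 and 1.0 means in practice, the sketch below validates the `GRIMNIR_TRACING_SAMPLE_RATE` value and derives a keep/drop decision from the trace ID, so that every component reaches the same decision for the same trace. This is a simplified ratio sampler written for this guide; the default of 1.0 and both function names are assumptions, and the sampler Grimnir Radio or a given OpenTelemetry SDK actually uses may differ.

```python
import os


def parse_sample_rate(raw, default=1.0):
    """Validate a sample rate string; rates must lie in [0.0, 1.0]."""
    if raw is None or raw == "":
        return default  # assumed default, for illustration only
    rate = float(raw)
    if not 0.0 <= rate <= 1.0:
        raise ValueError(f"sample rate out of range: {rate}")
    return rate


def should_sample(trace_id_hex, rate):
    """Deterministic decision derived from a 32-hex-digit trace ID."""
    # Treat the low 64 bits of the trace ID as a 63-bit number and
    # keep the trace when it falls below rate * 2**63.
    value = int(trace_id_hex[16:], 16) >> 1
    return value < rate * (1 << 63)


rate = parse_sample_rate(os.environ.get("GRIMNIR_TRACING_SAMPLE_RATE"))
print(should_sample("4bf92f3577b34da6a3ce929d0e0e4736", rate))
```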
# Start Jaeger all-in-one
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/all-in-one:latest
# Open Jaeger UI
open http://localhost:16686

# tempo.yaml
server:
  http_listen_port: 3200
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
storage:
  trace:
    backend: local
    local:
      path: /tmp/tempo/blocks

# Start Tempo
docker run -d --name tempo \
  -p 3200:3200 \
  -p 4317:4317 \
  -v $(pwd)/tempo.yaml:/etc/tempo/tempo.yaml \
  grafana/tempo:latest \
  -config.file=/etc/tempo/tempo.yaml
# Query via Grafana

HTTP POST /api/v1/schedule/generate
├─ scheduler.scheduleStation (5ms)
│  ├─ database.query: clocks (2ms)
│  ├─ smartblock.materialize (3ms)
│  │  └─ database.query: media_items (2ms)
│  └─ database.insert: schedule_entries (1ms)
└─ executor.notifyScheduleUpdate (1ms)

executor.playTrack
├─ database.query: media_item (2ms)
├─ media_engine.Play (gRPC) (50ms)
│  ├─ pipeline.loadGraph (10ms)
│  ├─ pipeline.startPlayback (40ms)
│  └─ telemetry.startStream (1ms)
└─ events.publish: now_playing (1ms)
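
Spans like the ones in the traces above are stitched into a single tree because each call carries the trace context to the next component. The sketch below assumes W3C Trace Context propagation (the `traceparent` header), which OpenTelemetry SDKs commonly use; whether Grimnir Radio is configured this way is not stated in this guide. The header value is the example from the W3C specification, and the helper names and the new span ID are hypothetical.

```python
import re

# version-traceid-parentid-flags, all lowercase hex, per W3C Trace Context.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return trace fields from a traceparent header, or None if invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # reserved version or all-zero IDs are invalid
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(parent, new_span_id):
    """Build the header sent downstream: same trace ID, new span ID."""
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
context = parse_traceparent(incoming)
print(context)
print(child_traceparent(context, "b7ad6b7169203331"))
```

The trace ID stays constant across the whole request, which is what lets the tracing backend group the scheduler, executor, and media engine spans together.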
# Start Grafana
docker run -d --name grafana \
  -p 3000:3000 \
  grafana/grafana:latest
# Add Prometheus data source
# URL: http://prometheus:9090

API dashboard panels:
- Request Rate - `rate(grimnir_api_requests_total[1m])`
- Error Rate - `rate(grimnir_api_requests_total{status_code=~"5.."}[1m])`
- P95 Latency - `histogram_quantile(0.95, rate(grimnir_api_request_duration_seconds_bucket[5m]))`
- Active Connections - `grimnir_api_active_connections`
- WebSocket Connections - `grimnir_api_websocket_connections`

Scheduler dashboard panels:
- Schedule Build Time - `rate(grimnir_schedule_build_duration_seconds_sum[5m]) / rate(grimnir_schedule_build_duration_seconds_count[5m])`
- Schedule Entries - `sum(grimnir_schedule_entries_total)`
- Scheduler Ticks - `rate(grimnir_scheduler_ticks_total[1m])`
- Scheduler Errors - `rate(grimnir_scheduler_errors_total[1m])`

Executor dashboard panels:
- Executor States (gauge) - `grimnir_executor_state`
- Buffer Depth - `grimnir_playout_buffer_depth_samples`
- Dropout Rate - `rate(grimnir_playout_dropout_count_total[1m])`
- State Transitions - `rate(grimnir_executor_state_transitions_total[1m])`
- Priority Changes - `rate(grimnir_executor_priority_changes_total[1m])`

Media engine dashboard panels:
- Loudness Levels - `grimnir_media_engine_loudness_lufs`
- Audio Levels L/R - `grimnir_media_engine_audio_level_left_db`, `grimnir_media_engine_audio_level_right_db`
- Pipeline Restarts - `rate(grimnir_media_engine_pipeline_restarts_total[5m])`
- Operation Latency - `rate(grimnir_media_engine_operation_duration_seconds_sum[5m]) / rate(grimnir_media_engine_operation_duration_seconds_count[5m])`
- Connection Status - `grimnir_media_engine_connection_status`

Database dashboard panels:
- Query Latency - `histogram_quantile(0.95, rate(grimnir_database_query_duration_seconds_bucket[5m]))`
- Active Connections - `grimnir_database_connections_active`
- Query Rate - `rate(grimnir_database_query_duration_seconds_count[1m])`
- Error Rate - `rate(grimnir_database_errors_total[1m])`

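
The alert rules in this guide pair each expression with a `for:` duration. A rule only fires once its expression has been true at every evaluation for at least that long; a single false evaluation resets the timer. The sketch below models that behaviour on a list of evaluation results. The function name and the timeline are hypothetical, and it is a simplified model rather than Prometheus code.

```python
def firing_times(evaluations, for_seconds):
    """Return the evaluation timestamps at which an alert would be firing.

    `evaluations` is a time-ordered list of (timestamp_seconds, condition).
    """
    pending_since = None
    firing = []
    for timestamp, active in evaluations:
        if not active:
            pending_since = None  # condition cleared: start over
            continue
        if pending_since is None:
            pending_since = timestamp  # alert becomes pending
        if timestamp - pending_since >= for_seconds:
            firing.append(timestamp)
    return firing


# Evaluated every 30s with `for: 1m`; the condition clears briefly at t=90.
timeline = [(0, True), (30, True), (60, True), (90, False), (120, True)]
print(firing_times(timeline, 60))
```

A longer `for:` value reduces noise from short blips at the cost of slower notification, which is why the critical rules use short durations.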
# /etc/prometheus/rules/grimnir.yml
groups:
  - name: grimnir_critical
    interval: 30s
    rules:
      - alert: MediaEngineDown
        expr: grimnir_media_engine_connection_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Media engine disconnected for {{ $labels.executor_id }}"
          description: "Media engine gRPC connection down for 1 minute"
      - alert: HighDropoutRate
        expr: rate(grimnir_playout_dropout_count_total[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High dropout rate on {{ $labels.station_id }}"
          description: "Dropout rate: {{ $value | humanize }} per second"
      - alert: ScheduleGap
        expr: grimnir_schedule_entries_total < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low schedule entries for {{ $labels.station_id }}"
          description: "Only {{ $value }} schedule entries remaining"
      - alert: HighAPILatency
        expr: histogram_quantile(0.95, rate(grimnir_api_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High API latency"
          description: "P95 latency: {{ $value }}s (threshold: 1s)"
      - alert: LeaderElectionFailure
        expr: sum(grimnir_leader_election_status) != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Leader election problem"
          description: "{{ $value }} leaders elected (expected 1)"
      - alert: DatabaseConnectionPoolExhausted
        expr: grimnir_database_connections_active > 45
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Database connection pool nearly full"
          description: "{{ $value }}/50 connections used"
  - name: grimnir_performance
    interval: 1m
    rules:
      - alert: SlowDatabaseQueries
        expr: histogram_quantile(0.95, rate(grimnir_database_query_duration_seconds_bucket[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow database queries"
          description: "P95 query time: {{ $value }}s"
      - alert: WebstreamFailover
        expr: rate(grimnir_webstream_failovers_total[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Webstream failover for {{ $labels.webstream_id }}"
          description: "Failover from {{ $labels.from_url }} to {{ $labels.to_url }}"

# /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'station_id']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'grimnir-team'
receivers:
  - name: 'grimnir-team'
    email_configs:
      - to: 'ops@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#grimnir-alerts'
        title: 'Grimnir Radio Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

Check:
rate(process_cpu_seconds_total[1m]) * 100
Common Causes:
- Too many concurrent schedule builds
- Media engine pipeline overload
- Database query performance issues
Solutions:
- Reduce scheduler tick frequency
- Add database indexes (see DATABASE_OPTIMIZATION.md)
- Increase media engine resources
- Enable query caching
Check:
process_resident_memory_bytes
Common Causes:
- Unclosed database connections
- Event bus subscriber leaks
- GStreamer pipeline cleanup issues
Solutions:
- Review connection pool settings
- Check event bus subscription cleanup
- Monitor executor lifecycle
- Restart affected instances
Check:
grimnir_schedule_entries_total < 20
Common Causes:
- Scheduler not running (leader election issue)
- Smart block materialization failures
- Insufficient media library
Solutions:
- Check leader election status
- Review scheduler logs for errors
- Verify media library has sufficient items
- Check clock hour configuration
Check:
rate(grimnir_playout_dropout_count_total[1m])
Common Causes:
- Network latency to media engine
- Media engine CPU overload
- Disk I/O bottleneck
- Buffer depth too low
Solutions:
- Check media engine system resources
- Increase buffer size in DSP config
- Use faster storage (SSD) for media files
- Reduce concurrent pipeline count
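
When judging whether the buffer depth is too low, it helps to convert `grimnir_playout_buffer_depth_samples` into time. The sketch below assumes the gauge counts samples per channel and that the pipeline runs at 48 kHz unless told otherwise; neither is stated in this guide, so check the actual DSP configuration.

```python
def buffer_depth_ms(depth_samples, sample_rate_hz=48000):
    """Convert a buffer depth in samples (per channel) to milliseconds."""
    if sample_rate_hz <= 0:
        raise ValueError("sample rate must be positive")
    return depth_samples * 1000.0 / sample_rate_hz


print(buffer_depth_ms(4800))         # 100.0 ms at the assumed 48 kHz
print(buffer_depth_ms(4410, 44100))  # 100.0 ms at 44.1 kHz
```

A buffer that holds only a few milliseconds of audio leaves little room for scheduling or I/O delays before an underrun occurs.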
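# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
# Retention is commonly set with the --storage.tsdb.retention.time and
# --storage.tsdb.retention.size command-line flags; confirm that your
# Prometheus version accepts these settings in the configuration file.
storage:
  tsdb:
    retention.time: 30d
    retention.size: 50GB

Use structured logging with correlation IDs: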
{
  "level": "info",
  "time": "2026-01-22T10:00:00Z",
  "request_id": "abc123",
  "station_id": "station-1",
  "message": "schedule generated",
  "entries": 48
}

Development: 100% tracing
GRIMNIR_TRACING_SAMPLE_RATE=1.0

Production: 10% tracing
GRIMNIR_TRACING_SAMPLE_RATE=0.1

High-traffic: 1% tracing
GRIMNIR_TRACING_SAMPLE_RATE=0.01

Create separate dashboards for:
- Operations - System health and performance
- Development - Detailed metrics for debugging
- Business - Station uptime, listener stats
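
As an illustration of the structured logging practice described earlier in this guide, the sketch below emits one JSON object per log line using only Python's standard `logging` module. The field names mirror the example log entry; the `JsonFormatter` class, the logger name, and the `fields` convention are assumptions made for this example, not Grimnir Radio's logging code.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "level": record.levelname.lower(),
            "time": datetime.fromtimestamp(record.created, tz=timezone.utc)
            .strftime("%Y-%m-%dT%H:%M:%SZ"),
            "message": record.getMessage(),
        }
        # Context passed via `extra={"fields": {...}}` is merged into the entry.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


logger = logging.getLogger("grimnir.example")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "schedule generated",
    extra={"fields": {"request_id": "abc123", "station_id": "station-1", "entries": 48}},
)
```

Carrying the same `request_id` on every line written while handling a request is what makes it possible to follow that request across log streams.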