Add a basic observability stack to the homelab on a dedicated VM, rolled out in phases.
Architecture
- Dedicated observability VM (Tofu-managed, Debian Bookworm, docker compose stack)
- Separate from the containers VM to isolate production services from monitoring infra
- Same patterns: Tofu for provisioning, Ansible role for config, docker compose for services
Phase 1: Metrics (start here)
Deploy on the observability VM:
- Prometheus — scrape and store metrics
- Grafana — dashboards
- node_exporter — host metrics (on each Linux VM)
- cAdvisor — container metrics (on Docker hosts)
- blackbox_exporter (optional) — HTTP/DNS/TCP/ICMP probes
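As a rough sketch of how the scrape targets for this layout could be generated, the snippet below builds target groups in the JSON shape that Prometheus file-based service discovery (file_sd_configs) reads. The hostnames and the "exporter" label are hypothetical placeholders, and the ports are the exporters' usual defaults (9100 for node_exporter, 8080 for cAdvisor), which the actual deployment may override.

```python
import json

# Usual default listen ports; adjust if the compose stack maps them differently.
NODE_EXPORTER_PORT = 9100
CADVISOR_PORT = 8080


def build_file_sd(hosts):
    """Return file_sd target groups for the given hosts.

    `hosts` maps a hostname to True if that host runs Docker
    (and therefore also runs cAdvisor).
    """
    groups = []
    for name, runs_docker in sorted(hosts.items()):
        # Every Linux VM exposes host metrics via node_exporter.
        groups.append({
            "targets": [f"{name}:{NODE_EXPORTER_PORT}"],
            "labels": {"exporter": "node"},  # hypothetical label name
        })
        if runs_docker:
            # Docker hosts additionally expose container metrics via cAdvisor.
            groups.append({
                "targets": [f"{name}:{CADVISOR_PORT}"],
                "labels": {"exporter": "cadvisor"},
            })
    return groups


# Hypothetical inventory; replace with the real VM names.
inventory = {
    "containers.example.lan": True,
    "observability.example.lan": True,
    "files.example.lan": False,
}
print(json.dumps(build_file_sd(inventory), indent=2))
```

The printed JSON could be written to a file that a file_sd_configs entry points at, so adding a VM means editing the inventory rather than the Prometheus config itself.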
First dashboards:
- Host overview (CPU/memory/disk/network)
- Docker container overview (CPU/memory/restarts)
- Service uptime / latency
- TLS cert expiry
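For the TLS cert expiry panel, the underlying calculation is just "time until notAfter". In Grafana this would more likely come from the blackbox_exporter metric probe_ssl_earliest_cert_expiry, but a small standalone sketch makes the arithmetic concrete. The function names here are illustrative, and fetch_not_after needs network access, so it is defined but not called.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter time (negative once expired).

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400


def fetch_not_after(host, port=443, timeout=5.0):
    """Connect to host:port and return the peer certificate's notAfter string."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Offline example with a fixed "now" ten days before the expiry time.
sample = "Jan  5 09:34:43 2018 GMT"
ten_days_before = ssl.cert_time_to_seconds(sample) - 10 * 86400
print(days_until_expiry(sample, now=ten_days_before))
```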
Infrastructure work:
- Hostname: observability.lan.quietlife.net
Phase 2: Logs
- Loki — log aggregation
- Grafana Alloy — log shipping (OTel-compatible for future flexibility)
- Collect: system logs, container logs, selected app logs
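To get a feel for what ends up in Loki regardless of the shipper chosen, the sketch below builds the JSON body accepted by Loki's push endpoint (POST /loki/api/v1/push): a set of labels identifying a stream, plus timestamped lines. The label names and log lines are made up for illustration; in practice Alloy would build and send these batches.

```python
import json
import time


def loki_push_payload(labels, lines, now_ns=None):
    """Build a JSON-serialisable body for Loki's push endpoint.

    `labels` identifies the stream; `lines` is a list of log lines.
    """
    if now_ns is None:
        now_ns = time.time_ns()
    # Each value is [timestamp in nanoseconds as a string, log line].
    # Lines are spaced 1 ns apart so they keep their order within the stream.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


# Hypothetical labels and lines for a container log stream.
payload = loki_push_payload(
    {"host": "containers", "source": "docker"},
    ["service started", "listening on :8080"],
)
print(json.dumps(payload, indent=2))
```

Keeping the label set small and stable (host, source) and leaving the variable parts in the log line itself fits the goal of collecting only selected sources.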
Phase 3: Alerts
- Alertmanager — routing/dedup/silences
- Start with a small set of useful alerts:
  - Host down for 5m
  - Disk > 85% on important volumes
  - Repeated container restarts
  - Backup stale
  - Cert expiring soon
  - External probe failing
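Several of these alerts combine a condition with a hold time ("down for 5m"), which is what keeps a single missed scrape from paging anyone. The following is a simplified illustration of that idea, not how Prometheus implements its `for:` clause: an alert fires only if the condition has held on every sample for at least the hold period. The sample data and function name are invented for the example.

```python
def firing(samples, condition, hold_seconds):
    """Return True if `condition` has held continuously for `hold_seconds`.

    `samples` is a list of (unix_timestamp, value) pairs sorted by time.
    The streak is measured from the first sample of the current unbroken
    run of matching samples up to the latest sample.
    """
    if not samples:
        return False
    pending_since = None
    for ts, value in samples:
        if condition(value):
            if pending_since is None:
                pending_since = ts  # condition just became true
        else:
            pending_since = None  # any healthy sample resets the streak
    latest_ts = samples[-1][0]
    return pending_since is not None and latest_ts - pending_since >= hold_seconds


# Host scraped every 60s: up == 0 from t=60 onward, so down for 300s at t=360.
down = [(0, 1), (60, 0), (120, 0), (180, 0), (240, 0), (300, 0), (360, 0)]
print(firing(down, lambda up: up == 0, 300))

# A flapping host never accumulates a 5-minute streak.
flapping = [(0, 1), (60, 0), (120, 1), (180, 0), (240, 0)]
print(firing(flapping, lambda up: up == 0, 300))

# Disk usage above 85% with no hold time.
print(firing([(0, 80.0), (60, 91.5)], lambda pct: pct > 85, 0))
```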
Phase 4: Traces (later, optional)
- Tempo + OpenTelemetry Collector or Alloy
- Only if useful for learning or instrumenting custom apps
Notes
- OpenBSD firewall: No official node_exporter for OpenBSD. Monitor externally via blackbox_exporter probes instead.
- Storage: Local disk with short retention (15-30 days) rather than NFS — simpler and sufficient for learning.
- Alloy vs Promtail: Alloy is the forward-looking choice (Promtail is deprecated), but has a steeper config learning curve. Decide in Phase 2.
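On the storage note above, a back-of-the-envelope estimate helps confirm that local disk is enough for 15-30 days. The Prometheus storage documentation gives the rule of thumb disk = retention seconds x ingested samples per second x bytes per sample, with roughly 1-2 bytes per sample. The series count and scrape interval below are assumptions for a small homelab, not measured values.

```python
def estimate_disk_bytes(retention_days, series, scrape_interval_s, bytes_per_sample=2.0):
    """Rough Prometheus disk estimate from the documented rule of thumb."""
    # Each active series yields one sample per scrape interval.
    samples_per_second = series / scrape_interval_s
    return retention_days * 86400 * samples_per_second * bytes_per_sample


# Assumed workload: 20,000 active series scraped every 15s, kept for 30 days.
estimate = estimate_disk_bytes(retention_days=30, series=20_000, scrape_interval_s=15)
print(f"{estimate / 1e9:.1f} GB")  # about 6.9 GB under these assumptions
```

Even with generous assumptions the result stays in the single-digit gigabyte range, which supports keeping metrics on the VM's local disk.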
Non-goals
- Full tracing everywhere
- Giant multi-backend comparison
- Ingesting every noisy log source
- Self-hosted Splunk