Add observability stack (Prometheus, Grafana, Loki) #182

@cwage

Add a basic observability stack to the homelab on a dedicated VM, rolled out in phases.

Architecture

  • Dedicated observability VM (Tofu-managed, Debian Bookworm, docker compose stack)
  • Separate from the containers VM to isolate production services from monitoring infra
  • Same patterns as the rest of the homelab: Tofu for provisioning, an Ansible role for config, docker compose for services

Phase 1: Metrics (start here)

Deploy on the observability VM, except where a different host is noted:

  • Prometheus — scrape and store metrics
  • Grafana — dashboards
  • node_exporter — host metrics (on each Linux VM)
  • cAdvisor — container metrics (on Docker hosts)
  • blackbox_exporter (optional) — HTTP/DNS/TCP/ICMP probes

First dashboards:

  • Host overview (CPU/memory/disk/network)
  • Docker container overview (CPU/memory/restarts)
  • Service uptime / latency
  • TLS cert expiry
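
For the TLS cert expiry panel, the value of interest is simply the number of days between now and the certificate's notAfter date. A minimal sketch of that arithmetic in Python, using only the standard library; the function names are made up for illustration, and in the stack itself this number would more likely come from blackbox_exporter than from a script:

```python
import socket
import ssl

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now_epoch):
    """Days between now_epoch and a certificate's notAfter string.

    not_after uses the format returned by getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now_epoch) / SECONDS_PER_DAY


def fetch_not_after(host, port=443, timeout=5.0):
    """Return the notAfter field of the certificate served at host:port."""
    # The default context verifies against the system trust store, so a
    # certificate from a private CA fails here unless that CA is trusted.
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Calling `days_until_expiry(fetch_not_after(host), time.time())` for each routed hostname would give the raw number a dashboard panel or alert threshold is built on.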

Infrastructure work:

  • Tofu: define observability VM (static IP in 10.10.15.10-99 range)
  • DHCP/DNS: reserve IP, add observability.lan.quietlife.net
  • Ansible: new role for the docker compose stack
  • Traefik: route to Grafana
  • Deploy node_exporter on other Linux VMs
  • Prometheus scrape configs for all targets
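
One way to keep the scrape targets in step with the VM inventory is Prometheus file-based service discovery, which reads target groups from JSON files. The sketch below generates such a file in Python. The inventory, the containers hostname, and the `exporter` label are assumptions for illustration; the ports are the exporters' defaults.

```python
import json

# Default listen ports for node_exporter and cAdvisor.
EXPORTER_PORTS = {"node": 9100, "cadvisor": 8080}

# Hypothetical inventory: hostname -> exporters running on that host.
INVENTORY = {
    "observability.lan.quietlife.net": ["node", "cadvisor"],
    "containers.lan.quietlife.net": ["node", "cadvisor"],  # assumed hostname
}


def build_file_sd(inventory, ports=EXPORTER_PORTS):
    """Return one file_sd target group per exporter type."""
    groups = {}
    for host, exporters in sorted(inventory.items()):
        for name in exporters:
            group = groups.setdefault(
                name, {"targets": [], "labels": {"exporter": name}}
            )
            group["targets"].append(f"{host}:{ports[name]}")
    # Sort by exporter name so the output is stable between runs.
    return [groups[name] for name in sorted(groups)]


if __name__ == "__main__":
    print(json.dumps(build_file_sd(INVENTORY), indent=2))
```

The output could be written by the Ansible role to a path listed under `file_sd_configs` in the Prometheus config, so adding a VM means editing the inventory rather than the scrape config itself.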

Phase 2: Logs

  • Loki — log aggregation
  • Grafana Alloy — log shipping (OTel-compatible for future flexibility)
  • Collect: system logs, container logs, selected app logs
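
To make the shipping step concrete: whatever agent is chosen ends up sending Loki batches of labelled log lines. The Python sketch below builds the JSON body accepted by Loki's push endpoint (`/loki/api/v1/push`). The label names and sample lines are made up, and Alloy would do this internally rather than through a script like this.

```python
import json
import time


def loki_push_payload(labels, lines, now_ns=None):
    """Build a JSON-serialisable body for Loki's push API.

    Each entry is [timestamp, line], where the timestamp is a string of
    nanoseconds since the Unix epoch.
    """
    timestamp = str(now_ns if now_ns is not None else time.time_ns())
    return {
        "streams": [
            {
                "stream": dict(labels),
                "values": [[timestamp, line] for line in lines],
            }
        ]
    }


if __name__ == "__main__":
    body = loki_push_payload(
        {"host": "containers", "source": "docker"},  # hypothetical labels
        ["container started", "listening on :8080"],
    )
    print(json.dumps(body, indent=2))
```

Keeping the label set this small is deliberate: every distinct label combination becomes a separate stream in Loki, which matters when deciding which log sources to collect.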

Phase 3: Alerts

  • Alertmanager — routing/dedup/silences
  • Start with a small set of useful alerts:
    • Host down for 5m
    • Disk > 85% on important volumes
    • Repeated container restarts
    • Backup stale
    • Cert expiring soon
    • External probe failing
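
The alert list above maps fairly directly onto Prometheus alerting rules. The Python sketch below assembles them in the shape of a rule file (groups, then rules with `alert`, `expr`, and `for`). All thresholds other than the 5m and 85% named above are guesses, the restart expression is one possible approximation based on cAdvisor's container start time, and `backup_last_success_timestamp_seconds` is a hypothetical metric that a backup job would have to export.

```python
import json

# (alert name, PromQL expression, "for" duration)
ALERTS = [
    ("HostDown", "up == 0", "5m"),
    (
        "DiskUsageHigh",
        "(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 85",
        "10m",
    ),
    # Approximation: count changes in cAdvisor's container start time.
    ("ContainerRestarting", "changes(container_start_time_seconds[15m]) > 2", "0m"),
    # Hypothetical metric; nothing in the stack exports it by default.
    ("BackupStale", "time() - backup_last_success_timestamp_seconds > 86400 * 2", "0m"),
    ("CertExpiringSoon", "probe_ssl_earliest_cert_expiry - time() < 86400 * 14", "1h"),
    ("ProbeFailing", "probe_success == 0", "5m"),
]


def build_rule_group(name="homelab", alerts=ALERTS):
    """Return a dict shaped like a Prometheus rule file with one group."""
    rules = [
        {"alert": alert, "expr": expr, "for": duration}
        for alert, expr, duration in alerts
    ]
    return {"groups": [{"name": name, "rules": rules}]}


if __name__ == "__main__":
    # Printed as JSON for inspection; the rule file on disk would be YAML.
    print(json.dumps(build_rule_group(), indent=2))
```

Volume filtering for "important volumes" is left out; it would be a label matcher on the filesystem metrics once the mount points are known.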

Phase 4: Traces (later, optional)

  • Tempo, fed by either the OpenTelemetry Collector or Alloy
  • Only if useful for learning or instrumenting custom apps

Notes

  • OpenBSD firewall: No official node_exporter for OpenBSD. Monitor externally via blackbox_exporter probes instead.
  • Storage: Local disk with short retention (15-30 days) rather than NFS — simpler and sufficient for learning.
  • Alloy vs Promtail: Alloy is the forward-looking choice (Promtail is deprecated), but has a steeper config learning curve. Decide in Phase 2.
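
As a rough check that local disk is enough for the 15-30 day retention above, the Prometheus storage documentation gives a sizing rule of thumb: retention time in seconds, times ingested samples per second, times bytes per sample (roughly 1 to 2 bytes after compression). A small Python sketch of that arithmetic follows; the series count and scrape interval are placeholder guesses, not measurements from this homelab.

```python
SECONDS_PER_DAY = 86400


def estimate_tsdb_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Rule-of-thumb Prometheus disk usage.

    retention_seconds * samples_per_second * bytes_per_sample
    """
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * SECONDS_PER_DAY
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Placeholder inputs: 15,000 active series scraped every 15 seconds.
    for days in (15, 30):
        size_gb = estimate_tsdb_bytes(15_000, 15, days) / 1e9
        print(f"{days} days: about {size_gb:.1f} GB")
```

Under those assumed inputs the estimate stays in the single-digit gigabytes even at 30 days, which is consistent with choosing local disk over NFS; Loki's log storage would need a separate estimate.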

Non-goals

  • Full tracing everywhere
  • Giant multi-backend comparison
  • Ingesting every noisy log source
  • Self-hosted Splunk
