Add observability stack (Prometheus, Grafana, Loki)

Add a basic observability stack to the homelab on a dedicated VM, rolled out in phases.

## Architecture

- Dedicated `observability` VM (Tofu-managed, Debian Bookworm, docker compose stack)
- Separate from the `containers` VM to isolate production services from monitoring infra
- Same patterns: Tofu for provisioning, Ansible role for config, docker compose for services

## Phase 1: Metrics (start here)

Deploy on the observability VM:
- Prometheus — scrape and store metrics
- Grafana — dashboards
- node_exporter — host metrics (on each Linux VM)
- cAdvisor — container metrics (on Docker hosts)
- blackbox_exporter (optional) — HTTP/DNS/TCP/ICMP probes

First dashboards:
- Host overview (CPU/memory/disk/network)
- Docker container overview (CPU/memory/restarts)
- Service uptime / latency
- TLS cert expiry
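
As a rough illustration of what the TLS cert expiry panel computes, here is a minimal Python sketch that turns a certificate's `notAfter` string into days remaining. The function names and the 14-day warning window are assumptions for illustration, not part of the plan. In the actual stack the same value would more likely come from blackbox_exporter's `probe_ssl_earliest_cert_expiry` metric.

```python
import ssl

SECONDS_PER_DAY = 86_400


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (Unix seconds) and a certificate's notAfter time.

    `not_after` uses the format returned by ssl.SSLSocket.getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / SECONDS_PER_DAY


def is_expiring_soon(not_after: str, now: float, warn_days: float = 14) -> bool:
    """True when fewer than `warn_days` remain (14 is an assumed default)."""
    return days_until_expiry(not_after, now) < warn_days
```

Passing `now` explicitly keeps the function deterministic; a caller would typically pass `time.time()`.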

Infrastructure work:
- [ ] Tofu: define observability VM (static IP in 10.10.15.10-99 range)
- [ ] DHCP/DNS: reserve IP, add `observability.lan.quietlife.net`
- [ ] Ansible: new role for the docker compose stack
- [ ] Traefik: route to Grafana
- [ ] Deploy node_exporter on other Linux VMs
- [ ] Prometheus scrape configs for all targets
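
One possible way to handle the scrape-config item is Prometheus file-based service discovery, which reads target groups from a JSON file. The sketch below generates such a file from a small inventory. The inventory contents, the `containers.lan.quietlife.net` hostname, and the `exporter` label are assumptions for illustration; ports 9100 and 8080 are the upstream defaults for node_exporter and cAdvisor.

```python
import json

# Default listen ports of the upstream exporters.
DEFAULT_PORTS = {"node": 9100, "cadvisor": 8080}

# Hypothetical inventory: which hosts exist and whether they run Docker.
INVENTORY = {
    "observability.lan.quietlife.net": {"docker": True},
    "containers.lan.quietlife.net": {"docker": True},  # assumed hostname
}


def build_file_sd(inventory: dict) -> list:
    """Build file_sd target groups: every host gets node_exporter,
    and only Docker hosts get cAdvisor."""
    groups = []
    for exporter, port in DEFAULT_PORTS.items():
        hosts = sorted(
            host
            for host, facts in inventory.items()
            if exporter == "node" or facts.get("docker")
        )
        if hosts:
            groups.append(
                {
                    "targets": [f"{host}:{port}" for host in hosts],
                    "labels": {"exporter": exporter},  # illustrative label
                }
            )
    return groups


if __name__ == "__main__":
    # The output could be written to a file referenced by file_sd_configs.
    print(json.dumps(build_file_sd(INVENTORY), indent=2))
```

Generating the file from the same inventory Ansible uses would keep targets in sync as VMs are added, though static `static_configs` entries would work just as well at this scale.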

## Phase 2: Logs

- Loki — log aggregation
- Grafana Alloy — log shipping (OTel-compatible for future flexibility)
- Collect: system logs, container logs, selected app logs

## Phase 3: Alerts

- Alertmanager — routing/dedup/silences
- Start with a small set of useful alerts:
  - Host down for 5m
  - Disk > 85% on important volumes
  - Repeated container restarts
  - Backup stale
  - Cert expiring soon
  - External probe failing
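
To make the "down for 5m" wording concrete, here is a simplified Python sketch of how a `for`-style hold period behaves: the condition must stay true continuously for the whole window before the alert fires. The class name is hypothetical and this is a model of the idea, not Prometheus's implementation (which evaluates rules at fixed intervals).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Tracks one alert through the inactive -> pending -> firing states."""

    for_seconds: float
    pending_since: Optional[float] = None

    def update(self, condition: bool, now: float) -> str:
        """Feed one evaluation; `now` is a timestamp in seconds."""
        if not condition:
            # Any healthy evaluation resets the hold period.
            self.pending_since = None
            return "inactive"
        if self.pending_since is None:
            self.pending_since = now
        if now - self.pending_since >= self.for_seconds:
            return "firing"
        return "pending"
```

A short blip that recovers before the window elapses never reaches `firing`, which is what keeps a small alert set from paging on transient failures.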

## Phase 4: Traces (later, optional)

- Tempo + OpenTelemetry Collector or Alloy
- Only if useful for learning or instrumenting custom apps

## Notes

- **OpenBSD firewall**: No official node_exporter for OpenBSD. Monitor externally via blackbox_exporter probes instead.
- **Storage**: Local disk with short retention (15-30 days) rather than NFS — simpler and sufficient for learning.
- **Alloy vs Promtail**: Alloy is the forward-looking choice (Promtail is deprecated), but has a steeper config learning curve. Decide in Phase 2.
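
For the storage note, a back-of-the-envelope estimate can help size the local disk. The sketch below follows the rule of thumb from the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample, at roughly 1-2 bytes per sample). The series count and scrape interval in the usage note are made-up inputs, not measurements from this homelab.

```python
SECONDS_PER_DAY = 86_400


def estimate_tsdb_bytes(
    active_series: int,
    scrape_interval_s: float,
    retention_days: float,
    bytes_per_sample: float = 2.0,  # upper end of the 1-2 byte rule of thumb
) -> float:
    """Rough Prometheus disk estimate for a given retention window."""
    samples_per_second = active_series / scrape_interval_s
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample
```

With a hypothetical 15,000 active series scraped every 15 seconds and 30 days of retention, `estimate_tsdb_bytes(15_000, 15, 30)` comes to about 5.2 GB, which suggests local disk is comfortably sufficient at this scale.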

## Non-goals
- Full tracing everywhere
- Giant multi-backend comparison
- Ingesting every noisy log source
- Self-hosted Splunk


Add observability stack (Prometheus, Grafana, Loki) #182
