feat(ansible): pve-host-netmon role — Proxmox network bandwidth monitoring #1383

Open
dark-vex wants to merge 3 commits into main from feat/pve-host-netmon

dark-vex (Owner) commented May 7, 2026

Summary

  • Adds new Ansible role ansible/pve-host-netmon/ that installs Grafana Alloy directly on Proxmox hosts to track per-interface network bandwidth
  • Uses Alloy's built-in prometheus.exporter.unix with only the netdev collector — produces ~4–6 timeseries total, negligible Grafana Cloud quota impact
  • Excludes all Proxmox virtual interfaces (vmbr*, veth*, fwbr*, fwln*, fwpr*, tap*) via configurable netmon_device_exclude regex
  • Metrics: node_network_receive_bytes_total + node_network_transmit_bytes_total per physical NIC → enables increase(...[30d]) for monthly totals and alerting against the 25 TB housing limit
  • No Cacti, no Zabbix agent — extends the existing Grafana Cloud stack already used for PVE monitoring
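The monthly-total bullet depends on how Prometheus treats counters: `increase()` sums growth between samples and treats any drop in value as a reset (for example after a host reboot). The following is a rough Python sketch of that accumulation only. It ignores the extrapolation that the real `increase()` applies at the window edges, and the function name and sample values are made up for illustration.

```python
def counter_increase(samples):
    """Sum the growth of a counter, treating any drop as a reset to zero."""
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Counter reset: the counter restarted from zero, so the whole
            # current value is traffic seen since the reset.
            total += cur
    return total


# Made-up byte counter readings; the drop from 900 to 50 models a reboot.
rx_samples = [100, 400, 900, 50, 300]
print(counter_increase(rx_samples))
```

Because resets are absorbed this way, a reboot of the Proxmox host in the middle of the month should not zero out the running total.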

Role structure

| File | Purpose |
| --- | --- |
| defaults/main.yml | Alloy version, config dir, scrape interval, device exclude regex |
| tasks/main.yml | Grafana APT repo, Alloy install, config deploy, systemd enable |
| templates/alloy-config.alloy.j2 | Alloy config: netdev-only exporter + remote_write to Grafana Cloud |
| handlers/main.yml | Reload systemd / Restart alloy |
| inventory.yml | pve_hosts group (rabbit-01-psp placeholder) |
| playbook.yml | Production playbook; run with -e @secrets.yml |
| molecule/default/ | Full converge + idempotence + verify test suite |

Molecule test results

converge:   12 tasks, 10 changed, 0 failed
idempotence: 11 tasks,  0 changed (fully idempotent)
verify:     17 tasks,  0 failed

Verify checks: GPG keyring, alloy package, config dir/file, set_collectors = ["netdev"], device_exclude, remote_write URL, prometheus.exporter.unix block, CUSTOM_ARGS in /etc/default/alloy, service enabled, pve-exporter absent.

Deployment (rabbit-01-psp)

Before running, fill in the real SSH host and IP address in inventory.yml, then provide the secrets file:

ansible-playbook -i ansible/pve-host-netmon/inventory.yml \
  ansible/pve-host-netmon/playbook.yml \
  -e @secrets.yml

secrets.yml must contain:

  • grafana_api_key
  • grafana_metrics_username
  • grafana_prometheus_url
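A pre-flight check for those three keys could look like the sketch below. It is not part of the role: `missing_secrets` is a hypothetical helper, and it assumes secrets.yml has already been parsed into a plain dict by whatever YAML loader is at hand.

```python
REQUIRED_SECRETS = (
    "grafana_api_key",
    "grafana_metrics_username",
    "grafana_prometheus_url",
)


def missing_secrets(secrets):
    """Return the required keys that are absent or empty in a secrets mapping."""
    return [key for key in REQUIRED_SECRETS if not secrets.get(key)]


# Placeholder values only; one key is missing and one is empty.
example = {"grafana_api_key": "placeholder", "grafana_prometheus_url": ""}
print(missing_secrets(example))
```

An empty list means all three values are present, which is the state the playbook expects before it templates the remote_write block.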

Test plan

  • molecule test passes (converge + idempotence + verify)
  • Deploy to rabbit-01-psp and confirm metrics appear in Grafana Cloud under node_network_receive_bytes_total{site="bgy"}
  • Verify only physical interfaces appear (no vmbr*, veth*, etc.)
  • Build Grafana dashboard with monthly total gauge and 25 TB threshold alert

🤖 Generated with Claude Code

dark-vex and others added 2 commits May 7, 2026 18:37
… monitoring

Installs Grafana Alloy directly on Proxmox hosts (not in LXC) to collect
per-physical-interface byte counters via prometheus.exporter.unix (netdev
collector only). Goal: track monthly bandwidth against a 25 TB housing
limit without adding Cacti or fragmenting monitoring into Zabbix.

Only node_network_receive_bytes_total and node_network_transmit_bytes_total
are emitted (~4-6 timeseries total). All Proxmox virtual interfaces
(vmbr*, veth*, fwbr*, fwln*, fwpr*, tap*) are excluded by regex to stay
within existing Grafana Cloud metrics quota.
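To illustrate the exclusion described above, here is a small Python sketch that filters interface names with a regex built from the listed prefixes. The pattern is an assumption; the role's actual `netmon_device_exclude` default is not shown on this page. The interface names are typical Proxmox examples rather than output from rabbit-01-psp.

```python
import re

# Assumed pattern derived from the prefixes named in the commit message;
# the real netmon_device_exclude value may differ.
DEVICE_EXCLUDE = re.compile(r"^(vmbr|veth|fwbr|fwln|fwpr|tap).*$")


def physical_interfaces(names):
    """Keep only the interface names the exclude pattern does not match."""
    return [name for name in names if not DEVICE_EXCLUDE.match(name)]


names = [
    "eno1", "eno2", "vmbr0", "veth101i0",
    "fwbr100i0", "fwln100i0", "fwpr100p0", "tap100i0",
]
print(physical_interfaces(names))
```

With this assumed pattern, anything not listed (such as `lo` or a bond device) would still pass through, so the real regex may need extra prefixes depending on the host.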

Role includes full molecule test suite (converge + idempotence + verify)
mirroring the pve-monitoring role pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds a Grafana dashboard (proxmox folder) to track monthly bandwidth
usage on rabbit-01-psp against the 25 TB housing limit (in+out combined,
calendar month from the 1st).

Dashboard panels (default time range: now/M → now, i.e. "This month"):
- Gauge: % of 25 TB limit (green/yellow/orange/red thresholds at 70/90/95%)
- Stat: total used this month (decbytes)
- Stat: remaining budget (colour-inverted thresholds)
- Stat: daily average bytes/day via $__range_s
- Time series: per-interface rx/tx rate (binBps) for spike detection

All stat panels use instant queries with $__range so they compute the
exact month-to-date delta rather than a rolling 30-day approximation.
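The arithmetic behind the gauge and stat panels can be sketched in Python as follows. This is an illustration under stated assumptions, not the dashboard's code. `budget_summary` and `month_to_date_seconds` are hypothetical names, the usage figure is invented, and the month-to-date window is computed from a naive local datetime, whereas Grafana derives `$__range_s` from the dashboard's time picker.

```python
from datetime import datetime

# 25 TB in decimal bytes, matching the 25000000000000 constant in the queries.
LIMIT_BYTES = 25 * 10**12


def month_to_date_seconds(now):
    """Seconds elapsed since 00:00 on the 1st of the current month."""
    start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    return (now - start).total_seconds()


def budget_summary(used_bytes, now):
    """Mirror the percent-used, remaining and daily-average panel formulas."""
    range_s = month_to_date_seconds(now)
    return {
        "percent_used": used_bytes / LIMIT_BYTES * 100,
        "remaining_bytes": LIMIT_BYTES - used_bytes,
        # Raises ZeroDivisionError at the exact start of the month.
        "daily_average_bytes": used_bytes / (range_s / 86400),
    }


# Invented figure: 5 TB consumed by the start of the 11th (10 full days).
summary = budget_summary(5 * 10**12, datetime(2026, 5, 11))
print(summary)
```

Because the window starts on the 1st rather than 30 days back, the totals reset with the calendar month, which is the property the instant `$__range` queries are meant to provide.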

Also fixes the Alloy config template to set instance="rabbit-01-psp" via
prometheus.relabel (otherwise the label would be the internal scrape
address). molecule verify updated to assert the relabel block is present
(18 checks total, all pass).

generate_dashboards.py changes:
- p_stat gains instant=False param (backward-compatible)
- p_gauge added (type: gauge with threshold markers)
- make_dashboard gains time_range/refresh overrides
- stable_uid handles "proxmox" cluster (prefix "pve")
- build_rabbit_netbw() builder added
- APPS["proxmox"] registered; "rabbit-netbw" builder wired in main()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
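For readers without access to generate_dashboards.py, the sketch below shows roughly what a gauge-panel builder could look like. It is reconstructed from the panel JSON in the Terraform plan on this page, not from the script itself, so the `p_gauge` signature and parameter names here are assumptions.

```python
import json


def p_gauge(title, expr, steps, grid_pos, unit="percent", max_value=100):
    """Hypothetical stand-in for the p_gauge helper added in this commit."""
    # Grafana expects a base step with a null value, followed by the thresholds.
    threshold_steps = [{"color": "green", "value": None}]
    threshold_steps += [{"color": color, "value": value} for color, value in steps]
    return {
        "type": "gauge",
        "title": title,
        "datasource": {"type": "prometheus", "uid": "grafanacloud-prom"},
        "gridPos": grid_pos,
        "fieldConfig": {
            "defaults": {
                "color": {"mode": "thresholds"},
                "mappings": [],
                "min": 0,
                "max": max_value,
                "unit": unit,
                "thresholds": {"mode": "absolute", "steps": threshold_steps},
            },
            "overrides": [],
        },
        "options": {
            "orientation": "auto",
            "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": False},
            "showThresholdLabels": False,
            "showThresholdMarkers": True,
        },
        "targets": [
            {"expr": expr, "instant": True, "legendFormat": "", "refId": "A"}
        ],
    }


panel = p_gauge(
    title="% of 25 TB Limit",
    expr="vector(0)",  # placeholder query, not the expression from the plan
    steps=[("yellow", 70), ("orange", 90), ("red", 95)],
    grid_pos={"h": 8, "w": 8, "x": 0, "y": 1},
)
print(json.dumps(panel["fieldConfig"]["defaults"]["thresholds"], indent=2))
```

Keeping the thresholds as a list of (color, value) pairs makes it easy to reuse the same 70/90/95 split for other limits by changing only `max_value` and the query.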
wiz-b661a10a98 (Bot) commented May 7, 2026

Wiz Scan Summary

| Scanner | Findings |
| --- | --- |
| Vulnerabilities | - |
| Sensitive Data | 2 Low, 3 Info |
| Secrets | - |
| IaC Misconfigurations | - |
| SAST Findings | - |
| Software Management Findings | - |
| Total | 2 Low, 3 Info |


Comment thread terraform/grafana/dashboards/proxmox/rabbit-netbw.json
Comment thread terraform/grafana/folders.tf
Comment thread terraform/grafana/dashboards.tf
github-actions (Bot, Contributor) commented May 7, 2026

Terraform Format and Style 🖌: success

Terraform Initialization ⚙️: success

Terraform Validation 🤖: success

Validation Output

Success! The configuration is valid.


Terraform Plan 📖: success

Show Plan

data.onepassword_item.grafana: Reading...
data.onepassword_item.grafana: Read complete after 7s [id=vaults/66qfxcmgwlhutunx6slav6fyve/items/fnpmabehc3obdrdwbdosw63z6m]
grafana_folder.kubenuc: Refreshing state... [id=0:kubenuc]
grafana_folder.k8s_vms_daniele: Refreshing state... [id=0:k8s-vms-daniele]
grafana_dashboard.kubenuc_film_tv_exporter: Refreshing state... [id=0:kn-film-tv-exporter]
grafana_dashboard.kubenuc_jenkins: Refreshing state... [id=0:kn-jenkins]
grafana_dashboard.kubenuc_cert_manager: Refreshing state... [id=0:kn-cert-manager]
grafana_dashboard.kubenuc_nut: Refreshing state... [id=0:kn-nut]
grafana_dashboard.kubenuc_zabbix: Refreshing state... [id=0:kn-zabbix]
grafana_dashboard.kubenuc_portainer: Refreshing state... [id=0:kn-portainer]
grafana_dashboard.kubenuc_haproxy_ingress: Refreshing state... [id=0:kn-haproxy-ingress]
grafana_dashboard.kubenuc_nextcloud: Refreshing state... [id=0:kn-nextcloud]
grafana_dashboard.kubenuc_s3: Refreshing state... [id=0:kn-s3]
grafana_dashboard.kubenuc_sso: Refreshing state... [id=0:kn-sso]
grafana_dashboard.kubenuc_falco: Refreshing state... [id=0:kn-falco]
grafana_dashboard.kubenuc_grafana_alloy: Refreshing state... [id=0:kn-grafana-alloy]
grafana_dashboard.kubenuc_jellyfin: Refreshing state... [id=0:kn-jellyfin]
grafana_dashboard.kubenuc_net_mon: Refreshing state... [id=0:kn-net-mon]
grafana_dashboard.kubenuc_unifi: Refreshing state... [id=0:kn-unifi]
grafana_dashboard.kubenuc_postgresql: Refreshing state... [id=0:kn-postgresql]
grafana_dashboard.kubenuc_jfrog_acr: Refreshing state... [id=0:kn-jfrog-acr]
grafana_dashboard.kubenuc_cloudflare: Refreshing state... [id=0:kn-cloudflare]
grafana_dashboard.kubenuc_openebs: Refreshing state... [id=0:kn-openebs]
grafana_dashboard.kubenuc_system_upgrade_controller: Refreshing state... [id=0:kn-system-upgrade-controller]
grafana_dashboard.kubenuc_harbor: Refreshing state... [id=0:kn-harbor]
grafana_dashboard.kubenuc_1password: Refreshing state... [id=0:kn-1password]
grafana_dashboard.kubenuc_bareos: Refreshing state... [id=0:kn-bareos]
grafana_dashboard.k8s_vms_daniele_blackbox: Refreshing state... [id=0:kv-blackbox]
grafana_dashboard.k8s_vms_daniele_cert_manager: Refreshing state... [id=0:kv-cert-manager]
grafana_dashboard.k8s_vms_daniele_cloudflare: Refreshing state... [id=0:kv-cloudflare]
grafana_dashboard.k8s_vms_daniele_teleport_agent: Refreshing state... [id=0:kv-teleport-agent]
grafana_dashboard.k8s_vms_daniele_semaphore: Refreshing state... [id=0:kv-semaphore]
grafana_dashboard.k8s_vms_daniele_node_exporter: Refreshing state... [id=0:kv-node-exporter]
grafana_dashboard.k8s_vms_daniele_1password: Refreshing state... [id=0:kv-1password]
grafana_dashboard.k8s_vms_daniele_awx: Refreshing state... [id=0:kv-awx]
grafana_dashboard.k8s_vms_daniele_system_upgrade_controller: Refreshing state... [id=0:kv-system-upgrade-controller]
grafana_dashboard.k8s_vms_daniele_grafana_alloy: Refreshing state... [id=0:kv-grafana-alloy]
grafana_dashboard.k8s_vms_daniele_falco: Refreshing state... [id=0:kv-falco]

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # grafana_dashboard.proxmox_rabbit_netbw will be created
  + resource "grafana_dashboard" "proxmox_rabbit_netbw" {
      + config_json  = jsonencode(
            {
              + annotations   = {
                  + list = []
                }
              + editable      = true
              + panels        = [
                  + {
                      + collapsed = false
                      + gridPos   = {
                          + h = 1
                          + w = 24
                          + x = 0
                          + y = 0
                        }
                      + panels    = []
                      + title     = "Monthly Budget — rabbit-01-psp"
                      + type      = "row"
                    },
                  + {
                      + datasource  = {
                          + type = "prometheus"
                          + uid  = "grafanacloud-prom"
                        }
                      + fieldConfig = {
                          + defaults  = {
                              + color      = {
                                  + mode = "thresholds"
                                }
                              + mappings   = []
                              + max        = 100
                              + min        = 0
                              + thresholds = {
                                  + mode  = "absolute"
                                  + steps = [
                                      + {
                                          + color = "green"
                                          + value = null
                                        },
                                      + {
                                          + color = "yellow"
                                          + value = 70
                                        },
                                      + {
                                          + color = "orange"
                                          + value = 90
                                        },
                                      + {
                                          + color = "red"
                                          + value = 95
                                        },
                                    ]
                                }
                              + unit       = "percent"
                            }
                          + overrides = []
                        }
                      + gridPos     = {
                          + h = 8
                          + w = 8
                          + x = 0
                          + y = 1
                        }
                      + options     = {
                          + orientation          = "auto"
                          + reduceOptions        = {
                              + calcs  = [
                                  + "lastNotNull",
                                ]
                              + fields = ""
                              + values = false
                            }
                          + showThresholdLabels  = false
                          + showThresholdMarkers = true
                        }
                      + targets     = [
                          + {
                              + expr         = "(sum(increase(node_network_receive_bytes_total{site=\"bgy\",device=\"eno1\"}[$__range])) + sum(increase(node_network_transmit_bytes_total{site=\"bgy\",device=\"eno1\"}[$__range]))) / 25000000000000.0 * 100"
                              + instant      = true
                              + legendFormat = ""
                              + refId        = "A"
                            },
                        ]
                      + title       = "% of 25 TB Limit"
                      + type        = "gauge"
                    },
                  + {
                      + datasource  = {
                          + type = "prometheus"
                          + uid  = "grafanacloud-prom"
                        }
                      + fieldConfig = {
                          + defaults  = {
                              + color      = {
                                  + mode = "thresholds"
                                }
                              + mappings   = []
                              + thresholds = {
                                  + mode  = "absolute"
                                  + steps = [
                                      + {
                                          + color = "green"
                                          + value = null
                                        },
                                      + {
                                          + color = "yellow"
                                          + value = 17499999999999.998
                                        },
                                      + {
                                          + color = "orange"
                                          + value = 22500000000000
                                        },
                                      + {
                                          + color = "red"
                                          + value = 23750000000000
                                        },
                                    ]
                                }
                              + unit       = "decbytes"
                            }
                          + overrides = []
                        }
                      + gridPos     = {
                          + h = 4
                          + w = 8
                          + x = 8
                          + y = 1
                        }
                      + options     = {
                          + colorMode     = "background"
                          + graphMode     = "none"
                          + orientation   = "auto"
                          + reduceOptions = {
                              + calcs  = [
                                  + "lastNotNull",
                                ]
                              + fields = ""
                              + values = false
                            }
                          + textMode      = "auto"
                        }
                      + targets     = [
                          + {
                              + expr         = "sum(increase(node_network_receive_bytes_total{site=\"bgy\",device=\"eno1\"}[$__range])) + sum(increase(node_network_transmit_bytes_total{site=\"bgy\",device=\"eno1\"}[$__range]))"
                              + instant      = true
                              + legendFormat = ""
                              + refId        = "A"
                            },
                        ]
                      + title       = "Used This Month"
                      + type        = "stat"
                    },
                  + {
                      + datasource  = {
                          + type = "prometheus"
                          + uid  = "grafanacloud-prom"
                        }
                      + fieldConfig = {
                          + defaults  = {
                              + color      = {
                                  + mode = "thresholds"
                                }
                              + mappings   = []
                              + thresholds = {
                                  + mode  = "absolute"
                                  + steps = [
                                      + {
                                          + color = "red"
                                          + value = null
                                        },
                                      + {
                                          + color = "orange"
                                          + value = 1250000000000
                                        },
                                      + {
                                          + color = "yellow"
                                          + value = 2500000000000
                                        },
                                      + {
                                          + color = "green"
                                          + value = 7500000000000
                                        },
                                    ]
                                }
                              + unit       = "decbytes"
                            }
                          + overrides = []
                        }
                      + gridPos     = {
                          + h = 4
                          + w = 8
                          + x = 16
                          + y = 1
                        }
                      + options     = {
                          + colorMode     = "background"
                          + graphMode     = "none"
                          + orientation   = "auto"
                          + reduceOptions = {
                              + calcs  = [
                                  + "lastNotNull",
                                ]
                              + fields = ""
                              + values = false
                            }
                          + textMode      = "auto"
                        }
                      + targets     = [
                          + {
                              + expr         = "25000000000000.0 - (sum(increase(node_network_receive_bytes_total{site=\"bgy\",device=\"eno1\"}[$__range])) + sum(increase(node_network_transmit_bytes_total{site=\"bgy\",device=\"eno1\"}[$__range])))"
                              + instant      = true
                              + legendFormat = ""
                              + refId        = "A"
                            },
                        ]
                      + title       = "Remaining Budget"
                      + type        = "stat"
                    },
                  + {
                      + datasource  = {
                          + type = "prometheus"
                          + uid  = "grafanacloud-prom"
                        }
                      + fieldConfig = {
                          + defaults  = {
                              + color      = {
                                  + mode = "thresholds"
                                }
                              + mappings   = []
                              + thresholds = {
                                  + mode  = "absolute"
                                  + steps = [
                                      + {
                                          + color = "blue"
                                          + value = null
                                        },
                                    ]
                                }
                              + unit       = "decbytes"
                            }
                          + overrides = []
                        }
                      + gridPos     = {
                          + h = 4
                          + w = 16
                          + x = 8
                          + y = 5
                        }
                      + options     = {
                          + colorMode     = "background"
                          + graphMode     = "none"
                          + orientation   = "auto"
                          + reduceOptions = {
                              + calcs  = [
                                  + "lastNotNull",
                                ]
                              + fields = ""
                              + values = false
                            }
                          + textMode      = "auto"
                        }
                      + targets     = [
                          + {
                              + expr         = "(sum(increase(node_network_receive_bytes_total{site=\"bgy\",device=\"eno1\"}[$__range])) + sum(increase(node_network_transmit_bytes_total{site=\"bgy\",device=\"eno1\"}[$__range]))) / ($__range_s / 86400)"
                              + instant      = true
                              + legendFormat = ""
                              + refId        = "A"
                            },
                        ]
                      + title       = "Daily Average"
                      + type        = "stat"
                    },
                  + {
                      + collapsed = false
                      + gridPos   = {
                          + h = 1
                          + w = 24
                          + x = 0
                          + y = 9
                        }
                      + panels    = []
                      + title     = "Traffic Rate"
                      + type      = "row"
                    },
                  + {
                      + datasource  = {
                          + type = "prometheus"
                          + uid  = "grafanacloud-prom"
                        }
                      + fieldConfig = {
                          + defaults  = {
                              + custom = {
                                  + fillOpacity  = 10
                                  + gradientMode = "none"
                                  + lineWidth    = 1
                                  + spanNulls    = false
                                }
                              + unit   = "binBps"
                            }
                          + overrides = []
                        }
                      + gridPos     = {
                          + h = 8
                          + w = 24
                          + x = 0
                          + y = 10
                        }
                      + options     = {
                          + legend  = {
                              + calcs       = [
                                  + "lastNotNull",
                                  + "max",
                                ]
                              + displayMode = "table"
                              + placement   = "bottom"
                            }
                          + tooltip = {
                              + mode = "multi"
                              + sort = "desc"
                            }
                        }
                      + targets     = [
                          + {
                              + expr         = "rate(node_network_receive_bytes_total{site=\"bgy\",device=\"eno1\"}[1h])"
                              + legendFormat = "↓ rx"
                              + refId        = "A"
                            },
                          + {
                              + expr         = "rate(node_network_transmit_bytes_total{site=\"bgy\",device=\"eno1\"}[1h])"
                              + legendFormat = "↑ tx"
                              + refId        = "B"
                            },
                        ]
                      + title       = "Inbound / Outbound Rate (eno1)"
                      + type        = "timeseries"
                    },
                ]
              + refresh       = "5m"
              + schemaVersion = 38
              + tags          = [
                  + "proxmox",
                  + "rabbit",
                  + "network",
                  + "bandwidth",
                ]
              + templating    = {
                  + list = []
                }
              + time          = {
                  + from = "now/M"
                  + to   = "now"
                }
              + timepicker    = {}
              + timezone      = "browser"
              + title         = "rabbit-01-psp — Network Bandwidth"
              + uid           = "pve-rabbit-netbw"
            }
        )
      + dashboard_id = (known after apply)
      + folder       = "proxmox"
      + id           = (known after apply)
      + uid          = (known after apply)
      + url          = (known after apply)
      + version      = (known after apply)
    }

  # grafana_folder.proxmox will be created
  + resource "grafana_folder" "proxmox" {
      + id                           = (known after apply)
      + prevent_destroy_if_not_empty = false
      + title                        = "proxmox"
      + uid                          = "proxmox"
      + url                          = (known after apply)
    }

Plan: 2 to add, 0 to change, 0 to destroy.

─────────────────────────────────────────────────────────────────────────────

Note: You didn't use the -out option to save this plan, so Terraform can't
guarantee to take exactly these actions if you run "terraform apply" now.

Pusher: @dark-vex, Action: pull_request, Working Directory: terraform/grafana, Workflow: TF Grafana Dashboards

eno1 is the internet-facing interface on rabbit-01-psp that counts
toward the 25 TB housing quota. Scoping all six PromQL expressions to
{device="eno1"} makes the intent explicit and prevents spurious data
if additional interfaces are ever added to the host.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
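The device scoping described in this commit amounts to adding one label matcher to every selector. A minimal Python sketch of building the month-total expression that way is shown below. The helper names are hypothetical and do not come from generate_dashboards.py; the output is meant to match the "Used This Month" query in the Terraform plan above.

```python
def selector(site, device):
    """Build a PromQL label selector scoped to one site and one interface."""
    return f'{{site="{site}",device="{device}"}}'


def month_total_expr(site, device, window="$__range"):
    """Combined rx + tx byte increase over the dashboard's time range."""
    sel = selector(site, device)
    rx = f"sum(increase(node_network_receive_bytes_total{sel}[{window}]))"
    tx = f"sum(increase(node_network_transmit_bytes_total{sel}[{window}]))"
    return f"{rx} + {tx}"


expr = month_total_expr("bgy", "eno1")
print(expr)
```

Centralising the selector in one function means a second billable interface would only require changing the matcher (for example to a regex match) in a single place.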