Nested Proxmox integration testing for DR validation #167

@cwage

Background

From #113 (Phase 2.4): we want to validate that tofu apply + Ansible actually work against a clean Proxmox instance, without requiring literal new hardware. Nested virtualization makes this possible -- run a test Proxmox VE instance as a VM on pve1 and exercise the full IaC stack against it.

This is aspirational/low-priority, but would be a high-confidence way to validate the DR rebuild path periodically.

Approach

Run Proxmox VE as a nested VM on pve1 (Ryzen 1800X supports kvm_amd nested=1). The nested VMs don't need to run real workloads -- they just need to exist long enough to prove the automation converges.

What it would validate

  • tofu apply successfully creates VMs against a clean Proxmox
  • VMs come up with correct specs (CPU, RAM, disk, network)
  • Ansible can SSH in and configure them (roles converge on fresh Debian)
  • NFS mount setup works (even against a stub NFS server on the test network)
  • The full rebuild sequence documented in #113 (Backup strategy and disaster recovery planning), Phase 2, actually works
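
The "correct specs" check in the list above could be sketched as a plain comparison between what the tofu config declares and what the test Proxmox reports. This is only a sketch: `diff_specs` is a hypothetical helper, the field names (`cores`, `memory`, `net0`) are assumed to mirror keys in a Proxmox VM config, and the `expected`/`actual` dicts are made-up values that would in practice come from the tofu variables and the test instance's API.

```python
def diff_specs(expected: dict, actual: dict) -> list[str]:
    """Return human-readable mismatches; an empty list means the VM matches."""
    problems = []
    for key, want in expected.items():
        if key not in actual:
            problems.append(f"{key}: missing (expected {want!r})")
        elif actual[key] != want:
            problems.append(f"{key}: expected {want!r}, got {actual[key]!r}")
    return problems


# Hypothetical values for illustration only.
expected = {"cores": 2, "memory": 2048, "net0": "vmbr1"}
actual = {"cores": 2, "memory": 1024}

for line in diff_specs(expected, actual):
    print(line)
```

Keeping the comparison a pure function means the same check can be run against a saved API response, without a live test instance.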

Implementation sketch

  1. Enable nested virt on pve1 -- options kvm_amd nested=1 kernel module param
  2. Automated Proxmox install -- Proxmox supports answer files for unattended installs. Script the ISO boot + install into a VM.
  3. Isolated test network -- separate bridge or VLAN so test DHCP/DNS doesn't stomp production
  4. Parameterize tofu -- target either prod or test Proxmox API endpoint (probably just a tfvars override)
  5. Run tofu + Ansible -- apply against test instance, run playbooks against nested VMs
  6. Validate -- assert VMs exist, services configured, mounts present
  7. Tear down -- destroy the test Proxmox VM and reclaim resources
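
Step 1 is easy to get wrong silently (the module option only takes effect after the module is reloaded), so a preflight check before booting the test VM may be worth having. A minimal sketch, assuming it runs on pve1 itself and that the parameter is exposed under the usual sysfs path; `parse_nested` and `nested_enabled` are hypothetical helper names:

```python
from pathlib import Path

# sysfs location of the AMD KVM module parameter (Intel hosts use kvm_intel).
NESTED_PARAM = Path("/sys/module/kvm_amd/parameters/nested")


def parse_nested(raw: str) -> bool:
    """Interpret the parameter value; it reads as 1/0 or Y/N depending on module and kernel."""
    return raw.strip() in {"1", "Y", "y"}


def nested_enabled(param: Path = NESTED_PARAM) -> bool:
    """True if the module is loaded and nested virtualization is switched on."""
    try:
        return parse_nested(param.read_text())
    except OSError:
        # Module not loaded, or not running on the Proxmox host at all.
        return False


if __name__ == "__main__":
    print("nested virt enabled:", nested_enabled())
```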

Resource budget

pve1 has 64GB RAM, ~10.5GB allocated to prod VMs. A test Proxmox VM + a couple tiny nested VMs could fit in ~8-12GB. Tight but workable if not running alongside peak workloads. Performance will be slow (nested virt) but correctness is what matters, not speed.
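
A quick way to sanity-check the budget before starting the test VM, using the figures above as inputs. The `host_reserve` default is a guess, not a measured number, and `headroom_gb` is a hypothetical helper:

```python
def headroom_gb(total: float, prod_allocated: float, test_needed: float,
                host_reserve: float = 4.0) -> float:
    """RAM left after prod VMs, the test stack, and a reserve for the host itself."""
    return total - prod_allocated - test_needed - host_reserve


# Figures quoted above, using the upper 12 GB estimate as the worst case.
remaining = headroom_gb(total=64, prod_allocated=10.5, test_needed=12)
print(f"headroom: {remaining} GB")
```

Allocation figures don't capture everything the host uses at peak, so a positive result is a necessary condition rather than a guarantee.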

Lighter-weight alternatives to consider

  • Molecule for Ansible roles -- test roles against Docker containers instead of real VMs. Easier to set up, could run in CI, but doesn't exercise the Proxmox/tofu layer.
  • tofu plan only -- validates config syntax and provider connectivity without applying. Catches drift but doesn't prove apply works.
  • On-demand cloud host -- e.g. Hetzner dedicated with nested virt (~$40/mo), spun up for monthly DR tests, torn down after. Fully isolated from homelab, proves "hotel room recovery" works.
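
For the plan-only option, `tofu plan -detailed-exitcode` distinguishes "no changes" from "changes pending" via its exit status (0 = empty diff, 1 = error, 2 = non-empty diff), which is enough for a scheduled drift check. A minimal sketch, assuming `tofu` is on the PATH and the working directory is already initialized; `classify_plan` and `run_plan` are hypothetical names:

```python
import subprocess


def classify_plan(returncode: int) -> str:
    """Map the -detailed-exitcode status to a short label."""
    if returncode == 0:
        return "clean"   # plan succeeded, nothing to change
    if returncode == 2:
        return "drift"   # plan succeeded, changes are pending
    return "error"       # plan itself failed (syntax, provider, auth, ...)


def run_plan(workdir: str) -> str:
    """Run a non-interactive plan in workdir and classify the result."""
    result = subprocess.run(
        ["tofu", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
    )
    return classify_plan(result.returncode)
```

As noted above, this only shows that the config parses and the provider is reachable; it says nothing about whether apply would succeed.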

Open questions

  • How to handle the Proxmox API user/token for the test instance (hardcoded test creds? separate OpenBao path?)
  • Whether to simulate NFS with a local stub or skip mount validation
  • Whether this should be a manual "run quarterly" thing or fully automated in CI
