
Contents

  • hpc_update_kernel
  • hpc_update_all_packages
  • Azure-specific packages
  • hpc_install_cuda_driver
  • hpc_install_hpc_nvidia_nccl
  • hpc_install_nvidia_fabric_manager
  • hpc_install_nvidia_imex
  • hpc_install_nvidia_dcgm
  • hpc_install_rdma
  • hpc_enable_azure_persistent_rdma_naming
  • hpc_azure_disable_predictable_net_names
  • hpc_install_system_openmpi
  • hpc_build_openmpi_w_nvidia_gpu_support
  • hpc_install_nvidia_container_toolkit
  • hpc_install_docker
  • hpc_docker_subnet
  • hpc_install_moneo
  • hpc_install_diagnostics
  • hpc_install_kvp_client
  • hpc_install_azurehpc_health_checks
  • Variables for Configuring Tuning for HPC Workloads
  • hpc_usrlv_size
  • hpc_usrlv_mount
  • hpc_varlv_name
  • hpc_varlv_size
  • hpc_varlv_mount
  • Example Playbook for Configuring Storage

    hpc_update_all_packages

    Whether to update all packages. This is set to false by default.

    Default: false

    Type: bool

    Azure-specific packages

    When running on Azure systems, the role automatically installs Azure platform packages, e.g. VM management infrastructure and storage utilities.

    WALinuxAgent: the Azure Linux Agent manages Linux provisioning and VM interaction with the Azure Fabric Controller.

    aznfs: the Azure NFS mount helper is an Azure-optimized NFS client that simplifies mounting Azure Blob Storage containers over NFS v3 and applies client-side optimizations for improved performance. The package is installed from the Microsoft Production repository with non-interactive installation mode enabled. For more information, see https://github.com/Azure/AZNFS-mount.

    hpc_install_cuda_driver

    Whether to install the CUDA Driver package.

    Default: true


    hpc_install_hpc_nvidia_nccl

    nvidia-fabricmanager service.

    Default: true

    Type: bool

    hpc_install_nvidia_imex

    Whether to install NVIDIA IMEX (nvidia-imex) and enable nvidia-imex.service.

    Note: This role installs and enables the nvidia-imex service but does not start it immediately. The service is configured to launch at boot only on compatible multi-node NVLink switch-fabric systems, such as NVIDIA GB200 or GB300 (NVL72) racks.

    Default: true

    Type: bool

    hpc_install_nvidia_dcgm

    Whether to install the NVIDIA Data Center GPU Manager (DCGM) and enable its nvidia-dcgm service.

    NVIDIA DCGM is a GPU monitoring and management toolkit for large-scale GPU deployments. Install DCGM on all GPU nodes in an HPC cluster to maintain reliability and monitor GPU health.

    Run dcgmi on the GPU nodes, e.g. dcgmi discovery -l to list the GPUs on a node.

    Default: true

    Type: bool
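    As a sketch, the GPU monitoring pieces above can be toggled independently. For example, a playbook could keep DCGM but opt out of IMEX on systems without a multi-node NVLink fabric (both variables default to true; this combination is an illustration, not a recommended configuration):

```yaml
# Sketch: enable DCGM monitoring but skip IMEX on single-node GPU systems.
- name: Configure GPU monitoring for HPC nodes
  hosts: localhost
  vars:
    hpc_install_nvidia_dcgm: true    # installs DCGM and enables the nvidia-dcgm service
    hpc_install_nvidia_imex: false   # IMEX is only useful on NVLink switch-fabric racks
  roles:
    - linux-system-roles.hpc
```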

    hpc_install_rdma

    Whether to install the NVIDIA RDMA package.

    Default: true

    Type: bool

    hpc_enable_azure_persistent_rdma_naming

    Whether to configure a persistent RDMA device naming scheme on Azure.

    This is automatically skipped on non-Azure systems.

    Default: true

    Type: bool

    hpc_azure_disable_predictable_net_names

    Whether to disable predictable network interface names by adding net.ifnames=0 to the kernel command line (via the bootloader system role).

    This keeps kernel names such as ib0, ib1, ... instead of ibP... on IPoIB, but it also affects Ethernet naming (e.g. eth0 instead of enP...).

    Default: true

    Type: bool
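    For instance, to keep persistent RDMA device naming while retaining predictable interface names (i.e. not adding net.ifnames=0), the two variables can be set together — a minimal sketch:

```yaml
# Sketch: keep persistent RDMA device names but leave predictable
# Ethernet/IPoIB interface naming (enP.../ibP...) untouched.
- name: Configure RDMA and interface naming on Azure
  hosts: localhost
  vars:
    hpc_enable_azure_persistent_rdma_naming: true
    hpc_azure_disable_predictable_net_names: false
  roles:
    - linux-system-roles.hpc
```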

    hpc_install_system_openmpi

    Whether to install OpenMPI that comes from AppStream repositories and does not have Nvidia GPU support.



    hpc_install_hpc_nvidia_nccl: true

    Default: true

    Type: bool

    hpc_install_nvidia_container_toolkit

    Whether to install and configure the NVIDIA Container Toolkit.

    This enables GPU support in Docker and containerd by installing the nvidia-container-toolkit package. Note that enabling this variable automatically sets hpc_install_docker: true unless you explicitly override it.

    Default: true

    Type: bool

    hpc_install_docker

    Whether to install the moby-engine and moby-cli packages and enable the Docker service. To explicitly disable Docker even when using the NVIDIA Container Toolkit, set this to false; note that the role will fail unless you also disable hpc_install_nvidia_container_toolkit.

    Default: "{{ hpc_install_nvidia_container_toolkit }}"

    Type: bool
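    Because hpc_install_docker defaults to the value of hpc_install_nvidia_container_toolkit, disabling Docker requires disabling both variables together — a minimal sketch:

```yaml
# Sketch: disable the container stack entirely. Disabling only
# hpc_install_docker while leaving the toolkit enabled makes the role fail.
- name: Apply the HPC role without Docker or the NVIDIA Container Toolkit
  hosts: localhost
  vars:
    hpc_install_nvidia_container_toolkit: false
    hpc_install_docker: false
  roles:
    - linux-system-roles.hpc
```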

    hpc_docker_subnet

    The default Docker bridge interface address and subnet configuration of 172.17.0.1/16 conflicts with the subnets Azure CycleCloud uses for internal physical cluster networks.

    To avoid this conflict with the Azure CycleCloud networks, the role configures the Docker interface with a 10.88.0.1/16 address and subnet. If this is inappropriate for the cluster being deployed, the subnet can be customised to any private subnet using this variable.

    Default: 10.88.0.1/16

    Type: string
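    A sketch of overriding the bridge subnet; the 192.168.100.1/24 value below is purely illustrative — any private subnet that does not collide with the cluster's networks works:

```yaml
# Sketch: move the Docker bridge off 10.88.0.1/16 when that range
# overlaps with networks already used by the cluster.
- name: Apply the HPC role with a custom Docker bridge subnet
  hosts: localhost
  vars:
    hpc_docker_subnet: 192.168.100.1/24   # hypothetical non-conflicting private subnet
  roles:
    - linux-system-roles.hpc
```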

    hpc_install_moneo

    Whether to install the Azure Moneo monitoring tool.

    Moneo is a distributed GPU system monitor for AI training and inferencing clusters. It collects GPU telemetry and supports integration with Azure Monitor.

    The role installs Moneo to /opt/hpc/azure/tools/Moneo and adds a moneo alias to /etc/bashrc for easy access.

    For more information, see https://github.com/Azure/Moneo.

    hpc_install_diagnostics

    +

    Whether to install the Azure HPC Diagnostics tool.

    +

    The Azure HPC Diagnostics tool gathers system information for triage +and debugging purposes. It collects information and state from the +hardware, OS, Azure environment and installed applications, then +packages it into a tarball to simplify the process of system support and +bug triage.

    +

    To gather diagnostics, run:

    +
    /opt/hpc/azure/tools/gather_azhpc_vm_diagnostics.sh
    +

    The script will indicate where the tarball containing the diagnostic +information can be found.

    +

    For more information, see https://github.com/Azure/azhpc-diagnostics/

    +

    Default: true

    +

    Type: bool

    hpc_install_kvp_client

    Whether to install the Azure KVP (Key-Value Pair) client.

    The KVP client is a tool for reading and writing key-value pairs from the Azure host to the guest VM. It is compiled from source and installed to /opt/hpc/azure/tools/kvp_client.

    This tool is Azure-specific and should only be installed on Azure platforms.

    Default: true

    Type: bool

    hpc_install_azurehpc_health_checks

    Whether to install and configure Azure HPC Health Checks (AZNHC).

    This downloads the azurehpc-health-checks toolkit, configures it for the target GPU platform, and pulls the appropriate Docker container image from MCR. The health checks validate HPC components including GPUs, InfiniBand, storage, and MPI operations. For more information, see https://github.com/Azure/azurehpc-health-checks.

    The role installs the toolkit in /opt/hpc/azure/tests/azurehpc-health-checks/ and pulls mcr.microsoft.com/aznhc/aznhc-nv:latest.

    Note that the NVIDIA Container Toolkit must be installed, and at least 20G of free space in /var is required for the first-time download of the aznhc-nv docker image. If the image does not exist and /var has insufficient space, installation will be skipped with a warning. See Expand virtual hard disks on a Linux VM for disk expansion details.

    Default: true

    Type: bool
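    Since AZNHC depends on the NVIDIA Container Toolkit (which in turn pulls in Docker), a health-check-enabled configuration can be sketched as follows — both variables already default to true, so this only makes the dependency explicit:

```yaml
# Sketch: install AZNHC together with its container runtime prerequisites.
# Remember that the first image pull needs at least 20G free in /var.
- name: Apply the HPC role with Azure HPC Health Checks
  hosts: localhost
  vars:
    hpc_install_nvidia_container_toolkit: true   # implies hpc_install_docker: true
    hpc_install_azurehpc_health_checks: true
  roles:
    - linux-system-roles.hpc
```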

    Variables for Configuring Tuning for HPC Workloads

    hpc_tuning



    Default: true

    Type: bool

    hpc_sku_customisation

    Whether to install the hardware tuning files for different Azure VM types (SKUs).

    This installs definitions of optimal hardware configurations for the different types of high-performance VMs that are typically used for HPC workloads in the Azure environment. These include InfiniBand, GPU/NVLink and NCCL customisations, as well as any workarounds that may be needed for specific hardware problems.

    Default: true

    Type: bool

    Variables for Configuring How Role Reboots Managed Nodes


    hpc_reboot_ok

    Type: bool

    Example Playbook for Configuring Packages

    - name: Configure my virtual machine for HPC
      hosts: localhost
      vars:
        hpc_install_cuda_driver: true
        hpc_install_cuda_toolkit: true
        hpc_install_hpc_nvidia_nccl: true
        hpc_install_nvidia_fabric_manager: true
        hpc_install_rdma: true
        hpc_install_system_openmpi: true
        hpc_build_openmpi_w_nvidia_gpu_support: true
      roles:
        - linux-system-roles.hpc

    Variables for Configuring Firewall

    hpc_manage_firewall



    Type: bool

    Variables for Configuring Storage

    By default, the role ensures that rootlv, usrlv and varlv in Azure have enough storage for packages to be installed. You can use the variables described in this section to control the exact sizes and paths.

    hpc_manage_storage

    Whether to configure the VG from hpc_rootvg_name to have logical volumes hpc_rootlv_name, hpc_usrlv_name and hpc_varlv_name with the indicated sizes, mounted to the indicated mount points.

    When enabled, it will also automatically handle disk expansion by resizing partitions (via growpart) and physical volumes (via pvresize), as well as logical volumes.

    Note that the role does not configure an exact size, but ensures that the size is at least as indicated; i.e., the role won't shrink logical volumes.


    hpc_manage_storage

    Type: bool

    hpc_rootvg_name

    Name of the root volume group to use. The role configures logical volumes hpc_rootlv_name, hpc_usrlv_name and hpc_varlv_name to extend them to the size required to install HPC packages.

    Default: rootvg

    Type: string


    hpc_usrlv_mount

    logical volume to configure.

    Default: /usr

    Type: string

    hpc_varlv_name

    Name of the var logical volume to use.

    Default: varlv

    Type: string

    hpc_varlv_size

    The size of the hpc_varlv_name logical volume to configure.

    Note that the role does not configure an exact size, but ensures that the size is at least as indicated; i.e., the role won't shrink logical volumes if the current size is larger than the value of this variable.

    Default: 10G

    Type: string

    hpc_varlv_mount

    Mount point of the hpc_varlv_name logical volume to configure.

    Default: /var

    Type: string

    Example Playbook for Configuring Storage

    - name: Configure my virtual machine for HPC
      hosts: localhost
      vars:
        hpc_manage_storage: true
        hpc_rootvg_name: rootvg
        hpc_rootlv_name: rootlv
        hpc_rootlv_size: 10G
        hpc_rootlv_mount: /
        hpc_usrlv_name: usrlv
        hpc_usrlv_size: 20G
        hpc_usrlv_mount: /usr
        hpc_varlv_name: varlv
        hpc_varlv_size: 10G
        hpc_varlv_mount: /var
      roles:
        - linux-system-roles.hpc

    Variables Exported by the Role

    hpc_reboot_needed

    Default false - if true, this means a reboot is needed to apply the changes made by the role.

    Example Playbooks

    Run the role to configure storage, install all packages, and reboot if needed.

    - name: Configure my virtual machine for HPC
      hosts: localhost
      vars:
        hpc_manage_storage: true
        hpc_rootvg_name: rootvg
        hpc_rootlv_name: rootlv
        hpc_rootlv_size: 10G
        hpc_rootlv_mount: /
        hpc_usrlv_name: usrlv
        hpc_usrlv_size: 20G
        hpc_usrlv_mount: /usr
        hpc_varlv_name: varlv
        hpc_varlv_size: 10G
        hpc_varlv_mount: /var

        hpc_install_cuda_driver: true
        hpc_install_cuda_toolkit: true
        hpc_install_hpc_nvidia_nccl: true
        hpc_install_nvidia_fabric_manager: true
        hpc_install_rdma: true
        hpc_install_system_openmpi: true
        hpc_build_openmpi_w_nvidia_gpu_support: true

        hpc_reboot_ok: true
      roles:
        - linux-system-roles.hpc
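    When hpc_reboot_ok is left at its false default, the exported hpc_reboot_needed variable can be checked after the role runs instead. A sketch using the standard ansible.builtin.reboot module (the post_tasks structure here is an assumption about how a caller might consume the exported variable, not part of the role itself):

```yaml
# Sketch: let the role report, rather than perform, the reboot,
# then reboot explicitly in post_tasks if the role requested one.
- name: Configure for HPC and reboot manually if required
  hosts: localhost
  roles:
    - linux-system-roles.hpc
  post_tasks:
    - name: Reboot if the role indicated it is needed
      ansible.builtin.reboot:
      when: hpc_reboot_needed | d(false)
```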

    rpm-ostree

    See README-ostree.md

    License

Changelog
=========

[0.4.0] - 2026-03-23
--------------------

### New Features

- feat: Moneo monitoring tool package (#46)
- feat: Installing Moby container runtime and NVIDIA Container Toolkit (#47)
- feat: add variables for azure resources and tools (#48)
- feat: SKU customisations (#49)
- feat: add expanding rootvg-varlv size function (#51)
- feat: Install and configure Azure HPC Health Checks (#52)
- feat: RDMA naming infra changes (#67)
- feat: refine hpc_tuning and add additional tunings (#70)
- feat: add AZNFS mount helper installation (#72)
- feat: install the Azure HPC Diagnostics script (#76)
- feat: add support for disk partition expansion and PV resize (#80)
- feat: install __hpc_base_packages early via dedicated task (#83)
- feat: gate NVIDIA IMEX enablement to GB200/GB300 NVLink systems (#85)
- feat: Add NVIDIA DCGM installation (#100)

### Bug Fixes

- fix: Change installation path/location for moneo tool (#54)
- fix: fix added for moneo install path (#59)
- fix: address ansible-lint issues in Azure health check PR #52 (#63)
- fix: change the condition about lv expansion to use integer comparison (#66)
- fix: change nvidia-container-toolkit repo and remove version lock (#68)
- fix: do not pull in OFED IB drivers for the persistent naming monitor (#71)
- fix: __MOCK_SKU is uninitialised when run from init services (#74)
- fix: CI fails tests because /var is too small (#75)
- fix: versionlock kernel-devel-matched to prevent depsolve errors (#79)
- fix: Don't try to configure WAAgent in non-Azure environments (#81)
- fix: sku_customisation.service file should not be executable (#84)
- fix: use an alternate subnet for the docker bridge network (#90)
- fix: run azure-specific installation after resource path created (#91)
- fix: correct typo in service running test (#92)
- fix: moneo test-script fixes (#95)
- fix: install cuda-toolkit-config-common-12.9.79-1 with cuda-toolkit 12 (#97)
- fix: install RDMA test script after azure specific resource path created (#98)
- fix: add opt-in net.ifnames=0 for Azure images (#101)
- fix: resolve nvidia-persistenced service failure issue on race condition (#102)
- fix: prevent Azure-specific tasks from running on non-Azure platforms (#104)
- fix: replace unsupported patch module with patch command (#105)

### Other Changes

- refactor: handle INJECT_FACTS_AS_VARS=false by using ansible_facts instead (#44)
- ci: use ANSIBLE_INJECT_FACT_VARS=false by default for testing (#45)
- test: SKU customisations (#50)
- test: Added Testcases for testing moneo tool (#53)
- test: skip hpc_install_nvidia_fabric_manager in skip_toolkit test (#55)
- test: do not install moneo (#57)
- ci: bump ansible/ansible-lint from 25 to 26 (#58)
- build: Add a hidden collection directory to be used for building RPM (#60)
- ci: skip most CI checks if title contains citest skip [citest_skip] (#61)
- chore: Update nvidia-driver and fabricmanager to 580 (#62)
- ci: ansible-lint - remove .collection directory from converted collection [citest_skip] (#65)
- test: add Azure health check test script for basic validation (#69)
- ci: tox-lsr version 3.15.0 [citest_skip] (#73)
- test: Added RDMA validation script for waagent, ibverbs tools, and Azure persistent naming (#77)
- ci: Add Fedora 43, remove Fedora 41 from Testing Farm CI (#78)
- ci: Ansible version must be string, not float [citest_skip] (#82)
- test: add test script for aznfs package (#86)
- ci: bump actions/upload-artifact from 6 to 7 (#88)
- test: add testing Nvidia docker container script (#89)
- test: add validation for hpc tuning (#93)
- ci: tox-lsr 3.16.0 - fix qemu tox test failures - rename to qemu-ansible-core-X-Y [citest_skip] (#94)
- ci: tox-lsr 3.17.0 - container test improvements, use ansible 2.20 for fedora 43 [citest_skip] (#96)
- ci: tox-lsr 3.17.1 - previous update broke container tests, this fixes them [citest_skip] (#99)
- tests: add diagnostics installation validation script (#103)
- test: remove redundant tuning tests from tests_skip_toolkit.yml (#106)

[0.3.2] - 2026-01-06
--------------------