diff --git a/.README.html b/.README.html index e20f913..af9993e 100644 --- a/.README.html +++ b/.README.html @@ -156,6 +156,8 @@
false by default.
Default: false
Type: bool
When running on Azure systems, the role automatically installs Azure +platform packages, e.g. VM management infrastructure and storage +utilities.
+WALinuxAgent: Azure Linux Agent manages Linux +provisioning and VM interaction with the Azure Fabric Controller.
+aznfs: Azure NFS mount helper is Azure-optimized NFS +client that simplifies mounting Azure Blob Storage containers over NFS +v3 and applies client-side optimizations for improved performance. The +package is installed from the Microsoft Production repository with +non-interactive installation mode enabled. For more information, see https://github.com/Azure/AZNFS-mount.
Whether to install the CUDA Driver package.
Default: true
Default: true
Type: bool
Whether to install NVIDIA IMEX (nvidia-imex) and enable
+nvidia-imex.service.
Note: "This role installs and enables the nvidia-imex service but +does not start it immediately. The service is configured to launch at +boot only on compatible multi-node NVLink switch-fabric systems, such as +NVIDIA GB200 or GB300 (NVL72) racks."
+Default: true
Type: bool
Whether to install the NVIDIA datacenter GPU manager(DCGM) and enable +its nvidia-dcgm service.
+NVIDIA DCGM is a GPU monitoring and management toolkit for +large-scale GPU deployments, install DCGM on all GPU nodes in an HPC +cluster to maintain reliability and monitor GPU health.
+Run dcgmi in the GPU nodes, e.g.
+dcgmi discovery -l to list GPUs on the node.
Default: true
Type: bool
Whether to install the NVIDIA RDMA package.
Default: true
Type: bool
Whether to configure a persistent RDMA device naming scheme on +Azure:
+/usr/sbin/azure_persistent_rdma_naming.shazure_persistent_rdma_naming.serviceThis is automatically skipped on non-Azure systems.
+Default: true
Type: bool
Whether to disable predictable network interface names by adding
+net.ifnames=0 to the kernel command line (via the
+bootloader system role).
This keeps kernel names such as ib0, ib1,
+... instead of ibP... on IPoIB, but it also affects
+Ethernet naming (e.g. eth0 instead of
+enP...).
Default: true
Type: bool
Whether to install OpenMPI that comes from AppStream repositories and does not have Nvidia GPU support.
@@ -344,6 +432,89 @@Default: true
Type: bool
Whether to install and configure NVIDIA Container Toolkit.
+This enables GPU support in Docker and containerd by installing the
+nvidia-container-toolkit package. Note that enabling this variable
+automatically sets hpc_install_docker: true unless you
+explicitly override it.
Default: true
Type: bool
Whether to install the moby-engine and moby-cli packages as well as
+enable the Docker service. To explicitly disable Docker even when using
+the NVIDIA Container Toolkit, you need to set this to
+false, please note that the role will fail unless you also
+disable hpc_install_nvidia_container_toolkit.
Default:
+"{{ hpc_install_nvidia_container_toolkit }}"
Type: bool
The default docker bridge interface address and subnet configuration +of 172.17.0.1/16 conflicts with the subnets Azure CycleCloud uses for +internal physical cluster networks.
+To avoid this conflict with the Azure CycleCloud networks, the system +role will configure the docker interface with a 10.88.0.1/16 address and +subnet. However, if this is inappropriate for the cluster being +deployed, the subnet can be customised to any private subnet using this +variable.
+Default: 10.88.0.1/16
Type: string
Whether to install the Azure Moneo monitoring tool.
+Moneo is a distributed GPU system monitor for AI training and +inferencing clusters. It collects GPU telemetry and supports integration +with Azure Monitor.
+The role installs Moneo to /opt/hpc/azure/tools/Moneo and adds an +alias moneo to /etc/bashrc for easy access.
+For more information, see https://github.com/Azure/Moneo.
+Whether to install the Azure HPC Diagnostics tool.
+The Azure HPC Diagnostics tool gathers system information for triage +and debugging purposes. It collects information and state from the +hardware, OS, Azure environment and installed applications, then +packages it into a tarball to simplify the process of system support and +bug triage.
+To gather diagnostics, run:
+/opt/hpc/azure/tools/gather_azhpc_vm_diagnostics.shThe script will indicate where the tarball containing the diagnostic +information can be found.
+For more information, see https://github.com/Azure/azhpc-diagnostics/
+Default: true
Type: bool
Whether to install the Azure KVP (Key-Value Pair) client.
+The KVP client is a tool for reading and writing key-value pairs from
+the Azure host to the guest VM. It is compiled from source and installed
+to /opt/hpc/azure/tools/kvp_client.
This tool is Azure-specific and should only be installed on Azure +platforms.
+Default: true
Type: bool
Whether to install and configure Azure HPC Health Checks (AZNHC).
+This downloads the azurehpc-health-checks toolkit, configures it for +the target GPU platform, and pulls the appropriate Docker container +image from MCR. The health checks validate HPC components including +GPUs, InfiniBand, storage, and MPI operations. For more information, see +https://github.com/Azure/azurehpc-health-checks.
+The role installs the toolkit in
+/opt/hpc/azure/tests/azurehpc-health-checks/ and pulls
+mcr.microsoft.com/aznhc/aznhc-nv:latest
Note that NVIDIA Container Toolkit must be installed and at least 20G +of free space in /var is required for first-time aznhc-nv docker image +download. If the image does not exist and /var has insufficient space, +installation will be skipped with a warning. See Expand +virtual hard disks on a Linux VM for disk expansion details.
+Default: true
Type: bool
Default: true
Type: bool
Whether to install the hardware tuning files for different Azure VM +types (SKUs).
+This will install definitions for optimal hardware configurations for +the different types of high performance VMs that are typically used for +HPC workloads in the Azure environment. These include InfiniBand and +GPU/NVLink and NCCL customisations, as well as any workarounds for +specific hardware problems that may be needed.
+Default: true
Type: bool
Type: bool
- name: Configure my virtual machine for HPC
- hosts: localhost
- vars:
- hpc_install_cuda_driver: true
- hpc_install_cuda_toolkit: true
- hpc_install_hpc_nvidia_nccl: true
- hpc_install_nvidia_fabric_manager: true
- hpc_install_rdma: true
- hpc_install_system_openmpi: true
- hpc_build_openmpi_w_nvidia_gpu_support: true
- roles:
- - linux-system-roles.hpc- name: Configure my virtual machine for HPC
+ hosts: localhost
+ vars:
+ hpc_install_cuda_driver: true
+ hpc_install_cuda_toolkit: true
+ hpc_install_hpc_nvidia_nccl: true
+ hpc_install_nvidia_fabric_manager: true
+ hpc_install_rdma: true
+ hpc_install_system_openmpi: true
+ hpc_build_openmpi_w_nvidia_gpu_support: true
+ roles:
+ - linux-system-roles.hpcType: bool
By default, the role ensures that rootlv and
-usrlv in Azure has enough storage for packages to be
-installed. You can use variables described in this section to control
-the exact sizes and paths.
By default, the role ensures that rootlv,
+usrlv and varlv in Azure has enough storage
+for packages to be installed. You can use variables described in this
+section to control the exact sizes and paths.
Whether to configure the VG from hpc_rootvg_name to have logical volumes hpc_rootlv_name and hpc_usrlv_name with indicated sizes and +href="#hpc_rootlv_name">hpc_rootlv_name, hpc_usrlv_name and hpc_varlv_name with indicated sizes and mounted to indicated mount points.
+When enabled, it will also automatically handle disk expansion by
+resizing partitions (via growpart), physical volumes (via
+pvresize), as well as logical volumes.
Note that the role configures not the exact size, but ensures that the size is at least as indicated, i.e. the role won't shrink logical volumes.
@@ -437,8 +622,9 @@Type: bool
Name of the root volume group to use. The role configures logical -volumes hpc_rootlv_name and hpc_usrlv_name to extend them to the size +volumes hpc_rootlv_name, hpc_usrlv_name and hpc_varlv_name to extend them to the size required to install HPC packages.
Default: rootvg
Type: string
Default: /usr
Type: string
Name of the var logical volume to use.
Default: varlv
Type: string
The size of the hpc_varlv_name logical +volume to configure.
+Note that the role configures not the exact size, but ensures that +the size is at least as indicated, i.e. the role won't shrink logical +volumes if current size is larger than value of this variable.
+Default: 10G
Type: string
Mount point of the hpc_varlv_name +logical volume to configure.
+Default: /var
Type: string
- name: Configure my virtual machine for HPC
- hosts: localhost
- vars:
- hpc_manage_storage: true
- hpc_rootvg_name: rootvg
- hpc_rootlv_name: rootlv
- hpc_rootlv_size: 10G
- hpc_rootlv_mount: /
- hpc_usrlv_name: usrlv
- hpc_usrlv_size: 20G
- hpc_usrlv_mount: /usr
- roles:
- - linux-system-roles.hpcDefault false - if true, this means a
-reboot is needed to apply the changes made by the role.
Run the role to configure storage, install all packages, and reboot -if needed.
- name: Configure my virtual machine for HPC
hosts: localhost
@@ -512,18 +693,46 @@ Example Playbooks
hpc_usrlv_name: usrlv
hpc_usrlv_size: 20G
hpc_usrlv_mount: /usr
-
- hpc_install_cuda_driver: true
- hpc_install_cuda_toolkit: true
- hpc_install_hpc_nvidia_nccl: true
- hpc_install_nvidia_fabric_manager: true
- hpc_install_rdma: true
- hpc_install_system_openmpi: true
- hpc_build_openmpi_w_nvidia_gpu_support: true
-
- hpc_reboot_ok: true
- roles:
- - linux-system-roles.hpcDefault false - if true, this means a
+reboot is needed to apply the changes made by the role.
Run the role to configure storage, install all packages, and reboot +if needed.
+- name: Configure my virtual machine for HPC
+ hosts: localhost
+ vars:
+ hpc_manage_storage: true
+ hpc_rootvg_name: rootvg
+ hpc_rootlv_name: rootlv
+ hpc_rootlv_size: 10G
+ hpc_rootlv_mount: /
+ hpc_usrlv_name: usrlv
+ hpc_usrlv_size: 20G
+ hpc_usrlv_mount: /usr
+ hpc_varlv_name: varlv
+ hpc_varlv_size: 10G
+ hpc_varlv_mount: /var
+
+ hpc_install_cuda_driver: true
+ hpc_install_cuda_toolkit: true
+ hpc_install_hpc_nvidia_nccl: true
+ hpc_install_nvidia_fabric_manager: true
+ hpc_install_rdma: true
+ hpc_install_system_openmpi: true
+ hpc_build_openmpi_w_nvidia_gpu_support: true
+
+ hpc_reboot_ok: true
+ roles:
+ - linux-system-roles.hpcSee README-ostree.md