Commit 195b8b5

feat: Add NVIDIA DCGM installation
NVIDIA DCGM is a GPU monitoring and management toolkit for large-scale GPU deployments. Install DCGM on all GPU nodes in an HPC cluster to maintain reliability and monitor GPU health.

Based on the current CUDA 12.9 environment, install the corresponding version: datacenter-gpu-manager-4-cuda12. This package requires NVIDIA GPU drivers to be installed. After the package is installed, it configures the nvidia-dcgm system service to monitor GPU status and provides the 'dcgmi' command line tool for users.

Signed-off-by: Yaju Cao <yacao@redhat.com>
1 parent c9ee245 commit 195b8b5

4 files changed: 32 additions, 0 deletions


README.md

Lines changed: 12 additions & 0 deletions
@@ -115,6 +115,18 @@ Default: `true`
 
 Type: `bool`
 
+### hpc_install_nvidia_dcgm
+
+Whether to install the NVIDIA Data Center GPU Manager (DCGM) and enable its nvidia-dcgm service.
+
+NVIDIA DCGM is a GPU monitoring and management toolkit for large-scale GPU deployments. Install DCGM on all GPU nodes in an HPC cluster to maintain reliability and monitor GPU health.
+
+Run `dcgmi` on the GPU nodes, e.g. `dcgmi discovery -l` to list the GPUs on a node.
+
+Default: `true`
+
+Type: `bool`
+
 ### hpc_install_rdma
 
 Whether to install the NVIDIA RDMA package.
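
As a rough illustration of how the new `hpc_install_nvidia_dcgm` variable might be set from a playbook, here is a minimal sketch. The role name `hpc` and the inventory group `gpu_nodes` are assumptions made for the example and do not come from this commit; only the variable name and its default of `true` are taken from the diff.

```yaml
# Sketch only: the role name and host group below are hypothetical.
- name: Configure HPC GPU nodes
  hosts: gpu_nodes            # hypothetical inventory group
  become: true
  roles:
    - role: hpc               # hypothetical; use the name this role is installed under
      vars:
        # Defaults to true; set to false to skip installing DCGM
        # and enabling the nvidia-dcgm service.
        hpc_install_nvidia_dcgm: true
```

Because the default is `true`, the variable only needs to be set explicitly on nodes where DCGM should not be installed, for example nodes without NVIDIA GPU drivers.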

defaults/main.yml

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ hpc_install_cuda_toolkit: true
 hpc_install_hpc_nvidia_nccl: true
 hpc_install_nvidia_fabric_manager: true
 hpc_install_nvidia_imex: true
+hpc_install_nvidia_dcgm: true
 hpc_install_rdma: true
 hpc_enable_azure_persistent_rdma_naming: true
 hpc_azure_disable_predictable_net_names: true

tasks/main.yml

Lines changed: 17 additions & 0 deletions
@@ -593,6 +593,23 @@
     value: "0"
     state: present
 
+- name: Install NVIDIA datacenter GPU manager and enable service
+  when: hpc_install_nvidia_dcgm
+  block:
+    - name: Install NVIDIA datacenter GPU manager
+      package:
+        name: "{{ __hpc_nvidia_dcgm }}"
+        state: present
+        use: "{{ (__hpc_server_is_ostree | d(false)) |
+          ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
+      register: __hpc_nvidia_dcgm_install
+      until: __hpc_nvidia_dcgm_install is success
+
+    - name: Ensure dcgm service is enabled
+      service:
+        name: nvidia-dcgm
+        enabled: true
+
 - name: Install RDMA packages
   when: hpc_install_rdma
   block:
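
One possible way to check the result of these tasks on a node is sketched below. These verification tasks are not part of the commit; the registered variable name `__dcgm_discovery` is hypothetical. The sketch relies on `ansible.builtin.service_facts`, which reports systemd units under keys such as `nvidia-dcgm.service`, and on the `dcgmi discovery -l` command mentioned in the README. Note that the commit only enables the service, so the `dcgmi` step assumes the service is already running (for example after a reboot or a manual start).

```yaml
# Sketch only: verification tasks that are NOT part of this commit.
- name: Gather service facts
  ansible.builtin.service_facts:

- name: Check that the nvidia-dcgm service is enabled
  ansible.builtin.assert:
    that:
      - ansible_facts.services['nvidia-dcgm.service'] is defined
      - ansible_facts.services['nvidia-dcgm.service'].status == 'enabled'

- name: List the GPUs that DCGM can see
  ansible.builtin.command: dcgmi discovery -l
  register: __dcgm_discovery   # hypothetical variable name
  changed_when: false          # read-only query, never reports a change
```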

vars/RedHat_9.yml

Lines changed: 2 additions & 0 deletions
@@ -34,6 +34,8 @@ __hpc_nvidia_nccl_packages:
 __hpc_docker_packages:
   - moby-engine-29.1.4-1.el9
   - moby-cli-29.1.4-1.el9
+__hpc_nvidia_dcgm:
+  - datacenter-gpu-manager-4-cuda12
 
 # Vars related to building packages from source
 __hpc_gdrcopy_info:
