Merged
12 changes: 12 additions & 0 deletions README.md
@@ -115,6 +115,18 @@ Default: `true`

Type: `bool`

### hpc_install_nvidia_dcgm

Whether to install the NVIDIA Data Center GPU Manager (DCGM) and enable its `nvidia-dcgm` service.

NVIDIA DCGM is a GPU monitoring and management toolkit for large-scale GPU deployments. Install DCGM on all GPU nodes in an HPC cluster to monitor GPU health and maintain reliability.

Run `dcgmi` on the GPU nodes, for example `dcgmi discovery -l` to list the GPUs on a node.

Default: `true`

Type: `bool`
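
A minimal playbook sketch for toggling this variable. The role name `hpc_role` and the inventory group `gpu_nodes` are placeholders, not names taken from this README. The post-tasks run the `dcgmi discovery -l` command mentioned above so that the GPU list appears in the play output.

```yaml
# Sketch only: `hpc_role` and `gpu_nodes` are placeholder names.
- name: Configure GPU nodes with DCGM
  hosts: gpu_nodes
  become: true
  roles:
    - role: hpc_role
      vars:
        hpc_install_nvidia_dcgm: true  # set to false to skip DCGM
  post_tasks:
    - name: List GPUs seen by DCGM
      ansible.builtin.command: dcgmi discovery -l
      register: dcgm_discovery
      changed_when: false  # read-only query, never reports a change

    - name: Show discovery output
      ansible.builtin.debug:
        var: dcgm_discovery.stdout_lines
```

The query task assumes the `nvidia-dcgm` service is already running. The role enables the service but does not appear to start it, so a reboot or a manual start may be needed before `dcgmi` can connect.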

### hpc_install_rdma

Whether to install the NVIDIA RDMA package.
1 change: 1 addition & 0 deletions defaults/main.yml
@@ -21,6 +21,7 @@ hpc_install_cuda_toolkit: true
hpc_install_hpc_nvidia_nccl: true
hpc_install_nvidia_fabric_manager: true
hpc_install_nvidia_imex: true
hpc_install_nvidia_dcgm: true
hpc_install_rdma: true
hpc_enable_azure_persistent_rdma_naming: true
hpc_azure_disable_predictable_net_names: true
17 changes: 17 additions & 0 deletions tasks/main.yml
@@ -593,6 +593,23 @@
value: "0"
state: present

- name: Install NVIDIA datacenter GPU manager enable service
  when: hpc_install_nvidia_dcgm
  block:
    - name: Install NVIDIA datacenter GPU manager
      package:
        name: "{{ __hpc_nvidia_dcgm }}"
        state: present
        use: "{{ (__hpc_server_is_ostree | d(false)) |
          ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
      register: __hpc_nvidia_dcgm_install
      until: __hpc_nvidia_dcgm_install is success

    - name: Ensure dcgm service is enabled
      service:
        name: nvidia-dcgm
        enabled: true

- name: Install RDMA packages
  when: hpc_install_rdma
  block:
2 changes: 2 additions & 0 deletions vars/RedHat_9.yml
@@ -34,6 +34,8 @@
__hpc_docker_packages:
  - moby-engine-29.1.4-1.el9
  - moby-cli-29.1.4-1.el9
__hpc_nvidia_dcgm:
  - datacenter-gpu-manager-4-cuda12

# Vars related to building packages from source
__hpc_gdrcopy_info:
@@ -75,4 +77,4 @@
__hpc_kvp_client_info:
  name: kvp_client.c
  sha256: 0f7fa598c40994e9a18fa2982ee6ca9a591bd2ebc419f1ec526ea9bf19aae1eb
  url: https://raw.githubusercontent.com/microsoft/lis-test/master/WS2012R2/lisa/tools/KVP/kvp_client.c

Check warning on line 80 in vars/RedHat_9.yml

GitHub Actions / Detect non-inclusive language

`master` may be insensitive, use `primary`, `source`, `initiator,requester`, `controller,host`, `director` instead