feat: Add NVIDIA DCGM installation #100
Conversation
Reviewer's Guide

Adds optional installation and enablement of NVIDIA Data Center GPU Manager (DCGM) on GPU nodes, controlled by a new Ansible variable, and documents the behavior for RHEL 9 HPC environments.

Sequence diagram for conditional NVIDIA DCGM installation via Ansible:

```mermaid
sequenceDiagram
    actor Admin
    participant Ansible_Controller
    participant HPC_GPU_Node
    participant Package_Manager
    participant Systemd
    Admin->>Ansible_Controller: Run HPC role
    Ansible_Controller->>HPC_GPU_Node: Evaluate hpc_install_nvidia_dcgm
    alt hpc_install_nvidia_dcgm true
        Ansible_Controller->>Package_Manager: Install __hpc_nvidia_dcgm
        Package_Manager-->>Ansible_Controller: Install result
        loop Retry until success
            Ansible_Controller->>Package_Manager: Retry install if failed
            Package_Manager-->>Ansible_Controller: Success or failure
        end
        Ansible_Controller->>Systemd: Enable nvidia-dcgm service
        Systemd-->>Ansible_Controller: Service enabled
    else hpc_install_nvidia_dcgm false
        Ansible_Controller-->>HPC_GPU_Node: Skip DCGM tasks
    end
```
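The conditional flow above hinges on a single role variable. A minimal sketch of how an admin would opt in (the variable name is taken from the diagram; the file path and group name are hypothetical):

```yaml
# inventory/group_vars/gpu_nodes.yml (hypothetical path)
# Opt GPU nodes into the optional DCGM install; the role default
# is assumed to be false, so unlisted hosts skip the DCGM tasks.
hpc_install_nvidia_dcgm: true
```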
Hey - I've found 3 issues, and left some high-level feedback:
- The dcgm service task only enables `nvidia-dcgm` but does not start it; consider adding `state: started` (or `state: started` with `enabled: true`) so behavior matches the PR description of having the service both enabled and started.
- The task label `Install NVIDIA datacenter GPU manager enable service` is a bit unclear; consider rephrasing to something like `Install and enable NVIDIA datacenter GPU manager` for readability.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The dcgm service task only enables `nvidia-dcgm` but does not start it; consider adding `state: started` (or `state: started` with `enabled: true`) so behavior matches the PR description of having the service both enabled and started.
- The task label `Install NVIDIA datacenter GPU manager enable service` is a bit unclear; consider rephrasing to something like `Install and enable NVIDIA datacenter GPU manager` for readability.
## Individual Comments
### Comment 1
<location path="tasks/main.yml" line_range="443-450" />
<code_context>
+- name: Install NVIDIA datacenter GPU manager enable service
+ when: hpc_install_nvidia_dcgm
+ block:
+ - name: Install NVIDIA datacenter GPU manager
+ package:
+ name: "{{ __hpc_nvidia_dcgm }}"
+ state: present
+ use: "{{ (__hpc_server_is_ostree | d(false)) |
+ ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
+ register: __hpc_nvidia_dcgm_install
+ until: __hpc_nvidia_dcgm_install is success
+
+ - name: Ensure dcgm service is enabled
</code_context>
<issue_to_address>
**suggestion:** Add retries/delay to make the `until` loop effective and predictable.
This task uses `until: __hpc_nvidia_dcgm_install is success` but omits `retries` and `delay`, so the loop will not actually retry. Please add explicit values (for example, `retries: 5` and `delay: 10`, or whatever fits your environment) to make the behavior deterministic and resilient to transient repo issues.
```suggestion
- name: Install NVIDIA datacenter GPU manager
package:
name: "{{ __hpc_nvidia_dcgm }}"
state: present
use: "{{ (__hpc_server_is_ostree | d(false)) |
ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
register: __hpc_nvidia_dcgm_install
until: __hpc_nvidia_dcgm_install is success
retries: 5
delay: 10
```
</issue_to_address>
### Comment 2
<location path="tasks/main.yml" line_range="452-438" />
<code_context>
+ register: __hpc_nvidia_dcgm_install
+ until: __hpc_nvidia_dcgm_install is success
+
+ - name: Ensure dcgm service is enabled
+ service:
+ name: nvidia-dcgm
+ enabled: true
+
+
</code_context>
<issue_to_address>
**question (bug_risk):** Consider also ensuring the dcgm service is started, not only enabled.
This only ensures `nvidia-dcgm` is enabled for future boots and doesn’t guarantee it’s running after the play. If DCGM should be active immediately, consider adding `state: started` (and optionally keeping `enabled: true`) to align with other service management tasks.
</issue_to_address>
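A minimal sketch of the combined service task the comment is asking for (the `nvidia-dcgm` unit name is taken from the diff; this is a suggestion, not the merged code):

```yaml
- name: Ensure dcgm service is enabled and started
  service:
    name: nvidia-dcgm
    enabled: true   # persist across reboots
    state: started  # also run immediately after the play
```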
### Comment 3
<location path="README.md" line_range="104" />
<code_context>
+
+Whether to install the NVIDIA datacenter GPU manager(DCGM) and enable its nvidia-dcgm service.
+
+NVIDIA DCGM is a GPU monitoring and management toolkit for large-scale GPU deployments, install DCGM on all GPU nodes in an HPC cluster to maintain reliability and monitor GPU health.
+
+Run `dcgmi` in the GPU nodes, e.g. `dcgmi discovery -l` to list GPUs on the node.
</code_context>
<issue_to_address>
**suggestion (typo):** Avoid the comma splice by splitting into two sentences or rephrasing
This sentence combines two independent clauses with only a comma (a comma splice). Please either split it into two sentences or add a conjunction, for example:
"NVIDIA DCGM is a GPU monitoring and management toolkit for large-scale GPU deployments. Install DCGM on all GPU nodes in an HPC cluster to maintain reliability and monitor GPU health."
```suggestion
NVIDIA DCGM is a GPU monitoring and management toolkit for large-scale GPU deployments. Install DCGM on all GPU nodes in an HPC cluster to maintain reliability and monitor GPU health.
```
</issue_to_address>
```yaml
- name: Install NVIDIA datacenter GPU manager
  package:
    name: "{{ __hpc_nvidia_dcgm }}"
    state: present
    use: "{{ (__hpc_server_is_ostree | d(false)) |
      ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
  register: __hpc_nvidia_dcgm_install
  until: __hpc_nvidia_dcgm_install is success
```
**suggestion:** Add retries/delay to make the `until` loop effective and predictable.

This task uses `until: __hpc_nvidia_dcgm_install is success` but omits `retries` and `delay`, so the loop will not actually retry. Please add explicit values (for example, `retries: 5` and `delay: 10`, or whatever fits your environment) to make the behavior deterministic and resilient to transient repo issues.
```suggestion
- name: Install NVIDIA datacenter GPU manager
  package:
    name: "{{ __hpc_nvidia_dcgm }}"
    state: present
    use: "{{ (__hpc_server_is_ostree | d(false)) |
      ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
  register: __hpc_nvidia_dcgm_install
  until: __hpc_nvidia_dcgm_install is success
  retries: 5
  delay: 10
```
fae3bf4 to 825f9bd (Compare)

The last commit has a typo in the title and an empty commit message without a sign-off, so that needs fixing. Actually, it appears that it is just fixing a typo in the original commit, so you should just fold that back into the original commit and push the branch again. That would then get rid of the problematic back-merge of the main branch that splits the two commits. Back merges mess up the git history, and cause merge conflict problems as the new code is not built directly on top of the existing tree. IOWs, you should rebase your working branches on main to update them, never back merge main into your working branch. FWIW, the back merge of main is why the current status of the PR is "This branch cannot be rebased due to conflicts".....
Thanks @dgchinner for your review and suggestion! I will follow this workflow in the future~ Updating the last commit
Good, but you still need to rebase the commits in this PR on the current main branch before this can be merged.

@richm can we get this merged now?
5d02da7 to 6c1711e (Compare)
@yacao you'll need to rebase on top of the latest
NVIDIA DCGM is a GPU monitoring and management toolkit for large-scale GPU deployments. Install DCGM on all GPU nodes in an HPC cluster to maintain reliability and monitor GPU health.

Based on the current CUDA 12.9 environment, install the corresponding version: datacenter-gpu-manager-4-cuda12. This package requires NVIDIA GPU drivers to be installed. After the package is installed, it configures the nvidia-dcgm system service to monitor GPU status and provides the `dcgmi` command line tool for users.

Signed-off-by: Yaju Cao <yacao@redhat.com>
Updated, thanks!
Enhancement:
Add support for NVIDIA Data Center GPU Manager (DCGM) installation on GPU nodes to enable centralized GPU health monitoring, diagnostics, and telemetry collection in HPC clusters.
Reason:
DCGM provides production-grade GPU observability (health, utilization, diagnostics), which is essential for maintaining reliability and enabling proactive issue detection in large-scale GPU deployments.
Result:
Issue Tracker:
RHELHPC-106
Summary by Sourcery
Add optional installation and enablement of NVIDIA Data Center GPU Manager (DCGM) on supported GPU nodes.
New Features:
Documentation: