feat: Add NVIDIA DCGM installation #100
Conversation
Reviewer's Guide

Adds optional installation and enablement of NVIDIA Data Center GPU Manager (DCGM) on GPU nodes, controlled by a new Ansible variable, and documents the behavior for RHEL 9 HPC environments.

Sequence diagram for conditional NVIDIA DCGM installation via Ansible:

```mermaid
sequenceDiagram
    actor Admin
    participant Ansible_Controller
    participant HPC_GPU_Node
    participant Package_Manager
    participant Systemd
    Admin->>Ansible_Controller: Run HPC role
    Ansible_Controller->>HPC_GPU_Node: Evaluate hpc_install_nvidia_dcgm
    alt hpc_install_nvidia_dcgm true
        Ansible_Controller->>Package_Manager: Install __hpc_nvidia_dcgm
        Package_Manager-->>Ansible_Controller: Install result
        loop Retry until success
            Ansible_Controller->>Package_Manager: Retry install if failed
            Package_Manager-->>Ansible_Controller: Success or failure
        end
        Ansible_Controller->>Systemd: Enable nvidia-dcgm service
        Systemd-->>Ansible_Controller: Service enabled
    else hpc_install_nvidia_dcgm false
        Ansible_Controller-->>HPC_GPU_Node: Skip DCGM tasks
    end
```
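The conditional flow above hinges on a single role variable. A minimal sketch of how an admin would opt in (the variable name is taken from the diagram; the file path and group name are hypothetical):

```yaml
# inventory/group_vars/gpu_nodes.yml (hypothetical path)
# Opt GPU nodes into the optional DCGM install; the role default
# is assumed to be false, so unlisted hosts skip the DCGM tasks.
hpc_install_nvidia_dcgm: true
```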
Hey - I've found 3 issues, and left some high-level feedback:
- The dcgm service task only enables `nvidia-dcgm` but does not start it; consider adding `state: started` (or `state: started` with `enabled: true`) so behavior matches the PR description of having the service both enabled and started.
- The task label `Install NVIDIA datacenter GPU manager enable service` is a bit unclear; consider rephrasing to something like `Install and enable NVIDIA datacenter GPU manager` for readability.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The dcgm service task only enables `nvidia-dcgm` but does not start it; consider adding `state: started` (or `state: started` with `enabled: true`) so behavior matches the PR description of having the service both enabled and started.
- The task label `Install NVIDIA datacenter GPU manager enable service` is a bit unclear; consider rephrasing to something like `Install and enable NVIDIA datacenter GPU manager` for readability.
## Individual Comments
### Comment 1
<location path="tasks/main.yml" line_range="443-450" />
<code_context>
+- name: Install NVIDIA datacenter GPU manager enable service
+ when: hpc_install_nvidia_dcgm
+ block:
+ - name: Install NVIDIA datacenter GPU manager
+ package:
+ name: "{{ __hpc_nvidia_dcgm }}"
+ state: present
+ use: "{{ (__hpc_server_is_ostree | d(false)) |
+ ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
+ register: __hpc_nvidia_dcgm_install
+ until: __hpc_nvidia_dcgm_install is success
+
+ - name: Ensure dcgm service is enabled
</code_context>
<issue_to_address>
**suggestion:** Add retries/delay to make the `until` loop effective and predictable.
This task uses `until: __hpc_nvidia_dcgm_install is success` but omits `retries` and `delay`, so the loop will not actually retry. Please add explicit values (for example, `retries: 5` and `delay: 10`, or whatever fits your environment) to make the behavior deterministic and resilient to transient repo issues.
```suggestion
- name: Install NVIDIA datacenter GPU manager
package:
name: "{{ __hpc_nvidia_dcgm }}"
state: present
use: "{{ (__hpc_server_is_ostree | d(false)) |
ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
register: __hpc_nvidia_dcgm_install
until: __hpc_nvidia_dcgm_install is success
retries: 5
delay: 10
```
</issue_to_address>
### Comment 2
<location path="tasks/main.yml" line_range="452-438" />
<code_context>
+ register: __hpc_nvidia_dcgm_install
+ until: __hpc_nvidia_dcgm_install is success
+
+ - name: Ensure dcgm service is enabled
+ service:
+ name: nvidia-dcgm
+ enabled: true
+
+
</code_context>
<issue_to_address>
**question (bug_risk):** Consider also ensuring the dcgm service is started, not only enabled.
This only ensures `nvidia-dcgm` is enabled for future boots and doesn’t guarantee it’s running after the play. If DCGM should be active immediately, consider adding `state: started` (and optionally keeping `enabled: true`) to align with other service management tasks.
</issue_to_address>
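A minimal sketch of the combined service task the comment is asking for (the `nvidia-dcgm` unit name is taken from the diff; this is a suggestion, not the merged code):

```yaml
- name: Ensure dcgm service is enabled and started
  service:
    name: nvidia-dcgm
    enabled: true   # persist across reboots
    state: started  # also run immediately after the play
```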
### Comment 3
<location path="README.md" line_range="104" />
<code_context>
+
+Whether to install the NVIDIA datacenter GPU manager(DCGM) and enable its nvidia-dcgm service.
+
+NVIDIA DCGM is a GPU monitoring and management toolkit for large-scale GPU deployments, install DCGM on all GPU nodes in an HPC cluster to maintain reliability and monitor GPU health.
+
+Run `dcgmi` in the GPU nodes, e.g. `dcgmi discovery -l` to list GPUs on the node.
</code_context>
<issue_to_address>
**suggestion (typo):** Avoid the comma splice by splitting into two sentences or rephrasing
This sentence combines two independent clauses with only a comma (a comma splice). Please either split it into two sentences or add a conjunction, for example:
"NVIDIA DCGM is a GPU monitoring and management toolkit for large-scale GPU deployments. Install DCGM on all GPU nodes in an HPC cluster to maintain reliability and monitor GPU health."
```suggestion
NVIDIA DCGM is a GPU monitoring and management toolkit for large-scale GPU deployments. Install DCGM on all GPU nodes in an HPC cluster to maintain reliability and monitor GPU health.
```
</issue_to_address>
```yaml
- name: Install NVIDIA datacenter GPU manager
  package:
    name: "{{ __hpc_nvidia_dcgm }}"
    state: present
    use: "{{ (__hpc_server_is_ostree | d(false)) |
      ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
  register: __hpc_nvidia_dcgm_install
  until: __hpc_nvidia_dcgm_install is success
```
**suggestion:** Add retries/delay to make the `until` loop effective and predictable.

This task uses `until: __hpc_nvidia_dcgm_install is success` but omits `retries` and `delay`, so the loop will not actually retry. Please add explicit values (for example, `retries: 5` and `delay: 10`, or whatever fits your environment) to make the behavior deterministic and resilient to transient repo issues.
```suggestion
- name: Install NVIDIA datacenter GPU manager
  package:
    name: "{{ __hpc_nvidia_dcgm }}"
    state: present
    use: "{{ (__hpc_server_is_ostree | d(false)) |
      ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
  register: __hpc_nvidia_dcgm_install
  until: __hpc_nvidia_dcgm_install is success
  retries: 5
  delay: 10
```
fae3bf4 to 825f9bd (Compare)

The last commit has a typo in the title and an empty commit message without a sign-off, so that needs fixing. Actually, it appears that it is just fixing a typo in the original commit, so you should just fold that back into the original commit and push the branch again. That would then get rid of the problematic back-merge of the main branch that splits the two commits. Back merges mess up the git history, and cause merge conflict problems as the new code is not built directly on top of the existing tree. IOWs, you should rebase your working branches on main to update them, never back merge main into your working branch. FWIW, the back merge of main is why the current status of the PR is "This branch cannot be rebased due to conflicts".....
Thanks @dgchinner for your review and suggestion! I will follow this workflow in the future~ Updating the last commit
Good, but you still need to rebase the commits in this PR on the current main branch before this can be merged.

@richm can we get this merged now?
5d02da7 to 6c1711e (Compare)
@yacao you'll need to rebase on top of the latest
NVIDIA DCGM is a GPU monitoring and management toolkit for large-scale GPU deployments. Install DCGM on all GPU nodes in an HPC cluster to maintain reliability and monitor GPU health.

Based on the current CUDA 12.9 environment, install the corresponding version: datacenter-gpu-manager-4-cuda12. This package requires NVIDIA GPU drivers to be installed. After the package is installed, it configures the nvidia-dcgm system service to monitor GPU status and provides the `dcgmi` command line tool for users.

Signed-off-by: Yaju Cao <yacao@redhat.com>
Updated, thanks!
Enhancement:
Add support for NVIDIA Data Center GPU Manager (DCGM) installation on GPU nodes to enable centralized GPU health monitoring, diagnostics, and telemetry collection in HPC clusters.
Reason:
DCGM provides production-grade GPU observability (health, utilization, diagnostics), which is essential for maintaining reliability and enabling proactive issue detection in large-scale GPU deployments.
Result:
Issue Tracker:
RHELHPC-106
Summary by Sourcery
Add optional installation and enablement of NVIDIA Data Center GPU Manager (DCGM) on supported GPU nodes.
New Features:
Documentation: