
feat: Add NVIDIA DCGM installation #100

Merged
richm merged 1 commit into linux-system-roles:main from yacao:add-dcgm
Mar 20, 2026

Conversation

@yacao (Collaborator) commented Mar 17, 2026

Enhancement:
Add support for NVIDIA Data Center GPU Manager (DCGM) installation on GPU nodes to enable centralized GPU health monitoring, diagnostics, and telemetry collection in HPC clusters.

Reason:
DCGM provides production-grade GPU observability (health, utilization, diagnostics), which is essential for maintaining reliability and enabling proactive issue detection in large-scale GPU deployments.

Result:

  • Package datacenter-gpu-manager-4-cuda12 is installed on GPU nodes
  • NVIDIA DCGM service (nvidia-dcgm) is enabled and started
  • dcgmi CLI is available for GPU management and diagnostics

Issue Tracker:
RHELHPC-106

Summary by Sourcery

Add optional installation and enablement of NVIDIA Data Center GPU Manager (DCGM) on supported GPU nodes.

New Features:

  • Introduce configurable flag to install NVIDIA DCGM and enable its system service on GPU nodes.
  • Expose OS-specific package variable for installing the NVIDIA DCGM package on RHEL 9-based systems.

Documentation:

  • Document the hpc_install_nvidia_dcgm option and basic dcgmi usage in the role README.
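
The new toggle can be exercised from a playbook along these lines (a minimal sketch; the role reference `linux-system-roles.hpc` and the `gpu_nodes` host group are illustrative placeholders, not taken from this PR):

```yaml
# Hypothetical playbook; role path and host group are placeholders.
- name: Configure HPC GPU nodes
  hosts: gpu_nodes
  vars:
    # Toggle introduced by this PR; defaults to true, set false to skip DCGM.
    hpc_install_nvidia_dcgm: true
  roles:
    - linux-system-roles.hpc
```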

@yacao yacao requested review from richm and spetrosi as code owners March 17, 2026 08:13
@sourcery-ai sourcery-ai bot commented Mar 17, 2026

Reviewer's Guide

Adds optional installation and enablement of NVIDIA Data Center GPU Manager (DCGM) on GPU nodes, controlled by a new Ansible variable, and documents the behavior for RHEL 9 HPC environments.

Sequence diagram for conditional NVIDIA DCGM installation via Ansible

```mermaid
sequenceDiagram
    actor Admin
    participant Controller as Ansible Controller
    participant Node as HPC GPU Node
    participant Pkg as Package Manager
    participant Systemd

    Admin->>Controller: Run HPC role
    Controller->>Node: Evaluate hpc_install_nvidia_dcgm
    alt hpc_install_nvidia_dcgm is true
        Controller->>Pkg: Install __hpc_nvidia_dcgm
        Pkg-->>Controller: Install result
        loop Retry until success
            Controller->>Pkg: Retry install if failed
            Pkg-->>Controller: Success or failure
        end
        Controller->>Systemd: Enable nvidia-dcgm service
        Systemd-->>Controller: Service enabled
    else hpc_install_nvidia_dcgm is false
        Controller-->>Node: Skip DCGM tasks
    end
```

File-Level Changes

Introduce configurable DCGM installation and service enablement task block in the main Ansible role workflow (tasks/main.yml):
  • Add a conditional task block that installs the DCGM package when hpc_install_nvidia_dcgm is true
  • Use the existing ostree-aware package installation pattern via ansible.posix.rhel_rpm_ostree when appropriate
  • Ensure the nvidia-dcgm systemd service is enabled after package installation
  • Make the package installation task retry until it succeeds using Ansible's until mechanism

Document and default-enable the new DCGM installation toggle and package mapping for RHEL 9 (defaults/main.yml, README.md, vars/RedHat_9.yml):
  • Introduce hpc_install_nvidia_dcgm default variable and set its default to true
  • Document the new variable, its purpose, and example dcgmi usage in the role README
  • Define the RHEL 9-specific DCGM package list variable mapping to datacenter-gpu-manager-4-cuda12
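
Given the file list above, the variable plumbing presumably resembles the following (a sketch based only on the names quoted in this review, `hpc_install_nvidia_dcgm` and `__hpc_nvidia_dcgm`, not the merged files verbatim):

```yaml
# defaults/main.yml (sketch): toggle introduced by this PR, enabled by default
hpc_install_nvidia_dcgm: true

# vars/RedHat_9.yml (sketch): RHEL 9 mapping to the CUDA 12 build of DCGM
__hpc_nvidia_dcgm:
  - datacenter-gpu-manager-4-cuda12
```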


@sourcery-ai sourcery-ai bot left a comment

Hey - I've found 3 issues, and left some high level feedback:

  • The dcgm service task only enables nvidia-dcgm but does not start it; consider adding state: started (or state: started with enabled: true) so behavior matches the PR description of having the service both enabled and started.
  • The task label Install NVIDIA datacenter GPU manager enable service is a bit unclear; consider rephrasing to something like Install and enable NVIDIA datacenter GPU manager for readability.
Individual Comments

### Comment 1
<location path="tasks/main.yml" line_range="443-450" />
<code_context>
+- name: Install NVIDIA datacenter GPU manager enable service
+  when: hpc_install_nvidia_dcgm
+  block:
+    - name: Install NVIDIA datacenter GPU manager
+      package:
+        name: "{{ __hpc_nvidia_dcgm }}"
+        state: present
+        use: "{{ (__hpc_server_is_ostree | d(false)) |
+          ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
+      register: __hpc_nvidia_dcgm_install
+      until: __hpc_nvidia_dcgm_install is success
+
+    - name: Ensure dcgm service is enabled
</code_context>
<issue_to_address>
**suggestion:** Add retries/delay to make the `until` loop effective and predictable.

This task uses `until: __hpc_nvidia_dcgm_install is success` but omits `retries` and `delay`, so the loop will not actually retry. Please add explicit values (for example, `retries: 5` and `delay: 10`, or whatever fits your environment) to make the behavior deterministic and resilient to transient repo issues.

```suggestion
    - name: Install NVIDIA datacenter GPU manager
      package:
        name: "{{ __hpc_nvidia_dcgm }}"
        state: present
        use: "{{ (__hpc_server_is_ostree | d(false)) |
          ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
      register: __hpc_nvidia_dcgm_install
      until: __hpc_nvidia_dcgm_install is success
      retries: 5
      delay: 10
```
</issue_to_address>

### Comment 2
<location path="tasks/main.yml" line_range="452-438" />
<code_context>
+      register: __hpc_nvidia_dcgm_install
+      until: __hpc_nvidia_dcgm_install is success
+
+    - name: Ensure dcgm service is enabled
+      service:
+        name: nvidia-dcgm
+        enabled: true
+
+
</code_context>
<issue_to_address>
**question (bug_risk):** Consider also ensuring the dcgm service is started, not only enabled.

This only ensures `nvidia-dcgm` is enabled for future boots and doesn’t guarantee it’s running after the play. If DCGM should be active immediately, consider adding `state: started` (and optionally keeping `enabled: true`) to align with other service management tasks.
</issue_to_address>
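
A task implementing this suggestion could look like the following (a sketch of the reviewer's proposal, not the code as merged):

```yaml
- name: Ensure dcgm service is enabled and started
  service:
    name: nvidia-dcgm
    enabled: true   # start on future boots
    state: started  # and ensure it is running now
```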

### Comment 3
<location path="README.md" line_range="104" />
<code_context>
+
+Whether to install the NVIDIA datacenter GPU manager(DCGM) and  enable its nvidia-dcgm service. 
+
+NVIDIA DCGM is a GPU monitoring and management toolkit for large-scale GPU deployments, install DCGM on all GPU nodes in an HPC cluster to maintain reliability and monitor GPU health.
+
+Run `dcgmi` in the GPU nodes, e.g. `dcgmi discovery -l` to list GPUs on the node.  
</code_context>
<issue_to_address>
**suggestion (typo):** Avoid the comma splice by splitting into two sentences or rephrasing

This sentence combines two independent clauses with only a comma (a comma splice). Please either split it into two sentences or add a conjunction, for example:

"NVIDIA DCGM is a GPU monitoring and management toolkit for large-scale GPU deployments. Install DCGM on all GPU nodes in an HPC cluster to maintain reliability and monitor GPU health."

```suggestion
NVIDIA DCGM is a GPU monitoring and management toolkit for large-scale GPU deployments. Install DCGM on all GPU nodes in an HPC cluster to maintain reliability and monitor GPU health.
```
</issue_to_address>


@yacao yacao force-pushed the add-dcgm branch 2 times, most recently from fae3bf4 to 825f9bd Compare March 17, 2026 16:29
@dgchinner (Collaborator) commented

The last commit has a typo in the title and an empty commit message without a sign-off, so that needs fixing. Actually, it appears that it is just fixing a typo in the original commit, so you should just fold that back into the original commit and push the branch again.

That would then get rid of the problematic back-merge of the main branch that splits the two commits. Back merges mess up the git history and cause merge conflict problems, as the new code is not built directly on top of the existing tree. In other words, you should rebase your working branches on main to update them; never back-merge main into your working branch.

FWIW, the back merge of main is why the current status of the PR is "This branch cannot be rebased due to conflicts".....

@yacao (Collaborator, Author) commented Mar 18, 2026

The last commit has a typo in the title and an empty commit message without a sign-off, so that needs fixing. Actually, it appears that it is just fixing a typo in the original commit, so you should just fold that back into the original commit and push the branch again.

That would then get rid of the problematic back-merge of the main branch that splits the two commits. Back merges mess up the git history and cause merge conflict problems, as the new code is not built directly on top of the existing tree. In other words, you should rebase your working branches on main to update them; never back-merge main into your working branch.

FWIW, the back merge of main is why the current status of the PR is "This branch cannot be rebased due to conflicts".....

Thanks @dgchinner for your review and suggestion! I will follow this workflow in the future. Updating the last commit.

@dgchinner (Collaborator) commented

That would then get rid of the problematic back-merge of the main branch that splits the two commits. Back merges mess up the git history and cause merge conflict problems, as the new code is not built directly on top of the existing tree. In other words, you should rebase your working branches on main to update them; never back-merge main into your working branch.
FWIW, the back merge of main is why the current status of the PR is "This branch cannot be rebased due to conflicts".

Thanks @dgchinner for your review and suggestion! I will follow this workflow in the future. Updating the last commit.

Good, but you still need to rebase the commits in this PR on the current main branch before this can be merged.

@yacao (Collaborator, Author) commented Mar 19, 2026

@richm can we get this merged now?

@yacao yacao force-pushed the add-dcgm branch 3 times, most recently from 5d02da7 to 6c1711e Compare March 20, 2026 06:22
@richm (Contributor) commented Mar 20, 2026

@yacao you'll need to rebase on top of the latest main branch and resolve the conflicts

NVIDIA DCGM is a GPU monitoring and management toolkit for large-scale
GPU deployments. Install DCGM on all GPU nodes in an HPC cluster to
maintain reliability and monitor GPU health.

Based on the current CUDA 12.9 environment, install the corresponding
version: datacenter-gpu-manager-4-cuda12.

This package requires NVIDIA GPU drivers to be installed. After the
package is installed, it configures the nvidia-dcgm system service
to monitor GPU status and provides the 'dcgmi' command line tool
for users.

Signed-off-by: Yaju Cao <yacao@redhat.com>
@yacao (Collaborator, Author) commented Mar 20, 2026

@yacao you'll need to rebase on top of the latest main branch and resolve the conflicts

Updated, thanks!

@richm richm merged commit 271a2fe into linux-system-roles:main Mar 20, 2026
22 checks passed
@yacao yacao deleted the add-dcgm branch March 23, 2026 08:06