Commit 195b8b5
feat: Add NVIDIA DCGM installation

NVIDIA DCGM is a GPU monitoring and management toolkit for large-scale
GPU deployments. Install DCGM on all GPU nodes in an HPC cluster to
maintain reliability and monitor GPU health.

Based on the current CUDA 12.9 environment, install the corresponding
version: datacenter-gpu-manager-4-cuda12.

This package requires NVIDIA GPU drivers to be installed. After the
package is installed, it configures the nvidia-dcgm system service
to monitor GPU status and provides the 'dcgmi' command line tool
for users.

Signed-off-by: Yaju Cao <yacao@redhat.com>

1 parent c9ee245 commit 195b8b5
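
The commit message ties the package name to the CUDA major version and names two things to expect after installation: the nvidia-dcgm service and the dcgmi tool. The diff contents are not available here, so the following is only a minimal Python sketch of those two ideas, not the change itself. The helper names `dcgm_package_for_cuda` and `dcgm_status` are hypothetical, and the `datacenter-gpu-manager-<major>-cuda<major>` pattern is inferred from the single package name given in the message.

```python
import shutil
import subprocess


def dcgm_package_for_cuda(cuda_version: str, dcgm_major: int = 4) -> str:
    """Build a DCGM package name from a CUDA version string such as '12.9'.

    Assumption: the name follows the pattern seen in the commit message,
    'datacenter-gpu-manager-4-cuda12'.
    """
    cuda_major = cuda_version.strip().split(".")[0]
    if not cuda_major.isdigit():
        raise ValueError(f"unrecognized CUDA version: {cuda_version!r}")
    return f"datacenter-gpu-manager-{dcgm_major}-cuda{cuda_major}"


def dcgm_status(which=shutil.which, run=subprocess.run) -> dict:
    """Report whether 'dcgmi' is on PATH and whether 'nvidia-dcgm' is active.

    'which' and 'run' are injectable so the check can be exercised
    on a machine without GPUs or systemd.
    """
    dcgmi_path = which("dcgmi")
    systemctl = which("systemctl")
    service_active = False
    if systemctl is not None:
        # 'systemctl is-active --quiet UNIT' exits with 0 only if the unit is active.
        result = run([systemctl, "is-active", "--quiet", "nvidia-dcgm"], check=False)
        service_active = result.returncode == 0
    return {"dcgmi_found": dcgmi_path is not None, "service_active": service_active}


if __name__ == "__main__":
    # prints datacenter-gpu-manager-4-cuda12
    print(dcgm_package_for_cuda("12.9"))
```

On a GPU node, calling `dcgm_status()` after installation would be one way to confirm both that the service is running and that the tool is available. A failure of either check may also point to missing NVIDIA drivers, which the package requires.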
4 files changed: 32 additions, 0 deletions. File names and diff contents are not preserved; only the positions of the added lines remain.

| File | Added lines (new numbering) | Lines added |
|---|---|---|
| 1 | 118-129 | 12 |
| 2 | 24 | 1 |
| 3 | 596-612 | 17 |
| 4 | 37-38 | 2 |