NVIDIA GPU Operator gpu-cluster-policy in OperandNotReady state in multiple clusters #768

Open
computate opened this issue Oct 11, 2024 · 22 comments · Fixed by OCP-on-NERC/nerc-ocp-config#586

@computate
Member

I was trying to figure out why the wrk-99 node in nerc-ocp-prod has 4 GPU devices, but the GPU utilization for wrk-99 on nerc-ocp-prod is not available.

I found that the NVIDIA GPU Operator gpu-cluster-policy on nerc-ocp-prod is in an OperandNotReady status, but the NVIDIA GPU Operator gpu-cluster-policy on nerc-ocp-test is in a Ready state.

I also noticed that the plugin-validation container of the nvidia-operator-validator-nnjjx pod in the nvidia-gpu-operator namespace is not becoming ready and has a repeated error in its log:

time="2024-10-11T15:35:31Z" level=info msg="pod nvidia-device-plugin-validator-6sb75 is curently in Failed phase"

I don't know the reason for this error with the gpu-cluster-policy in nerc-ocp-prod.
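
For reference, the operator state and the failing init container's log can be checked with something like this (a sketch; the ClusterPolicy status field path is an assumption, and the pod name is the one from above):

$ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}'; echo
$ oc -n nvidia-gpu-operator logs nvidia-operator-validator-nnjjx -c plugin-validation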

@computate
Member Author

The pod that is failing is on wrk-88, if that makes a difference; the node itself looks Ready.

@computate
Member Author

computate commented Oct 14, 2024

This is also happening in the nerc-ocp-test cluster now with our only GPU there. I'm pretty sure this means we can't use the GPU, because the drivers are not being installed on node wrk-3.

$ oc -n nvidia-gpu-operator logs -l app=nvidia-device-plugin-validator -c plugin-validation
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]
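
To confirm whether the driver daemonset ever comes up on wrk-3, something like the following should show it (a sketch; the grep pattern assumes the usual pod naming):

$ oc -n nvidia-gpu-operator get pod -o wide --field-selector spec.nodeName=wrk-3 | grep nvidia-driver-daemonset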

@computate
Member Author

I created a Red Hat support case to address this.

@computate computate self-assigned this Oct 16, 2024
@StHeck

StHeck commented Oct 16, 2024

A google of the error code led me to this:
https://stackoverflow.com/questions/3253257/cuda-driver-version-is-insufficient-for-cuda-runtime-version

Which has this comment:
"-> CUDA driver version is insufficient for CUDA runtime version

But this error is misleading, by selecting back the NVIDIA(Performance mode) with nvidia-settings utility the problem disappears.

It is not a version problem."

Are the GPUs in power saving mode?
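
If someone with access wants to check, nvidia-smi can report the performance state and persistence mode from inside the driver pod, e.g. (a sketch; the query fields are standard nvidia-smi ones):

$ nvidia-smi --query-gpu=name,pstate,persistence_mode --format=csv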

@dystewart

@StHeck yeah you're right, just confirmed it's not a version error: https://docs.nvidia.com/deploy/cuda-compatibility/#cuda-11-and-later-defaults-to-minor-version-compatibility

According to nvidia-smi we have a valid config:

| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |

@dystewart

Looked at all workloads on the wrk-3 node in the nerc-ocp-test cluster to try to rule out race conditions preventing the GPU operator validator from starting properly. No workloads are competing for the GPU, however.

Tried deleting and replacing the clusterPolicy with the following specs:

{
  "apiVersion": "nvidia.com/v1",
  "kind": "ClusterPolicy",
  "metadata": {
    "name": "gpu-cluster-policy"
  },
  "spec": {
    "operator": {
      "defaultRuntime": "crio",
      "use_ocp_driver_toolkit": true,
      "initContainer": {}
    },
    "sandboxWorkloads": {
      "enabled": false,
      "defaultWorkload": "container"
    },
    "driver": {
      "enabled": true,
      "useNvidiaDriverCRD": false,
      "useOpenKernelModules": false,
      "upgradePolicy": {
        "autoUpgrade": true,
        "drain": {
          "deleteEmptyDir": false,
          "enable": false,
          "force": false,
          "timeoutSeconds": 300
        },
        "maxParallelUpgrades": 1,
        "maxUnavailable": "25%",
        "podDeletion": {
          "deleteEmptyDir": false,
          "force": false,
          "timeoutSeconds": 300
        },
        "waitForCompletion": {
          "timeoutSeconds": 0
        }
      },
      "repoConfig": {
        "configMapName": ""
      },
      "certConfig": {
        "name": ""
      },
      "licensingConfig": {
        "nlsEnabled": true,
        "configMapName": ""
      },
      "virtualTopology": {
        "config": ""
      },
      "kernelModuleConfig": {
        "name": ""
      }
    },
    "dcgmExporter": {
      "enabled": true,
      "config": {
        "name": ""
      },
      "serviceMonitor": {
        "enabled": true
      }
    },
    "dcgm": {
      "enabled": true
    },
    "daemonsets": {
      "updateStrategy": "RollingUpdate",
      "rollingUpdate": {
        "maxUnavailable": "1"
      }
    },
    "devicePlugin": {
      "enabled": true,
      "config": {
        "name": "",
        "default": ""
      },
      "mps": {
        "root": "/run/nvidia/mps"
      }
    },
    "gfd": {
      "enabled": true
    },
    "migManager": {
      "enabled": true
    },
    "nodeStatusExporter": {
      "enabled": true
    },
    "mig": {
      "strategy": "single"
    },
    "toolkit": {
      "enabled": true
    },
    "validator": {
      "plugin": {
        "env": [
          {
            "name": "WITH_WORKLOAD",
            "value": "false"
          }
        ]
      }
    },
    "vgpuManager": {
      "enabled": false
    },
    "vgpuDeviceManager": {
      "enabled": true
    },
    "sandboxDevicePlugin": {
      "enabled": true
    },
    "vfioManager": {
      "enabled": true
    },
    "gds": {
      "enabled": false
    },
    "gdrcopy": {
      "enabled": false
    }
  }
}
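
Deleting and re-creating can be done roughly like this (a sketch; gpu-cluster-policy.json is just an assumed file name holding the spec above):

$ oc delete clusterpolicy gpu-cluster-policy
$ oc apply -f gpu-cluster-policy.json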

No luck with the clusterPolicy. We have passed the must-gather along to NVIDIA and are awaiting a response from them.

FYI I have disabled auto-sync within ArgoCD while we play around with these resources.

@dystewart

Looks like this is also related: #782

@joachimweyl
Contributor

@computate can we remove " in nerc-ocp-prod" from the title since now we see the same issue in nerc-ocp-test?

@computate computate changed the title NVIDIA GPU Operator gpu-cluster-policy in OperandNotReady state in nerc-ocp-prod NVIDIA GPU Operator gpu-cluster-policy in OperandNotReady state in multiple clusters Oct 22, 2024
@computate
Member Author

computate commented Oct 22, 2024

Done changing the title @joachimweyl

@schwesig
Member

schwesig commented Oct 24, 2024

As of today, 2024-10-24 12:55 ET
NVIDIA operator version and last update
Call with @tssala23 @dystewart @schwesig

| cluster | version | last update | error | machine config update | model failing | nodes |
| --- | --- | --- | --- | --- | --- | --- |
| prod | 24.6.2 | Sep 25 | yes | no | A100 yes, V100 no | wrk-97 |
| test-2 (kruize) | 24.3.0 | Oct 21 | yes | yes, Oct 20 | A100 yes, V100 n/a | MOC-R8PAC23U39 |
| beta | 24.6.2 | | no | yes, Oct 4 | A100 no, V100 n/a | |
| test | 24.6.2 | | yes | no | A100 yes, V100 yes | MOC-R8PAC23U27 |

Vague error message:
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)! [Vector addition of 50000 elements]

@schwesig
Member

Idea for next step:
@taj Salawu removing and re-adding a GPU node on kruize (test-2) to try a fresh restart

@joachimweyl
Contributor

@schwesig do we know what nodes specifically are having these issues? Can you update the table above to include a column for node names?


computate added a commit to computate/nerc-ocp-config that referenced this issue Oct 29, 2024
A suggested fix from NVIDIA for the drivers that were going missing on
the GPU nodes in the prod, test, and test-2 clusters.

Fixes nerc-project/operations#768
@joachimweyl
Contributor

@computate how many GPUs were out of order? Am I correct that they were out of order from Oct 11th to 31st?

@computate
Member Author

@joachimweyl Correct about Oct 11th - 31st. I understand there were 2 GPU nodes broken on the test cluster (wrk-3, wrk-4), 2 GPU nodes broken on the prod cluster (wrk-97, wrk-99), and 1 GPU node broken with 4 GPU slices affected on cluster test-2 (wrk-5).

@joachimweyl
Contributor

joachimweyl commented Nov 1, 2024

@computate do we know which ones were V100 and which were A100? Actually, what would be most helpful is the node names: test was using MOC-R8PAC23U27 and test-2 was using MOC-R8PAC23U39. Do we know what prod was using?

@schwesig
Member

schwesig commented Nov 5, 2024

  • kruize (nerc-ocp-test-2.nerc.mghpcc.org): applied the changes manually, and it works for them again now

@computate
Member Author

Running with these gpu-cluster-policy settings in prod is allowing the drivers to be installed on 9 of 11 nodes so far.

    manager:
      env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "true"
      - name: ENABLE_AUTO_DRAIN
        value: "true"
      - name: DRAIN_USE_FORCE
        value: "true"
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "true"
    repoConfig:
      configMapName: ""
    upgradePolicy:
      autoUpgrade: false
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: true
        force: true
        timeoutSeconds: 300

@computate computate reopened this Nov 6, 2024
@computate
Member Author

Here is the list of nodes where the NVIDIA driver is successfully applied now.

$ oc -n nvidia-gpu-operator get pod -o wide | grep nvidia-operator-validator | grep Running
nvidia-operator-validator-697f5                       1/1     Running                    0                116m    10.131.23.245   wrk-99    <none>           <none>
nvidia-operator-validator-8nm2q                       1/1     Running                    0                117m    10.130.19.118   wrk-105   <none>           <none>
nvidia-operator-validator-dd6xw                       1/1     Running                    0                116m    10.129.19.48    wrk-102   <none>           <none>
nvidia-operator-validator-gvlgd                       1/1     Running                    0                115m    10.128.23.147   wrk-107   <none>           <none>
nvidia-operator-validator-kqdwr                       1/1     Running                    0                117m    10.130.24.128   wrk-108   <none>           <none>
nvidia-operator-validator-lpsvx                       1/1     Running                    0                116m    10.129.23.245   wrk-106   <none>           <none>
nvidia-operator-validator-n2pbh                       1/1     Running                    0                112m    10.130.12.56    wrk-88    <none>           <none>
nvidia-operator-validator-ndnkc                       1/1     Running                    0                116m    10.129.24.193   wrk-104   <none>           <none>
nvidia-operator-validator-z8zs8                       1/1     Running                    0                117m    10.128.24.144   wrk-103   <none>           <none>

Currently the NVIDIA driver install is failing on wrk-97, but it completed on wrk-89.

$ oc -n nvidia-gpu-operator get pod -o wide | grep nvidia-driver-daemonset | grep -v Running
nvidia-driver-daemonset-415.92.202407191425-0-l6mtg   0/2     Init:CrashLoopBackOff      4 (16s ago)      7m2s   10.129.21.35    wrk-97    <none>           <none>

The NVIDIA driver validation is failing on wrk-97 and wrk-89.

$ oc -n nvidia-gpu-operator get pod -o wide | grep nvidia-operator-validator | grep -v Running
nvidia-operator-validator-fsbfj                       0/1     Init:0/4                   0                71s     10.129.21.78    wrk-97    <none>           <none>
nvidia-operator-validator-hr5pg                       0/1     Init:3/4                   19 (4m21s ago)   115m    10.131.11.208   wrk-89    <none>           <none>

@computate
Member Author

Here are some useful commands I learned for viewing GPU utilization.

$ oc --as system:admin get node wrk-89 -o jsonpath={'.status.allocatable.nvidia\.com/gpu'}; echo
1
$ oc --as system:admin get node wrk-89 -o jsonpath={'.status.capacity.nvidia\.com/gpu'}; echo
1
$ oc --as system:admin get node wrk-97 -o jsonpath={'.status.allocatable.nvidia\.com/gpu'}; echo
0
$ oc --as system:admin get node wrk-97 -o jsonpath={'.status.capacity.nvidia\.com/gpu'}; echo
0
$ oc exec -ti $(oc get pod -o name -n nvidia-gpu-operator -l=app.kubernetes.io/component=nvidia-driver --field-selector spec.nodeName=wrk-89) -n nvidia-gpu-operator -- nvidia-smi
Thu Nov  7 18:48:25 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-32GB           On  |   00000000:3B:00.0 Off |                    0 |
| N/A   35C    P0             25W /  250W |       4MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
$ 
$ 
$ oc exec -ti $(oc get pod -o name -n nvidia-gpu-operator -l=app.kubernetes.io/component=nvidia-driver --field-selector spec.nodeName=wrk-97) -n nvidia-gpu-operator -- nvidia-smi
error: unable to upgrade connection: container not found ("nvidia-driver-ctr")
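
Once a node is healthy, per-GPU utilization should also be visible from the DCGM exporter; a sketch, assuming the default daemonset name, the default port 9400, and the standard DCGM_FI_DEV_GPU_UTIL metric:

$ oc -n nvidia-gpu-operator port-forward ds/nvidia-dcgm-exporter 9400:9400 &
$ curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL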

@computate
Member Author

Also this command:

$ oc get pod -n nvidia-gpu-operator --field-selector spec.nodeName=wrk-97
NAME                                                  READY   STATUS                  RESTARTS         AGE
gpu-feature-discovery-r4czn                           0/1     Init:1/2                0                3m57s
nvidia-container-toolkit-daemonset-5zwpj              0/1     Init:0/1                0                3m57s
nvidia-dcgm-exporter-qcsmc                            1/1     Running                 0                3m57s
nvidia-dcgm-g4kgc                                     1/1     Running                 0                3m57s
nvidia-device-plugin-daemonset-7gf6n                  1/1     Running                 0                3m57s
nvidia-driver-daemonset-415.92.202407191425-0-m6hkg   0/2     Init:CrashLoopBackOff   60 (3m57s ago)   5h42m
nvidia-mig-manager-k59gx                              1/1     Running                 0                3m57s
nvidia-node-status-exporter-schzg                     1/1     Running                 0                22h
nvidia-operator-validator-5kbxt                       0/1     Init:0/4                0                3m57s

$ oc get pod -n nvidia-gpu-operator --field-selector spec.nodeName=wrk-89
NAME                                                  READY   STATUS                     RESTARTS         AGE
gpu-feature-discovery-p4pcz                           1/1     Running                    0                22h
nvidia-container-toolkit-daemonset-c49lj              1/1     Running                    0                22h
nvidia-cuda-validator-9vsjg                           0/1     Completed                  0                22h
nvidia-dcgm-exporter-7xkdt                            1/1     Running                    0                22h
nvidia-dcgm-twrj9                                     1/1     Running                    0                22h
nvidia-device-plugin-daemonset-gjm67                  1/1     Running                    0                22h
nvidia-device-plugin-validator-b25kp                  0/1     UnexpectedAdmissionError   0                6m48s
nvidia-driver-daemonset-415.92.202407191425-0-bs2xs   2/2     Running                    0                22h
nvidia-node-status-exporter-zzrdg                     1/1     Running                    0                22h
nvidia-operator-validator-hr5pg                       0/1     Init:CrashLoopBackOff      200 (107s ago)   22h

@computate
Member Author

$ oc get pod -n nvidia-gpu-operator --field-selector spec.nodeName=wrk-89
NAME                                                  READY   STATUS                     RESTARTS          AGE
gpu-feature-discovery-p4pcz                           1/1     Running                    0                 22h
nvidia-container-toolkit-daemonset-c49lj              1/1     Running                    0                 22h
nvidia-cuda-validator-9vsjg                           0/1     Completed                  0                 22h
nvidia-dcgm-exporter-7xkdt                            1/1     Running                    0                 22h
nvidia-dcgm-twrj9                                     1/1     Running                    0                 22h
nvidia-device-plugin-daemonset-gjm67                  1/1     Running                    0                 22h
nvidia-device-plugin-validator-frd7l                  0/1     UnexpectedAdmissionError   0                 3m27s
nvidia-driver-daemonset-415.92.202407191425-0-bs2xs   2/2     Running                    0                 22h
nvidia-node-status-exporter-zzrdg                     1/1     Running                    0                 22h
nvidia-operator-validator-hr5pg                       0/1     Init:3/4                   201 (6m16s ago)   22h
$ 
$ 
$ oc describe pod -n nvidia-gpu-operator nvidia-device-plugin-validator-frd7l
Name:             nvidia-device-plugin-validator-frd7l
Namespace:        nvidia-gpu-operator
Priority:         0
Service Account:  nvidia-operator-validator
Node:             wrk-89/
Start Time:       Thu, 07 Nov 2024 11:54:27 -0700
Labels:           app=nvidia-device-plugin-validator
Annotations:      openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Failed
Reason:           UnexpectedAdmissionError
Message:          Pod was rejected: Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected
IP:               
IPs:              <none>
Controlled By:    ClusterPolicy/gpu-cluster-policy
Init Containers:
  plugin-validation:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:70a0bd29259820d6257b04b0cdb6a175f9783d4dd19ccc4ec6599d407c359ba5
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      vectorAdd
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8h2dn (ro)
Containers:
  nvidia-device-plugin-validator:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:70a0bd29259820d6257b04b0cdb6a175f9783d4dd19ccc4ec6599d407c359ba5
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      echo device-plugin workload validation is successful
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8h2dn (ro)
Volumes:
  kube-api-access-8h2dn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason                    Age    From     Message
  ----     ------                    ----   ----     -------
  Warning  UnexpectedAdmissionError  3m40s  kubelet  Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected
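
Since the kubelet reports Requested: 1, Available: 0, the next thing worth comparing is what the device plugin on wrk-89 is advertising versus what the node currently reports; a sketch, assuming the app=nvidia-device-plugin-daemonset label:

$ oc -n nvidia-gpu-operator logs $(oc -n nvidia-gpu-operator get pod -o name -l app=nvidia-device-plugin-daemonset --field-selector spec.nodeName=wrk-89)
$ oc get node wrk-89 -o jsonpath={'.status.allocatable.nvidia\.com/gpu'}; echo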
