NVIDIA GPU Operator gpu-cluster-policy in OperandNotReady state in multiple clusters #768

Open
computate opened this issue Oct 11, 2024 · 22 comments · Fixed by OCP-on-NERC/nerc-ocp-config#586

@computate
Member

I was trying to figure out why the wrk-99 node in nerc-ocp-prod has 4 GPU devices, but the GPU utilization for wrk-99 on nerc-ocp-prod is not available.

I found that the NVIDIA GPU Operator gpu-cluster-policy on nerc-ocp-prod is in an OperandNotReady status, but the NVIDIA GPU Operator gpu-cluster-policy on nerc-ocp-test is in a Ready state.

I also noticed that the plugin-validation container of the nvidia-operator-validator-nnjjx pod in the nvidia-gpu-operator namespace is not becoming ready and has a repeated error in its log:

time="2024-10-11T15:35:31Z" level=info msg="pod nvidia-device-plugin-validator-6sb75 is curently in Failed phase"

I don't know the reason for this error with the gpu-cluster-policy in nerc-ocp-prod.
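
For reference, the operator state and the failing init container's log can be checked with something like this (a sketch; the ClusterPolicy status field path is an assumption, and the pod name is the one from above):

$ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}'; echo
$ oc -n nvidia-gpu-operator logs nvidia-operator-validator-nnjjx -c plugin-validation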

@computate
Member Author

The pod that is failing is on wrk-88, if that makes a difference; the node itself looks Ready.

@computate
Member Author

computate commented Oct 14, 2024

This is also happening in the nerc-ocp-test cluster now with our only GPU there. I'm pretty sure this means we can't use the GPU, because the drivers are not being installed on node wrk-3.

$ oc -n nvidia-gpu-operator logs -l app=nvidia-device-plugin-validator -c plugin-validation
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]
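
To confirm whether the driver daemonset ever comes up on wrk-3, something like the following should show it (a sketch; the grep pattern assumes the usual pod naming):

$ oc -n nvidia-gpu-operator get pod -o wide --field-selector spec.nodeName=wrk-3 | grep nvidia-driver-daemonset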

@computate
Member Author

I created a Red Hat support case to address this.

@computate computate self-assigned this Oct 16, 2024
@StHeck

StHeck commented Oct 16, 2024

A google of the error code led me to this:
https://stackoverflow.com/questions/3253257/cuda-driver-version-is-insufficient-for-cuda-runtime-version

Which has this comment:
"-> CUDA driver version is insufficient for CUDA runtime version

But this error is misleading, by selecting back the NVIDIA(Performance mode) with nvidia-settings utility the problem disappears.

It is not a version problem."

Are the GPUs in power saving mode?
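
If someone with access wants to check, nvidia-smi can report the performance state and persistence mode from inside the driver pod, e.g. (a sketch; the query fields are standard nvidia-smi ones):

$ nvidia-smi --query-gpu=name,pstate,persistence_mode --format=csv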

@dystewart

@StHeck yeah you're right, just confirmed it's not a version error: https://docs.nvidia.com/deploy/cuda-compatibility/#cuda-11-and-later-defaults-to-minor-version-compatibility

According to nvidia-smi we have a valid config:

| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |

@dystewart

Looked at all workloads on the wrk-3 node in the nerc-ocp-test cluster to try to rule out race conditions preventing the GPU operator validator from starting properly. No workloads are competing for the GPU, however.

Tried deleting and replacing the clusterPolicy with the following specs:

{
  "apiVersion": "nvidia.com/v1",
  "kind": "ClusterPolicy",
  "metadata": {
    "name": "gpu-cluster-policy"
  },
  "spec": {
    "operator": {
      "defaultRuntime": "crio",
      "use_ocp_driver_toolkit": true,
      "initContainer": {}
    },
    "sandboxWorkloads": {
      "enabled": false,
      "defaultWorkload": "container"
    },
    "driver": {
      "enabled": true,
      "useNvidiaDriverCRD": false,
      "useOpenKernelModules": false,
      "upgradePolicy": {
        "autoUpgrade": true,
        "drain": {
          "deleteEmptyDir": false,
          "enable": false,
          "force": false,
          "timeoutSeconds": 300
        },
        "maxParallelUpgrades": 1,
        "maxUnavailable": "25%",
        "podDeletion": {
          "deleteEmptyDir": false,
          "force": false,
          "timeoutSeconds": 300
        },
        "waitForCompletion": {
          "timeoutSeconds": 0
        }
      },
      "repoConfig": {
        "configMapName": ""
      },
      "certConfig": {
        "name": ""
      },
      "licensingConfig": {
        "nlsEnabled": true,
        "configMapName": ""
      },
      "virtualTopology": {
        "config": ""
      },
      "kernelModuleConfig": {
        "name": ""
      }
    },
    "dcgmExporter": {
      "enabled": true,
      "config": {
        "name": ""
      },
      "serviceMonitor": {
        "enabled": true
      }
    },
    "dcgm": {
      "enabled": true
    },
    "daemonsets": {
      "updateStrategy": "RollingUpdate",
      "rollingUpdate": {
        "maxUnavailable": "1"
      }
    },
    "devicePlugin": {
      "enabled": true,
      "config": {
        "name": "",
        "default": ""
      },
      "mps": {
        "root": "/run/nvidia/mps"
      }
    },
    "gfd": {
      "enabled": true
    },
    "migManager": {
      "enabled": true
    },
    "nodeStatusExporter": {
      "enabled": true
    },
    "mig": {
      "strategy": "single"
    },
    "toolkit": {
      "enabled": true
    },
    "validator": {
      "plugin": {
        "env": [
          {
            "name": "WITH_WORKLOAD",
            "value": "false"
          }
        ]
      }
    },
    "vgpuManager": {
      "enabled": false
    },
    "vgpuDeviceManager": {
      "enabled": true
    },
    "sandboxDevicePlugin": {
      "enabled": true
    },
    "vfioManager": {
      "enabled": true
    },
    "gds": {
      "enabled": false
    },
    "gdrcopy": {
      "enabled": false
    }
  }
}
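
Deleting and re-creating can be done roughly like this (a sketch; gpu-cluster-policy.json is just an assumed file name holding the spec above):

$ oc delete clusterpolicy gpu-cluster-policy
$ oc apply -f gpu-cluster-policy.json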

No luck with the clusterPolicy. We have passed the must-gather along to NVIDIA and are awaiting a response from them.

FYI I have disabled auto-sync within ArgoCD while we play around with these resources.

@dystewart

Looks like this is also related: #782

@joachimweyl
Contributor

@computate can we remove " in nerc-ocp-prod" from the title since now we see the same issue in nerc-ocp-test?

@computate computate changed the title NVIDIA GPU Operator gpu-cluster-policy in OperandNotReady state in nerc-ocp-prod NVIDIA GPU Operator gpu-cluster-policy in OperandNotReady state in multiple clusters Oct 22, 2024
@computate
Member Author

computate commented Oct 22, 2024

Done changing the title @joachimweyl

@schwesig
Member

schwesig commented Oct 24, 2024

As of today, 2024-10-24 12:55 ET
NVIDIA operator version and last update
Call with @tssala23 @dystewart @schwesig

| cluster | version | last update | error | machine config update | model failing | nodes |
| --- | --- | --- | --- | --- | --- | --- |
| prod | 24.6.2 | Sep 25 | yes | no | A100 yes, V100 no | wrk-97 |
| test-2 (kruize) | 24.3.0 | Oct 21 | yes | yes, Oct 20 | A100 yes, V100 n/a | MOC-R8PAC23U39 |
| beta | 24.6.2 | | no | yes, Oct 4 | A100 no, V100 n/a | |
| test | 24.6.2 | | yes | no | A100 yes, V100 yes | MOC-R8PAC23U27 |

Vague error message:
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)! [Vector addition of 50000 elements]

@schwesig
Member

Idea for next step:
@taj Salawu removing and re-adding a GPU node on kruize (test-2) to try a fresh restart

@joachimweyl
Contributor

@schwesig do we know what nodes specifically are having these issues? Can you update the table above to include a column for node names?


computate added a commit to computate/nerc-ocp-config that referenced this issue Oct 29, 2024
A suggested fix from NVIDIA for the drivers that were going missing on
the GPU nodes in the prod, test, and test-2 clusters.

Fixes nerc-project/operations#768
@joachimweyl
Contributor

@computate how many GPUs were out of order? Am I correct that they were out of order from Oct 11th to 31st?

@computate
Member Author

@joachimweyl Correct about Oct 11th - 31st. I understand there were 2 GPU nodes broken on the test cluster (wrk-3, wrk-4), 2 GPU nodes broken on the prod cluster (wrk-97, wrk-99), and 1 GPU node broken with 4 GPU slices affected on cluster test-2 (wrk-5).

@joachimweyl
Contributor

joachimweyl commented Nov 1, 2024

@computate do we know which ones were V100 and which were A100? Actually, what would be most helpful is the node names: test was using MOC-R8PAC23U27 and test-2 was using MOC-R8PAC23U39. Do we know what prod was using?

@schwesig
Member

schwesig commented Nov 5, 2024

  • kruize (nerc-ocp-test-2.nerc.mghpcc.org): applied the changes manually, and it works for them again now

@computate
Member Author

Running with these gpu-cluster-policy settings in prod is allowing the drivers to be installed on 9 of 11 nodes so far.

    manager:
      env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "true"
      - name: ENABLE_AUTO_DRAIN
        value: "true"
      - name: DRAIN_USE_FORCE
        value: "true"
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "true"
    repoConfig:
      configMapName: ""
    upgradePolicy:
      autoUpgrade: false
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: true
        force: true
        timeoutSeconds: 300

@computate computate reopened this Nov 6, 2024
@computate
Member Author

Here is the list of nodes where the NVIDIA driver is successfully applied now.

$ oc -n nvidia-gpu-operator get pod -o wide | grep nvidia-operator-validator | grep Running
nvidia-operator-validator-697f5                       1/1     Running                    0                116m    10.131.23.245   wrk-99    <none>           <none>
nvidia-operator-validator-8nm2q                       1/1     Running                    0                117m    10.130.19.118   wrk-105   <none>           <none>
nvidia-operator-validator-dd6xw                       1/1     Running                    0                116m    10.129.19.48    wrk-102   <none>           <none>
nvidia-operator-validator-gvlgd                       1/1     Running                    0                115m    10.128.23.147   wrk-107   <none>           <none>
nvidia-operator-validator-kqdwr                       1/1     Running                    0                117m    10.130.24.128   wrk-108   <none>           <none>
nvidia-operator-validator-lpsvx                       1/1     Running                    0                116m    10.129.23.245   wrk-106   <none>           <none>
nvidia-operator-validator-n2pbh                       1/1     Running                    0                112m    10.130.12.56    wrk-88    <none>           <none>
nvidia-operator-validator-ndnkc                       1/1     Running                    0                116m    10.129.24.193   wrk-104   <none>           <none>
nvidia-operator-validator-z8zs8                       1/1     Running                    0                117m    10.128.24.144   wrk-103   <none>           <none>

Currently the NVIDIA driver install is failing on wrk-97, but it completed on wrk-89.

$ oc -n nvidia-gpu-operator get pod -o wide | grep nvidia-driver-daemonset | grep -v Running
nvidia-driver-daemonset-415.92.202407191425-0-l6mtg   0/2     Init:CrashLoopBackOff      4 (16s ago)      7m2s   10.129.21.35    wrk-97    <none>           <none>

The NVIDIA driver validation is failing on wrk-97 and wrk-89.

$ oc -n nvidia-gpu-operator get pod -o wide | grep nvidia-operator-validator | grep -v Running
nvidia-operator-validator-fsbfj                       0/1     Init:0/4                   0                71s     10.129.21.78    wrk-97    <none>           <none>
nvidia-operator-validator-hr5pg                       0/1     Init:3/4                   19 (4m21s ago)   115m    10.131.11.208   wrk-89    <none>           <none>

@computate
Member Author

Here are some useful commands I learned for viewing GPU utilization.

$ oc --as system:admin get node wrk-89 -o jsonpath={'.status.allocatable.nvidia\.com/gpu'}; echo
1
$ oc --as system:admin get node wrk-89 -o jsonpath={'.status.capacity.nvidia\.com/gpu'}; echo
1
$ oc --as system:admin get node wrk-97 -o jsonpath={'.status.allocatable.nvidia\.com/gpu'}; echo
0
$ oc --as system:admin get node wrk-97 -o jsonpath={'.status.capacity.nvidia\.com/gpu'}; echo
0
$ oc exec -ti $(oc get pod -o name -n nvidia-gpu-operator -l=app.kubernetes.io/component=nvidia-driver --field-selector spec.nodeName=wrk-89) -n nvidia-gpu-operator -- nvidia-smi
Thu Nov  7 18:48:25 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-32GB           On  |   00000000:3B:00.0 Off |                    0 |
| N/A   35C    P0             25W /  250W |       4MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
$ 
$ 
$ oc exec -ti $(oc get pod -o name -n nvidia-gpu-operator -l=app.kubernetes.io/component=nvidia-driver --field-selector spec.nodeName=wrk-97) -n nvidia-gpu-operator -- nvidia-smi
error: unable to upgrade connection: container not found ("nvidia-driver-ctr")
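
Once a node is healthy, per-GPU utilization should also be visible from the DCGM exporter; a sketch, assuming the default daemonset name, the default port 9400, and the standard DCGM_FI_DEV_GPU_UTIL metric:

$ oc -n nvidia-gpu-operator port-forward ds/nvidia-dcgm-exporter 9400:9400 &
$ curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL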

@computate
Member Author

Also this command:

$ oc get pod -n nvidia-gpu-operator --field-selector spec.nodeName=wrk-97
NAME                                                  READY   STATUS                  RESTARTS         AGE
gpu-feature-discovery-r4czn                           0/1     Init:1/2                0                3m57s
nvidia-container-toolkit-daemonset-5zwpj              0/1     Init:0/1                0                3m57s
nvidia-dcgm-exporter-qcsmc                            1/1     Running                 0                3m57s
nvidia-dcgm-g4kgc                                     1/1     Running                 0                3m57s
nvidia-device-plugin-daemonset-7gf6n                  1/1     Running                 0                3m57s
nvidia-driver-daemonset-415.92.202407191425-0-m6hkg   0/2     Init:CrashLoopBackOff   60 (3m57s ago)   5h42m
nvidia-mig-manager-k59gx                              1/1     Running                 0                3m57s
nvidia-node-status-exporter-schzg                     1/1     Running                 0                22h
nvidia-operator-validator-5kbxt                       0/1     Init:0/4                0                3m57s

$ oc get pod -n nvidia-gpu-operator --field-selector spec.nodeName=wrk-89
NAME                                                  READY   STATUS                     RESTARTS         AGE
gpu-feature-discovery-p4pcz                           1/1     Running                    0                22h
nvidia-container-toolkit-daemonset-c49lj              1/1     Running                    0                22h
nvidia-cuda-validator-9vsjg                           0/1     Completed                  0                22h
nvidia-dcgm-exporter-7xkdt                            1/1     Running                    0                22h
nvidia-dcgm-twrj9                                     1/1     Running                    0                22h
nvidia-device-plugin-daemonset-gjm67                  1/1     Running                    0                22h
nvidia-device-plugin-validator-b25kp                  0/1     UnexpectedAdmissionError   0                6m48s
nvidia-driver-daemonset-415.92.202407191425-0-bs2xs   2/2     Running                    0                22h
nvidia-node-status-exporter-zzrdg                     1/1     Running                    0                22h
nvidia-operator-validator-hr5pg                       0/1     Init:CrashLoopBackOff      200 (107s ago)   22h

@computate
Member Author

$ oc get pod -n nvidia-gpu-operator --field-selector spec.nodeName=wrk-89
NAME                                                  READY   STATUS                     RESTARTS          AGE
gpu-feature-discovery-p4pcz                           1/1     Running                    0                 22h
nvidia-container-toolkit-daemonset-c49lj              1/1     Running                    0                 22h
nvidia-cuda-validator-9vsjg                           0/1     Completed                  0                 22h
nvidia-dcgm-exporter-7xkdt                            1/1     Running                    0                 22h
nvidia-dcgm-twrj9                                     1/1     Running                    0                 22h
nvidia-device-plugin-daemonset-gjm67                  1/1     Running                    0                 22h
nvidia-device-plugin-validator-frd7l                  0/1     UnexpectedAdmissionError   0                 3m27s
nvidia-driver-daemonset-415.92.202407191425-0-bs2xs   2/2     Running                    0                 22h
nvidia-node-status-exporter-zzrdg                     1/1     Running                    0                 22h
nvidia-operator-validator-hr5pg                       0/1     Init:3/4                   201 (6m16s ago)   22h
$ 
$ 
$ oc describe pod -n nvidia-gpu-operator nvidia-device-plugin-validator-frd7l
Name:             nvidia-device-plugin-validator-frd7l
Namespace:        nvidia-gpu-operator
Priority:         0
Service Account:  nvidia-operator-validator
Node:             wrk-89/
Start Time:       Thu, 07 Nov 2024 11:54:27 -0700
Labels:           app=nvidia-device-plugin-validator
Annotations:      openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Failed
Reason:           UnexpectedAdmissionError
Message:          Pod was rejected: Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected
IP:               
IPs:              <none>
Controlled By:    ClusterPolicy/gpu-cluster-policy
Init Containers:
  plugin-validation:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:70a0bd29259820d6257b04b0cdb6a175f9783d4dd19ccc4ec6599d407c359ba5
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      vectorAdd
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8h2dn (ro)
Containers:
  nvidia-device-plugin-validator:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:70a0bd29259820d6257b04b0cdb6a175f9783d4dd19ccc4ec6599d407c359ba5
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      echo device-plugin workload validation is successful
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8h2dn (ro)
Volumes:
  kube-api-access-8h2dn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason                    Age    From     Message
  ----     ------                    ----   ----     -------
  Warning  UnexpectedAdmissionError  3m40s  kubelet  Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected
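
Since the kubelet reports Requested: 1, Available: 0, the next thing worth comparing is what the device plugin on wrk-89 is advertising versus what the node currently reports; a sketch, assuming the app=nvidia-device-plugin-daemonset label:

$ oc -n nvidia-gpu-operator logs $(oc -n nvidia-gpu-operator get pod -o name -l app=nvidia-device-plugin-daemonset --field-selector spec.nodeName=wrk-89)
$ oc get node wrk-89 -o jsonpath={'.status.allocatable.nvidia\.com/gpu'}; echo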
