NVIDIA GPU Operator gpu-cluster-policy in OperandNotReady state in multiple clusters #768
Comments
The pod that is failing is on wrk-88, if that makes a difference; the node looks Ready.
This is also happening; the plugin-validation container logs show:

```
$ oc -n nvidia-gpu-operator logs -l app=nvidia-device-plugin-validator -c plugin-validation
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]
```
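To cross-check the "driver version is insufficient" message, one option is to ask nvidia-smi inside the driver container which driver it actually loaded; this is a sketch that reuses the driver-pod selector and the `nvidia-driver-ctr` container name seen later in this thread, with `<node>` as a placeholder for the node being debugged:

```bash
# Sketch: report the loaded NVIDIA driver version on a given node.
# <node> is a placeholder; plain `nvidia-smi` also shows the maximum supported
# CUDA version in its header, which is what the vectorAdd runtime is checked against.
oc -n nvidia-gpu-operator exec -ti \
  $(oc get pod -o name -n nvidia-gpu-operator \
    -l app.kubernetes.io/component=nvidia-driver --field-selector spec.nodeName=<node>) \
  -c nvidia-driver-ctr -- nvidia-smi --query-gpu=driver_version --format=csv,noheader
```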
I created this Red Hat support case to address this.
A Google search of the error code led me to this, which has this comment: "But this error is misleading, by selecting back the NVIDIA (Performance mode) with the nvidia-settings utility the problem disappears. It is not a version problem." Are the GPUs in power-saving mode?
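One way to check this from the cluster is to query the GPU performance state and power draw from inside the driver pod; a sketch, with `<node>` as a placeholder for the node in question:

```bash
# Sketch: check whether the GPU is sitting in a low-power performance state (e.g. P8)
# rather than P0. <node> is a placeholder for the GPU node being debugged.
oc -n nvidia-gpu-operator exec -ti \
  $(oc get pod -o name -n nvidia-gpu-operator \
    -l app.kubernetes.io/component=nvidia-driver --field-selector spec.nodeName=<node>) -- \
  nvidia-smi --query-gpu=name,pstate,power.draw,power.limit --format=csv
```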
@StHeck yeah, you're right; I just confirmed it's not a version error: https://docs.nvidia.com/deploy/cuda-compatibility/#cuda-11-and-later-defaults-to-minor-version-compatibility. According to nvidia-smi we have a valid config:
Looked at all workloads on the wrk-3 node in the nerc-ocp-test cluster to try to rule out race conditions preventing the GPU operator validator from starting properly. No workloads are competing for the GPU, however. Tried deleting and replacing the clusterPolicy with the following specs:
No luck with the clusterPolicy. We have passed along the must-gather to NVIDIA and are awaiting a response from them. FYI, I have disabled auto-sync within ArgoCD while we play around with these resources.
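For reference, pausing and resuming ArgoCD auto-sync can also be done from the CLI; a sketch, where `nvidia-gpu-operator` is a placeholder for however the application is actually named in this ArgoCD instance:

```bash
# Sketch: disable automated sync so manual ClusterPolicy edits are not reverted.
# "nvidia-gpu-operator" is a placeholder application name; substitute the real ArgoCD app.
argocd app set nvidia-gpu-operator --sync-policy none

# Re-enable automated sync once debugging is finished.
argocd app set nvidia-gpu-operator --sync-policy automated
```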
Looks like this is also related: #782
@computate can we remove " in nerc-ocp-prod" from the title, since we now see the same issue in nerc-ocp-test?
Done changing the title, @joachimweyl.
As of today, 2024-10-24 12:55 ET:
- vague error message
- idea for next step:
@schwesig do we know what nodes specifically are having these issues? Can you update the table above to include a column for node names?
A suggested fix from NVIDIA for the drivers that were going missing on the GPU nodes in the prod, test, and test-2 clusters. Fixes nerc-project/operations#768
@computate how many GPUs were out of order? Am I correct that they were out of order from Oct 11th - 31st?
@joachimweyl Correct about Oct 11th - 31st. I understand there were 2 GPU nodes broken on the test cluster (wrk-3, wrk-4), 2 GPU nodes broken on the prod cluster (wrk-97, wrk-99), and 1 GPU node broken with 4 GPU slices affected on cluster test-2 (wrk-5).
@computate do we know which ones were V100 and which were A100? Actually, what would be most helpful is the node names; for example, test was using MOC-R8PAC23U27 and test-2 was using MOC-R8PAC23U39. Do we know what prod was using?
Running with these gpu-cluster-policy settings in prod is allowing the drivers to be installed on 9/11 nodes so far.

```yaml
manager:
  env:
    - name: ENABLE_GPU_POD_EVICTION
      value: "true"
    - name: ENABLE_AUTO_DRAIN
      value: "true"
    - name: DRAIN_USE_FORCE
      value: "true"
    - name: DRAIN_DELETE_EMPTYDIR_DATA
      value: "true"
repoConfig:
  configMapName: ""
upgradePolicy:
  autoUpgrade: false
  drain:
    deleteEmptyDir: true
    enable: true
    force: true
    timeoutSeconds: 300
  maxParallelUpgrades: 1
  maxUnavailable: 25%
  podDeletion:
    deleteEmptyDir: true
    force: true
    timeoutSeconds: 300
```
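For anyone reproducing this, here is a minimal sketch of applying the driver-manager env vars above as a merge patch against the ClusterPolicy; it assumes these fields live under `spec.driver` in the installed CRD version, which should be verified (e.g. with `oc explain clusterpolicy.spec.driver.manager`) before applying:

```bash
# Sketch: merge-patch the driver manager env vars into the existing gpu-cluster-policy.
# The spec.driver path is an assumption about the installed ClusterPolicy CRD version.
oc patch clusterpolicy gpu-cluster-policy --type merge -p '
  {"spec":{"driver":{"manager":{"env":[
    {"name":"ENABLE_GPU_POD_EVICTION","value":"true"},
    {"name":"ENABLE_AUTO_DRAIN","value":"true"},
    {"name":"DRAIN_USE_FORCE","value":"true"},
    {"name":"DRAIN_DELETE_EMPTYDIR_DATA","value":"true"}]}}}}'
```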
Here is the list of nodes where the NVIDIA driver is successfully applied now.

```
$ oc -n nvidia-gpu-operator get pod -o wide | grep nvidia-operator-validator | grep Running
nvidia-operator-validator-697f5 1/1 Running 0 116m 10.131.23.245 wrk-99 <none> <none>
nvidia-operator-validator-8nm2q 1/1 Running 0 117m 10.130.19.118 wrk-105 <none> <none>
nvidia-operator-validator-dd6xw 1/1 Running 0 116m 10.129.19.48 wrk-102 <none> <none>
nvidia-operator-validator-gvlgd 1/1 Running 0 115m 10.128.23.147 wrk-107 <none> <none>
nvidia-operator-validator-kqdwr 1/1 Running 0 117m 10.130.24.128 wrk-108 <none> <none>
nvidia-operator-validator-lpsvx 1/1 Running 0 116m 10.129.23.245 wrk-106 <none> <none>
nvidia-operator-validator-n2pbh 1/1 Running 0 112m 10.130.12.56 wrk-88 <none> <none>
nvidia-operator-validator-ndnkc 1/1 Running 0 116m 10.129.24.193 wrk-104 <none> <none>
nvidia-operator-validator-z8zs8 1/1 Running 0 117m 10.128.24.144 wrk-103 <none> <none>
```

Currently the NVIDIA driver is failing on wrk-97:

```
$ oc -n nvidia-gpu-operator get pod -o wide | grep nvidia-driver-daemonset | grep -v Running
nvidia-driver-daemonset-415.92.202407191425-0-l6mtg 0/2 Init:CrashLoopBackOff 4 (16s ago) 7m2s 10.129.21.35 wrk-97 <none> <none>
```

The NVIDIA driver validation is failing on wrk-97 and wrk-89:

```
$ oc -n nvidia-gpu-operator get pod -o wide | grep nvidia-operator-validator | grep -v Running
nvidia-operator-validator-fsbfj 0/1 Init:0/4 0 71s 10.129.21.78 wrk-97 <none> <none>
nvidia-operator-validator-hr5pg 0/1 Init:3/4 19 (4m21s ago) 115m 10.131.11.208 wrk-89 <none> <none>
```
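A more compact way to get the same per-node view in one listing, sketched here using the `app.kubernetes.io/component=nvidia-driver` label that appears in the exec commands later in this thread:

```bash
# Sketch: list every driver daemonset pod with its node and status in a single command,
# rather than grepping Running and not-Running pods separately.
oc -n nvidia-gpu-operator get pod \
  -l app.kubernetes.io/component=nvidia-driver -o wide \
  --sort-by=.spec.nodeName
```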
Here are some useful commands I learned for viewing GPU utilization.

```
$ oc --as system:admin get node wrk-89 -o jsonpath={'.status.allocatable.nvidia\.com/gpu'}; echo
1
$ oc --as system:admin get node wrk-89 -o jsonpath={'.status.capacity.nvidia\.com/gpu'}; echo
1
$ oc --as system:admin get node wrk-97 -o jsonpath={'.status.allocatable.nvidia\.com/gpu'}; echo
0
$ oc --as system:admin get node wrk-97 -o jsonpath={'.status.capacity.nvidia\.com/gpu'}; echo
0
$ oc exec -ti $(oc get pod -o name -n nvidia-gpu-operator -l=app.kubernetes.io/component=nvidia-driver --field-selector spec.nodeName=wrk-89) -n nvidia-gpu-operator -- nvidia-smi
Thu Nov 7 18:48:25 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla V100-PCIE-32GB On | 00000000:3B:00.0 Off | 0 |
| N/A 35C P0 25W / 250W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
$ oc exec -ti $(oc get pod -o name -n nvidia-gpu-operator -l=app.kubernetes.io/component=nvidia-driver --field-selector spec.nodeName=wrk-97) -n nvidia-gpu-operator -- nvidia-smi
error: unable to upgrade connection: container not found ("nvidia-driver-ctr")
```
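Building on those, a small loop can compare capacity against allocatable across all GPU nodes at once; this sketch assumes GPU Feature Discovery has labeled the GPU nodes with `nvidia.com/gpu.present=true`:

```bash
# Sketch: print capacity vs. allocatable nvidia.com/gpu for every GPU node.
# Assumes the nvidia.com/gpu.present=true node label set by GPU Feature Discovery.
for node in $(oc get nodes -l nvidia.com/gpu.present=true -o name); do
  cap=$(oc get "$node" -o jsonpath='{.status.capacity.nvidia\.com/gpu}')
  alloc=$(oc get "$node" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')
  echo "$node capacity=$cap allocatable=$alloc"
done
```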
Also this command:

```
$ oc get pod -n nvidia-gpu-operator --field-selector spec.nodeName=wrk-97
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-r4czn 0/1 Init:1/2 0 3m57s
nvidia-container-toolkit-daemonset-5zwpj 0/1 Init:0/1 0 3m57s
nvidia-dcgm-exporter-qcsmc 1/1 Running 0 3m57s
nvidia-dcgm-g4kgc 1/1 Running 0 3m57s
nvidia-device-plugin-daemonset-7gf6n 1/1 Running 0 3m57s
nvidia-driver-daemonset-415.92.202407191425-0-m6hkg 0/2 Init:CrashLoopBackOff 60 (3m57s ago) 5h42m
nvidia-mig-manager-k59gx 1/1 Running 0 3m57s
nvidia-node-status-exporter-schzg 1/1 Running 0 22h
nvidia-operator-validator-5kbxt 0/1 Init:0/4 0 3m57s

$ oc get pod -n nvidia-gpu-operator --field-selector spec.nodeName=wrk-89
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-p4pcz 1/1 Running 0 22h
nvidia-container-toolkit-daemonset-c49lj 1/1 Running 0 22h
nvidia-cuda-validator-9vsjg 0/1 Completed 0 22h
nvidia-dcgm-exporter-7xkdt 1/1 Running 0 22h
nvidia-dcgm-twrj9 1/1 Running 0 22h
nvidia-device-plugin-daemonset-gjm67 1/1 Running 0 22h
nvidia-device-plugin-validator-b25kp 0/1 UnexpectedAdmissionError 0 6m48s
nvidia-driver-daemonset-415.92.202407191425-0-bs2xs 2/2 Running 0 22h
nvidia-node-status-exporter-zzrdg 1/1 Running 0 22h
nvidia-operator-validator-hr5pg 0/1 Init:CrashLoopBackOff 200 (107s ago) 22h
```
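To see why the validator keeps crash-looping on wrk-89, it can help to read the failing init container directly; a sketch that reuses the pod name above and the `plugin-validation` init-container name already seen in this thread (the `describe` output will confirm which init container is actually stuck):

```bash
# Sketch: inspect the validator pod on wrk-89 and read the last crashed init container's logs.
# plugin-validation is assumed to be the stuck init container (Init:3/4 suggests the fourth one);
# check `oc describe` first to confirm.
oc -n nvidia-gpu-operator describe pod nvidia-operator-validator-hr5pg
oc -n nvidia-gpu-operator logs nvidia-operator-validator-hr5pg -c plugin-validation --previous
```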
```
$ oc get pod -n nvidia-gpu-operator --field-selector spec.nodeName=wrk-89
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-p4pcz 1/1 Running 0 22h
nvidia-container-toolkit-daemonset-c49lj 1/1 Running 0 22h
nvidia-cuda-validator-9vsjg 0/1 Completed 0 22h
nvidia-dcgm-exporter-7xkdt 1/1 Running 0 22h
nvidia-dcgm-twrj9 1/1 Running 0 22h
nvidia-device-plugin-daemonset-gjm67 1/1 Running 0 22h
nvidia-device-plugin-validator-frd7l 0/1 UnexpectedAdmissionError 0 3m27s
nvidia-driver-daemonset-415.92.202407191425-0-bs2xs 2/2 Running 0 22h
nvidia-node-status-exporter-zzrdg 1/1 Running 0 22h
nvidia-operator-validator-hr5pg 0/1 Init:3/4 201 (6m16s ago) 22h

$ oc describe pod -n nvidia-gpu-operator nvidia-device-plugin-validator-frd7l
Name:            nvidia-device-plugin-validator-frd7l
Namespace:       nvidia-gpu-operator
Priority:        0
Service Account: nvidia-operator-validator
Node:            wrk-89/
Start Time:      Thu, 07 Nov 2024 11:54:27 -0700
Labels:          app=nvidia-device-plugin-validator
Annotations:     openshift.io/scc: restricted-v2
                 seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:          Failed
Reason:          UnexpectedAdmissionError
Message:         Pod was rejected: Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected
IP:
IPs:             <none>
Controlled By:   ClusterPolicy/gpu-cluster-policy
Init Containers:
  plugin-validation:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:70a0bd29259820d6257b04b0cdb6a175f9783d4dd19ccc4ec6599d407c359ba5
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      vectorAdd
    Limits:
      nvidia.com/gpu: 1
    Requests:
      nvidia.com/gpu: 1
    Environment: <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8h2dn (ro)
Containers:
  nvidia-device-plugin-validator:
    Image:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:70a0bd29259820d6257b04b0cdb6a175f9783d4dd19ccc4ec6599d407c359ba5
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      echo device-plugin workload validation is successful
    Environment: <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8h2dn (ro)
Volumes:
  kube-api-access-8h2dn:
    Type:                   Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds: 3607
    ConfigMapName:          kube-root-ca.crt
    ConfigMapOptional:      <nil>
    DownwardAPI:            true
    ConfigMapName:          openshift-service-ca.crt
    ConfigMapOptional:      <nil>
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                 nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason                    Age    From     Message
  ----     ------                    ----   ----     -------
  Warning  UnexpectedAdmissionError  3m40s  kubelet  Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected
```
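The `UnexpectedAdmissionError` ("Requested: 1, Available: 0") suggests the kubelet's view of the device plugin on wrk-89 is stale even though the node advertises a GPU. One low-risk thing to try, offered here as an assumption rather than a confirmed fix, is to bounce the device plugin pod on that node so it re-registers its devices with the kubelet; the `app=nvidia-device-plugin-daemonset` label is assumed to be the standard one on those pods:

```bash
# Sketch: restart the device plugin on wrk-89 so it re-advertises its GPUs to the kubelet.
# The daemonset recreates the pod automatically; this is a guessed workaround, not a fix from NVIDIA.
oc -n nvidia-gpu-operator delete pod \
  -l app=nvidia-device-plugin-daemonset \
  --field-selector spec.nodeName=wrk-89
```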
I was trying to figure out why the wrk-99 node in nerc-ocp-prod has 4 GPU devices, but the GPU utilization for wrk-99 on nerc-ocp-prod is not available.

I found that the NVIDIA GPU Operator gpu-cluster-policy on nerc-ocp-prod is in an OperandNotReady status, but the NVIDIA GPU Operator gpu-cluster-policy on nerc-ocp-test is in a Ready state. I also noticed that the plugin-validation container of the nvidia-operator-validator-nnjjx pod in the nvidia-gpu-operator namespace is not becoming ready and has a repeated error in the log:

```
time="2024-10-11T15:35:31Z" level=info msg="pod nvidia-device-plugin-validator-6sb75 is curently in Failed phase"
```

I don't know the reason for this error with the gpu-cluster-policy in nerc-ocp-prod.
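For anyone checking the same thing from the CLI rather than the console, here is a sketch of querying the ClusterPolicy state and the validator pods directly; the `.status.state` field path and the `app=nvidia-operator-validator` label are assumed from the ClusterPolicy CRD and the operator's standard labels, so verify them (e.g. `oc explain clusterpolicy.status`) before relying on this:

```bash
# Sketch: read the ClusterPolicy state and list the validator pods without using the console.
# The .status.state path and the pod label are assumptions about this operator version.
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}'; echo
oc -n nvidia-gpu-operator get pod -l app=nvidia-operator-validator -o wide
```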