
Fix to NVIDIA GPU driver devices as volume mounts #586

Merged: 1 commit into OCP-on-NERC:main on Nov 4, 2024

Conversation

@computate (Member) commented Oct 29, 2024

A suggested fix from NVIDIA for the drivers that were going missing on the GPU nodes in the prod, test, and test-2 clusters. This fixes a bug with unprivileged access to the device plugin.

These fixes to the GPU Cluster Policy have resolved the GPU driver issues on the test cluster.

Fixes nerc-project/operations#768

Important note: this PR requires removing the GPU Cluster Policy, uninstalling the nvidia-gpu-operator, and then reinstalling the GPU Operator and the GPU Cluster Policy.

In draft until the code freeze is over
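
The actual diff is not reproduced in this conversation, so as rough orientation only: the GPU Operator's ClusterPolicy exposes a documented device-plugin option for handing GPU device lists to containers as volume mounts instead of through the NVIDIA_VISIBLE_DEVICES environment variable. A minimal sketch of that kind of setting follows; the field below is an assumption based on the device plugin's documented DEVICE_LIST_STRATEGY option, not a copy of this PR's diff.

    # Sketch only (assumption, not the PR diff): deliver GPU device lists to
    # pods as volume mounts rather than via environment variables.
    apiVersion: nvidia.com/v1
    kind: ClusterPolicy
    metadata:
      name: gpu-cluster-policy
    spec:
      devicePlugin:
        env:
        - name: DEVICE_LIST_STRATEGY   # the default strategy is "envvar"
          value: volume-mounts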

@computate (Member, Author) commented:

  • Applying this fix to the test cluster involves disabling auto-sync for the cluster-scope-test and nvidia-gpu-operator-test ArgoCD applications (see the Argo CD sketch after this list).
  • Delete the GPU ClusterPolicy and MIG ConfigMap: oc --as system:admin delete -k nvidia-gpu-operator/overlays/nerc-ocp-test/
  • Delete the NVIDIA GPU Operator: oc --as system:admin delete -k cluster-scope/bundles/nvidia-gpu-operator/
  • Re-enable auto-sync for cluster-scope-test and run a Sync.
  • Re-enable auto-sync for nvidia-gpu-operator-test and run a Sync.
  • Check that the NVIDIA pods start Running: oc get pods -n nvidia-gpu-operator
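
For reference, auto-sync on an Argo CD Application is controlled by its syncPolicy: removing the automated block (or toggling it in the Argo CD UI) disables auto-sync, and restoring it re-enables it. A minimal sketch, assuming a standard Argo CD setup; the namespace, project, repoURL, and destination below are placeholders, not values from this repo.

    # Sketch of the Argo CD auto-sync toggle; metadata.namespace, project,
    # repoURL, and destination are placeholders.
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: nvidia-gpu-operator-test
      namespace: argocd                # placeholder
    spec:
      project: default                 # placeholder
      source:
        repoURL: https://example.com/nerc-ocp-config.git   # placeholder
        path: nvidia-gpu-operator/overlays/nerc-ocp-test
      destination:
        server: https://kubernetes.default.svc
      syncPolicy:
        automated:                     # remove this block to disable auto-sync
          prune: true
          selfHeal: true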

Commit: A suggested fix from NVIDIA for the drivers that were going missing on the GPU nodes in the prod, test, and test-2 clusters. Fixes nerc-project/operations#768
@schwesig added the bug (Something isn't working) label on Oct 31, 2024.
@computate (Member, Author) commented:

Another update I received: this is needed because we set up the validator plugin, which is set to false by default in the clusterPolicy:

  validator:
    plugin:
      env:
      - name: WITH_WORKLOAD
        value: "true"

We need this workaround to use volume mounts because the validator pods are not running as privileged.
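
One way to read this, stated as an assumption rather than a quote from the PR: the container toolkit can be configured not to honor NVIDIA_VISIBLE_DEVICES coming from unprivileged pods, in which case the unprivileged validator workload only receives a GPU when devices arrive as volume mounts. A sketch of those toolkit-side options; the env names mirror the container toolkit's documented accept-nvidia-visible-devices settings and are an assumption about how they are surfaced in the ClusterPolicy.

      # Sketch (assumption, not the PR diff): reject the env-var device path
      # for unprivileged pods and accept the volume-mounts path instead,
      # which the unprivileged validator workload then relies on.
      toolkit:
        env:
        - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
          value: "false"
        - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
          value: "true"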

@computate marked this pull request as ready for review on November 1, 2024 at 14:11.
@schwesig merged commit 81ec27f into OCP-on-NERC:main on Nov 4, 2024. 2 checks passed.
@schwesig (Contributor) commented Nov 5, 2024

  • kruize
  • nerc-ocp-test-2.nerc.mghpcc.org
    They applied the changes manually, and it works for them again now.

@computate (Member, Author) commented:

This PR was merged, but we still have a pod failing on wrk-97, and there is currently 0 GPU utilization in the prod cluster, so I'm going to proceed with uninstalling the NVIDIA GPU Operator and reinstalling it in the prod cluster to finish resolving this issue.

$ oc -n nvidia-gpu-operator get pod/nvidia-operator-validator-nj8hx
NAME                              READY   STATUS                  RESTARTS          AGE
nvidia-operator-validator-nj8hx   0/1     Init:CrashLoopBackOff   1046 (2m4s ago)   5d2h

@computate (Member, Author) commented:

Sorry again for kicking off this update of the NVIDIA operator in prod. I think these pods are not restarting because draining the node is blocked:
https://console.apps.shift.nerc.mghpcc.org/api/kubernetes/api/v1/namespaces/nvidia-gpu-operato[…]2407191425-0-4s8cl/log?container=k8s-driver-manager

Auto drain of the node wrk-103 is disabled by the upgrade policy
Failed to uninstall nvidia driver components
Auto eviction of GPU pods on node wrk-103 is disabled by the upgrade policy
Auto drain of the node wrk-103 is disabled by the upgrade policy
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/wrk-103 labeled

@computate (Member, Author) commented:

Running with these gpu-cluster-policy settings is allowing the drivers to be installed on 9 of 11 nodes so far.

    manager:
      env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "true"
      - name: ENABLE_AUTO_DRAIN
        value: "true"
      - name: DRAIN_USE_FORCE
        value: "true"
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "true"
    repoConfig:
      configMapName: ""
    upgradePolicy:
      autoUpgrade: false
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: true
        force: true
        timeoutSeconds: 300

Labels: bug (Something isn't working)
Linked issue: NVIDIA GPU Operator gpu-cluster-policy in OperandNotReady state in multiple clusters
4 participants