
Fix to NVIDIA GPU driver devices as volume mounts #586

Merged: 1 commit into OCP-on-NERC:main on Nov 4, 2024

Conversation

@computate (Member) commented Oct 29, 2024

A suggested fix from NVIDIA for the drivers that were going missing on the GPU nodes in the prod, test, and test-2 clusters. This fixes a bug with unprivileged access to the device plugin.

These fixes to the GPU Cluster Policy have resolved the GPU driver issues on the test cluster.

Fixes nerc-project/operations#768

Important note: this PR requires removing the GPU Cluster Policy, uninstalling the nvidia-gpu-operator, and then reinstalling the GPU Operator and the GPU Cluster Policy.

In draft until the code freeze is over
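
The actual diff is not reproduced in this conversation, so as rough orientation only: the GPU Operator's ClusterPolicy exposes a documented device-plugin option for handing GPU device lists to containers as volume mounts instead of through the NVIDIA_VISIBLE_DEVICES environment variable. A minimal sketch of that kind of setting follows; the field below is an assumption based on the device plugin's documented DEVICE_LIST_STRATEGY option, not a copy of this PR's diff.

    # Sketch only (assumption, not the PR diff): deliver GPU device lists to
    # pods as volume mounts rather than via environment variables.
    apiVersion: nvidia.com/v1
    kind: ClusterPolicy
    metadata:
      name: gpu-cluster-policy
    spec:
      devicePlugin:
        env:
        - name: DEVICE_LIST_STRATEGY   # the default strategy is "envvar"
          value: volume-mounts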

@computate (Member, Author) commented:

  • Applying this fix to the test cluster involves disabling auto-sync for the cluster-scope-test and nvidia-gpu-operator-test ArgoCD applications (see the Argo CD sketch after this list).
  • Delete the GPU ClusterPolicy and MIG ConfigMap: oc --as system:admin delete -k nvidia-gpu-operator/overlays/nerc-ocp-test/
  • Delete the NVIDIA GPU Operator: oc --as system:admin delete -k cluster-scope/bundles/nvidia-gpu-operator/
  • Re-enable auto-sync for cluster-scope-test and run a Sync.
  • Re-enable auto-sync for nvidia-gpu-operator-test and run a Sync.
  • Check that the NVIDIA pods start Running: oc get pods -n nvidia-gpu-operator
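
For reference, auto-sync on an Argo CD Application is controlled by its syncPolicy: removing the automated block (or toggling it in the Argo CD UI) disables auto-sync, and restoring it re-enables it. A minimal sketch, assuming a standard Argo CD setup; the namespace, project, repoURL, and destination below are placeholders, not values from this repo.

    # Sketch of the Argo CD auto-sync toggle; metadata.namespace, project,
    # repoURL, and destination are placeholders.
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: nvidia-gpu-operator-test
      namespace: argocd                # placeholder
    spec:
      project: default                 # placeholder
      source:
        repoURL: https://example.com/nerc-ocp-config.git   # placeholder
        path: nvidia-gpu-operator/overlays/nerc-ocp-test
      destination:
        server: https://kubernetes.default.svc
      syncPolicy:
        automated:                     # remove this block to disable auto-sync
          prune: true
          selfHeal: true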

Commit: A suggested fix from NVIDIA for the drivers that were going missing on the GPU nodes in the prod, test, and test-2 clusters. Fixes nerc-project/operations#768
@schwesig added the bug (Something isn't working) label on Oct 31, 2024.
@computate (Member, Author) commented:

Another update I received: this is needed because we set up the validator plugin, which is set to false by default in the clusterPolicy:

  validator:
    plugin:
      env:
      - name: WITH_WORKLOAD
        value: "true"

We need this workaround to use volume mounts because the validator pods are not running as privileged.
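
One way to read this, stated as an assumption rather than a quote from the PR: the container toolkit can be configured not to honor NVIDIA_VISIBLE_DEVICES coming from unprivileged pods, in which case the unprivileged validator workload only receives a GPU when devices arrive as volume mounts. A sketch of those toolkit-side options; the env names mirror the container toolkit's documented accept-nvidia-visible-devices settings and are an assumption about how they are surfaced in the ClusterPolicy.

      # Sketch (assumption, not the PR diff): reject the env-var device path
      # for unprivileged pods and accept the volume-mounts path instead,
      # which the unprivileged validator workload then relies on.
      toolkit:
        env:
        - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
          value: "false"
        - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
          value: "true"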

@computate marked this pull request as ready for review on November 1, 2024 at 14:11.
@schwesig merged commit 81ec27f into OCP-on-NERC:main on Nov 4, 2024. 2 checks passed.
@schwesig (Contributor) commented Nov 5, 2024

  • kruize
  • nerc-ocp-test-2.nerc.mghpcc.org
    They applied the changes manually, and it works for them again now.

@computate (Member, Author) commented:

This PR was merged, but we still have a pod failing on wrk-97, and there is currently 0 GPU utilization in the prod cluster, so I'm going to proceed with uninstalling the NVIDIA GPU Operator and reinstalling it in the prod cluster to finish resolving this issue.

$ oc -n nvidia-gpu-operator get pod/nvidia-operator-validator-nj8hx
NAME                              READY   STATUS                  RESTARTS          AGE
nvidia-operator-validator-nj8hx   0/1     Init:CrashLoopBackOff   1046 (2m4s ago)   5d2h

@computate (Member, Author) commented:

Sorry again for kicking off this update of the NVIDIA operator in prod. I think these pods are not restarting because draining the node is blocked:
https://console.apps.shift.nerc.mghpcc.org/api/kubernetes/api/v1/namespaces/nvidia-gpu-operato[…]2407191425-0-4s8cl/log?container=k8s-driver-manager

Auto drain of the node wrk-103 is disabled by the upgrade policy
Failed to uninstall nvidia driver components
Auto eviction of GPU pods on node wrk-103 is disabled by the upgrade policy
Auto drain of the node wrk-103 is disabled by the upgrade policy
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/wrk-103 labeled

@computate (Member, Author) commented:

Running with these gpu-cluster-policy settings is allowing the drivers to be installed on 9 of 11 nodes so far.

    manager:
      env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "true"
      - name: ENABLE_AUTO_DRAIN
        value: "true"
      - name: DRAIN_USE_FORCE
        value: "true"
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "true"
    repoConfig:
      configMapName: ""
    upgradePolicy:
      autoUpgrade: false
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: true
        force: true
        timeoutSeconds: 300

Labels: bug (Something isn't working)
Linked issue: NVIDIA GPU Operator gpu-cluster-policy in OperandNotReady state in multiple clusters
4 participants