Fix to NVIDIA GPU driver devices as volume mounts #586
Conversation
A suggested fix from NVIDIA for the drivers that were going missing on the GPU nodes in the prod, test, and test-2 clusters. Fixes nerc-project/operations#768
Force-pushed from f107a5b to d132a21
Another update I received: this is because we set up the validator plugin, which by default is set to false in the clusterPolicy:

validator:
  plugin:
    env:
      - name: WITH_WORKLOAD
        value: "true"

We need this workaround for volume mounts because the validator pods are not running as privileged.
Since this PR was merged, we still have a pod failing:

$ oc -n nvidia-gpu-operator get pod/nvidia-operator-validator-nj8hx
NAME                              READY   STATUS                  RESTARTS          AGE
nvidia-operator-validator-nj8hx   0/1     Init:CrashLoopBackOff   1046 (2m4s ago)   5d2h
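One way to see which init container is crashing (the container name driver-validation is an assumption; oc describe lists the actual init containers for the pod):

$ oc -n nvidia-gpu-operator describe pod/nvidia-operator-validator-nj8hx
$ oc -n nvidia-gpu-operator logs pod/nvidia-operator-validator-nj8hx -c driver-validation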
Sorry again for kicking off this update of the NVIDIA operator in prod. I think these pods are not restarting because draining the node is blocked.
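A couple of checks that might show what is blocking the drain (the node name is a placeholder, not from this thread):

$ oc get pdb -A
$ oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data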
Running with these gpu-cluster-policy settings is allowing the drivers to be installed on 9/11 nodes so far:

manager:
  env:
    - name: ENABLE_GPU_POD_EVICTION
      value: "true"
    - name: ENABLE_AUTO_DRAIN
      value: "true"
    - name: DRAIN_USE_FORCE
      value: "true"
    - name: DRAIN_DELETE_EMPTYDIR_DATA
      value: "true"
repoConfig:
  configMapName: ""
upgradePolicy:
  autoUpgrade: false
  drain:
    deleteEmptyDir: true
    enable: true
    force: true
    timeoutSeconds: 300
  maxParallelUpgrades: 1
  maxUnavailable: 25%
  podDeletion:
    deleteEmptyDir: true
    force: true
    timeoutSeconds: 300
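To watch the rollout node by node, something like this should work (the daemonset label is my assumption of the operator's default, not confirmed in this thread):

$ oc -n nvidia-gpu-operator get pods -l app=nvidia-driver-daemonset -o wide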
A suggested fix from NVIDIA for the drivers that were going missing on
the GPU nodes in the prod, test, and test-2 clusters. This fixes a bug with unprivileged access to the device plugin.
These fixes to the GPU Cluster Policy have resolved the GPU driver issues on the test cluster.
Fixes nerc-project/operations#768
Important note: this PR requires removing the GPU Cluster Policy, uninstalling the nvidia-gpu-operator, and then reinstalling both the GPU Operator and the GPU Cluster Policy.
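A rough sketch of that sequence (the subscription name and OLM label are assumptions; adjust to the actual install):

$ oc delete clusterpolicy gpu-cluster-policy
$ oc -n nvidia-gpu-operator delete subscription gpu-operator-certified
$ oc -n nvidia-gpu-operator delete csv -l operators.coreos.com/gpu-operator-certified.nvidia-gpu-operator
# then reinstall the GPU Operator from OperatorHub and re-create the ClusterPolicy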
In draft until the code freeze is over.