Upgrade NVIDIA driver for CUDA 12.6+ #6530
Conversation
Do we have some kind of plan or periodic notification to remind the team to update versions of things? (Especially for core packages like GPU drivers used by this project.)
NVIDIA guarantees forward compatibility, so this should not be necessary, but I guess the error stems from the fact that we installed a newer Docker image with an older driver, and Docker makes no guarantees about that combination.
Let me note it down to check with @atalman when he's back to see if we have a runbook for updating CUDA. I think this step could be added to that runbook.
This is the second part of #6530
This was reported by @jainapurva from TorchAO (https://github.com/pytorch/ao/actions/runs/14461343872/job/40554407661) while trying to upgrade CUDA from 12.4 to 12.6 in pytorch/ao#1962. It turns out that the NVIDIA driver we are currently using, 550.54.15, is too old. I grabbed the latest production driver from NVIDIA, which should satisfy not only CUDA 12.6 but also 12.8:
This should help fix the issue on AO. The driver also needs to be updated in a couple of other places.
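For reference, one way to confirm which driver a runner actually has, and the highest CUDA version that driver advertises, is via nvidia-smi. This is a minimal sketch, assuming nvidia-smi is on the PATH (it is installed alongside the driver):

# Print just the installed driver version, e.g. 550.54.15 on the current runners
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# The header of plain nvidia-smi also reports the maximum CUDA version the driver supports
nvidia-smi | head -n 4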
Testing
docker run --gpus all -it pytorch/almalinux-builder:cuda12.6 /bin/bash
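A follow-up smoke test inside the container can confirm the host driver is new enough for the bundled toolkit. This is a sketch, assuming the image ships nvcc and the host has the NVIDIA container toolkit configured:

docker run --gpus all --rm pytorch/almalinux-builder:cuda12.6 /bin/bash -c '
  nvidia-smi        # host driver as exposed inside the container
  nvcc --version    # CUDA toolkit bundled in the image (12.6 here)
'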
Also tested this out on AO: https://github.com/pytorch/ao/actions/runs/14481627872/job/40619564626. I think this needs to land first, as AO jobs still point to test-infra@main.