
Upgrade NVIDIA driver for CUDA 12.6+ #6530


Merged on Apr 16, 2025 (3 commits)


@huydhn huydhn commented Apr 15, 2025

This was reported by @jainapurva from TorchAO (https://github.com/pytorch/ao/actions/runs/14461343872/job/40554407661) while trying to upgrade CUDA from 12.4 to 12.6 in pytorch/ao#1962. It turns out that the NVIDIA driver we are currently using, 550.54.15, is too old.

I grabbed the latest production driver from NVIDIA, which should satisfy not only CUDA 12.6 but 12.8 too:

This should help fix the issue on AO. The driver also needs to be updated in a couple of other places.
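For anyone who wants to check a runner before bumping CUDA, here is a minimal sketch of the driver comparison described above. It is not part of this PR's change. The helper names (`parse_driver_version`, `driver_is_at_least`, `installed_driver_version`) are hypothetical, and the minimum version you pass in is an assumption that should be taken from NVIDIA's CUDA release notes rather than from this snippet.

```python
import subprocess


def parse_driver_version(version: str) -> tuple[int, ...]:
    """Turn a driver string such as '550.54.15' into (550, 54, 15)."""
    return tuple(int(part) for part in version.strip().split("."))


def driver_is_at_least(installed: str, required: str) -> bool:
    """Compare dotted driver versions numerically, component by component."""
    return parse_driver_version(installed) >= parse_driver_version(required)


def installed_driver_version() -> str:
    """Ask nvidia-smi for the driver version of the first GPU it lists.

    Only works on a host where the NVIDIA driver and nvidia-smi are installed.
    """
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return out.splitlines()[0].strip()
```

As a usage example, `driver_is_at_least("550.54.15", "560.28.03")` returns `False`, where `560.28.03` is only an illustrative threshold. Comparing integer tuples instead of raw strings avoids mistakes such as treating `"99.0"` as newer than `"550.54.15"`.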

Testing

@huydhn huydhn requested review from seemethere and malfet April 15, 2025 23:34

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 15, 2025
huydhn added a commit to pytorch/ao that referenced this pull request Apr 15, 2025
huydhn added a commit to pytorch/ao that referenced this pull request Apr 15, 2025

zxiiro commented Apr 16, 2025

Do we have some kind of plan or notification to the team periodically to remind us to update versions of things?

(Especially for core packages like GPU drivers for this project.)

@malfet malfet merged commit 5116146 into main Apr 16, 2025
7 checks passed
@malfet malfet deleted the update-nvidia-driver-cuda12.6 branch April 16, 2025 14:59

malfet commented Apr 16, 2025

NVIDIA guarantees forward compatibility, so this should not be necessary. I guess the error stems from the fact that we installed a Docker version that is too new alongside an older driver, and Docker makes no guarantees about that combination.


huydhn commented Apr 16, 2025

> Do we have some kind of plan or notification to the team periodically to remind us to update versions of things?
>
> (Especially for core packages like GPU drivers for this project.)

Let me note it down to check with @atalman when he's back to see if we have a runbook for updating CUDA. I think this step could be added to that runbook.
