Upgrade NVIDIA driver for CUDA 12.6+ #6530
Conversation
Do we have some kind of plan or periodic notification to remind the team to update versions of things? (Especially for core packages like GPU drivers used by this project.)
NVIDIA guarantees forward compatibility, so this should not be necessary, but I guess the error stems from the fact that we installed a newer Docker image with an older driver, and Docker makes no guarantees about that combination.
Let me note it down to check with @atalman when he's back to see if we have a runbook for updating CUDA. I think this step could be added to that runbook.
This is the second part of #6530
This was reported by @jainapurva from TorchAO (https://github.com/pytorch/ao/actions/runs/14461343872/job/40554407661) while trying to upgrade CUDA from 12.4 to 12.6 in pytorch/ao#1962. It turns out that the NVIDIA driver we are currently using, 550.54.15, is too old. I grabbed the latest production driver from NVIDIA, which should satisfy not only CUDA 12.6 but also 12.8:
This should help fix the issue on AO. The driver also needs to be updated in a couple of other places.
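For reference, one way to confirm which driver a runner actually has, and the highest CUDA version that driver advertises, is via nvidia-smi. This is a minimal sketch, assuming nvidia-smi is on the PATH (it is installed alongside the driver):

# Print just the installed driver version, e.g. 550.54.15 on the current runners
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# The header of plain nvidia-smi also reports the maximum CUDA version the driver supports
nvidia-smi | head -n 4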
Testing
docker run --gpus all -it pytorch/almalinux-builder:cuda12.6 /bin/bash
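A follow-up smoke test inside the container can confirm the host driver is new enough for the bundled toolkit. This is a sketch, assuming the image ships nvcc and the host has the NVIDIA container toolkit configured:

docker run --gpus all --rm pytorch/almalinux-builder:cuda12.6 /bin/bash -c '
  nvidia-smi        # host driver as exposed inside the container
  nvcc --version    # CUDA toolkit bundled in the image (12.6 here)
'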
Also tested this out on AO: https://github.com/pytorch/ao/actions/runs/14481627872/job/40619564626. I think this needs to land first, as AO jobs still point to test-infra@main.