Upgrade cuda from 12.4 -> 12.6 #1962

jainapurva · 2025-03-26T16:19:34Z

Update the CUDA version from 12.4 to 12.6, as 12.4 is no longer supported.

All PyTorch nightly tests will use cu126 going forward. Only the H100 tests still use cu124 with PyTorch nightly, because the H100 driver upgrade could not be done. Once the H100 driver upgrade issue is fixed, we'll update those CI tests accordingly.

[ghstack-poisoned]

jainapurva · 2025-03-26T16:19:36Z

Stack from ghstack (oldest at bottom):

-> Upgrade cuda from 12.4 -> 12.6 #1962

pytorch-bot · 2025-03-26T16:19:38Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1962

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 2 Pending

As of commit c253dc0 with merge base 7fa9c69:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


drisspg · 2025-03-26T21:03:06Z

Seems like CI is unhappy though

jainapurva · 2025-03-26T21:05:09Z

> Seems like CI is unhappy though

Yes, it's not able to install the 12.6 driver. @atalman is looking into it.

jainapurva · 2025-04-14T22:42:11Z

@huydhn Can you please take a look at this?

huydhn

LGTM!

huydhn · 2025-04-15T01:15:44Z

Let me take a look at the CI error

.github/workflows/float8_test.yml

.github/workflows/float8nocompile_test.yaml

.github/workflows/nightly_smoke_test.yml

.github/workflows/regression_test.yml

…load.pytorch.org/whl/cu126

@jainapurva

This was reported by @jainapurva from TorchAO (https://github.com/pytorch/ao/actions/runs/14461343872/job/40554407661) while trying to upgrade CUDA from 12.4 to 12.6 in pytorch/ao#1962. It turns out that the NVIDIA driver we are currently using, `550.54.15`, is too old. I grabbed the latest production driver from NVIDIA, which should satisfy not only CUDA 12.6 but 12.8 too:

* https://docs.nvidia.com/cuda/archive/12.8.0/cuda-toolkit-release-notes/index.html

This should help fix the issue on AO. The driver also needs to be updated in a couple of other places.

### Testing

* Manual: I installed the driver manually and can start the container without any issue: `docker run --gpus all -it pytorch/almalinux-builder:cuda12.6 /bin/bash`
* https://github.com/pytorch/test-infra/actions/runs/14481525016
* ~~Also test this out on AO https://github.com/pytorch/ao/actions/runs/14481627872/job/40619564626~~ I think this needs to be landed first, as AO jobs still point to `test-infra@main`.
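For illustration only, here is a minimal Python sketch of the kind of comparison that flags a driver such as `550.54.15` as too old. The helper names `parse_driver_version` and `driver_is_too_old` are hypothetical and are not part of test-infra. The required minimum is left as a parameter because the actual value for a given CUDA release should be read from the NVIDIA release notes linked above.

```python
def parse_driver_version(version: str) -> tuple[int, ...]:
    """Turn a driver string such as '550.54.15' into a comparable tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def driver_is_too_old(installed: str, required_minimum: str) -> bool:
    """Return True when the installed driver is strictly below the required minimum.

    `required_minimum` is supplied by the caller; this sketch does not encode
    any official NVIDIA compatibility table.
    """
    return parse_driver_version(installed) < parse_driver_version(required_minimum)
```

Comparing parsed tuples rather than raw strings avoids lexicographic surprises, for example `"99.0"` sorting after `"550.0"` as text.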

huydhn · 2025-04-16T17:35:24Z

I think I have almost everything working, with 2 remaining issues:

1. There is no PyTorch 2.5.1 with CUDA 12.6. This is an old release, so I think that test needs to be updated or removed: https://github.com/pytorch/ao/actions/runs/14498427545/job/40672311068?pr=1962#step:15:720. @jainapurva Please help take a look.
2. The NVIDIA driver on H100 still needs to be updated separately (https://github.com/pytorch/ao/actions/runs/14498428371/job/40672018968?pr=1962), because it is set up differently from a regular GPU runner. I need to check with @jeanschmidt to see if there is a way to update the driver there without creating a new H100 runner.
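The first issue can be sketched as a filter over the test matrix. This is a hedged illustration, not the actual workflow logic: `cuda_tag`, `UNAVAILABLE`, and `runnable_jobs` are hypothetical names, and the only unavailable combination listed is the one reported in this thread (PyTorch 2.5.1 with CUDA 12.6).

```python
def cuda_tag(cuda_version: str) -> str:
    """Map a CUDA version like '12.6' to a wheel tag like 'cu126'."""
    major, minor = cuda_version.split(".")[:2]
    return f"cu{major}{minor}"


# Taken from the thread: there is no PyTorch 2.5.1 build for CUDA 12.6.
UNAVAILABLE = {("2.5.1", "cu126")}


def runnable_jobs(torch_versions, cuda_version, unavailable=UNAVAILABLE):
    """Return (torch_version, cuda_tag) pairs that are not known to be missing."""
    tag = cuda_tag(cuda_version)
    return [(version, tag) for version in torch_versions if (version, tag) not in unavailable]
```

For example, `runnable_jobs(["2.5.1", "nightly"], "12.6")` returns `[("nightly", "cu126")]`, which matches the suggestion to update or remove the 2.5.1 test when moving to cu126.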

Update

9597af6

[ghstack-poisoned]

jainapurva added a commit that referenced this pull request Mar 26, 2025

Updgrade cuda from 12.4 -> 12.6

8380c21

ghstack-source-id: cb1cfff2745bacc99e73276fa2a487ee316bb71d ghstack-comment-id: 2754997519 Pull Request resolved: #1962

facebook-github-bot added the CLA Signed label Mar 26, 2025

jainapurva changed the title ~~Updgrade cuda from 12.4 -> 12.6~~ Upgrade cuda from 12.4 -> 12.6 Mar 26, 2025

jainapurva requested review from drisspg and atalman March 26, 2025 16:20

jainapurva added the topic: not user facing and ci labels and removed the ci label Mar 26, 2025

drisspg approved these changes Mar 26, 2025


huydhn approved these changes Apr 15, 2025
