Upgrade cuda from 12.4 -> 12.6 #1962

Merged
merged 16 commits into from
Apr 17, 2025

jainapurva
Contributor

@jainapurva jainapurva commented Mar 26, 2025

Updating the cuda version from 12.4 -> 12.6, as 12.4 is not supported anymore.

All PyTorch nightly tests will use cu126 going forward. Only the H100 tests still use cu124 with PyTorch nightly, because the H100 driver upgrade could not be done. Once the H100 driver upgrade issue is fixed, we'll be able to update those CI tests accordingly.
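
The split described above (H100 jobs stay on cu124, everything else moves to cu126) could be sketched roughly as follows. This is only an illustration: the helper names and the runner labels are hypothetical and are not taken from the torchao workflow files; the only assumption about real infrastructure is that PyTorch nightly wheels are served from per-CUDA-tag index URLs.

```python
# Hypothetical helpers; names and runner labels are illustrative only
# and do not come from the torchao CI configuration.
NIGHTLY_INDEX = "https://download.pytorch.org/whl/nightly"


def cuda_tag_for_runner(runner: str) -> str:
    """Return the CUDA wheel tag to use for a given CI runner label."""
    # Per the PR description, H100 jobs stay on cu124 until the
    # H100 driver upgrade is possible; all other jobs move to cu126.
    return "cu124" if "h100" in runner.lower() else "cu126"


def nightly_index_url(runner: str) -> str:
    """Build the pip index URL for PyTorch nightly wheels on this runner."""
    return f"{NIGHTLY_INDEX}/{cuda_tag_for_runner(runner)}"
```

Keeping the exception in one place like this would make it a one-line change to move the H100 jobs over once their driver is upgraded.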

[ghstack-poisoned]
@jainapurva
Contributor Author

jainapurva commented Mar 26, 2025

Stack from ghstack (oldest at bottom):


pytorch-bot bot commented Mar 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1962

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 2 Pending

As of commit c253dc0 with merge base 7fa9c69 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jainapurva added a commit that referenced this pull request Mar 26, 2025
ghstack-source-id: cb1cfff2745bacc99e73276fa2a487ee316bb71d
ghstack-comment-id: 2754997519
Pull Request resolved: #1962
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 26, 2025
@jainapurva jainapurva changed the title Updgrade cuda from 12.4 -> 12.6 Upgrade cuda from 12.4 -> 12.6 Mar 26, 2025
@jainapurva jainapurva requested review from drisspg and atalman March 26, 2025 16:20
@jainapurva jainapurva added topic: not user facing Use this tag if you don't want this PR to show up in release notes ci and removed ci labels Mar 26, 2025
@drisspg
Contributor

drisspg commented Mar 26, 2025

Seems like CI is unhappy though

@jainapurva
Contributor Author

> Seems like CI is unhappy though

Yes, it's not able to install the 12.6 driver. @atalman is looking into it.

@jainapurva
Contributor Author

@huydhn Can you please take a look at this?

Contributor

@huydhn huydhn left a comment

LGTM!

@huydhn
Contributor

huydhn commented Apr 15, 2025

Let me take a look at the CI error

malfet pushed a commit to pytorch/test-infra that referenced this pull request Apr 16, 2025
This is reported by @jainapurva from TorchAO
https://github.com/pytorch/ao/actions/runs/14461343872/job/40554407661
trying to upgrade CUDA from 12.4 to 12.6
pytorch/ao#1962. It turns out that the NVIDIA
driver we are currently using, `550.54.15`, is too old.

I grabbed the latest production driver from NVIDIA, which should satisfy not
only CUDA 12.6 but 12.8 too:

* https://docs.nvidia.com/cuda/archive/12.8.0/cuda-toolkit-release-notes/index.html

This should help fix the issue on AO. The driver also needs to be updated in
a couple of other places.

### Testing

* Manual. I installed the driver manually and could start the container
without any issue: `docker run --gpus all -it
pytorch/almalinux-builder:cuda12.6 /bin/bash`
* https://github.com/pytorch/test-infra/actions/runs/14481525016
* ~~Also test this out on AO
https://github.com/pytorch/ao/actions/runs/14481627872/job/40619564626~~
I think this needs to be landed first as AO jobs still point to
`test-infra@main`
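
The failure mode described in this commit message (an installed driver that is older than what the CUDA toolkit needs) can be checked up front with a small script. The following is a minimal sketch, not the check used in test-infra: the function names are hypothetical, and the minimum version `560.28.03` is an assumed value for CUDA 12.6 on Linux that should be confirmed against NVIDIA's CUDA Toolkit release notes before relying on it.

```python
import subprocess

# Assumed minimum Linux driver for CUDA 12.6; verify against the
# NVIDIA CUDA Toolkit release notes before using this value.
ASSUMED_MIN_DRIVER_CUDA_12_6 = "560.28.03"


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted driver version such as '550.54.15' into (550, 54, 15)."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_new_enough(installed: str, minimum: str) -> bool:
    """Compare versions numerically, component by component."""
    return parse_version(installed) >= parse_version(minimum)


def installed_driver_version() -> str:
    """Ask nvidia-smi for the driver version of the first GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return out.splitlines()[0].strip()
```

On a GPU host, `driver_is_new_enough(installed_driver_version(), ASSUMED_MIN_DRIVER_CUDA_12_6)` would then report whether the machine is ready for the cu126 jobs. Comparing integer tuples rather than raw strings avoids ordering mistakes between components of different widths.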
@huydhn
Contributor

huydhn commented Apr 16, 2025

I think I have almost everything working, with 2 remaining issues:

@jainapurva jainapurva merged commit b195c57 into main Apr 17, 2025
34 checks passed
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. topic: not user facing Use this tag if you don't want this PR to show up in release notes
5 participants