Commit 5116146

Upgrade NVIDIA driver for CUDA 12.6+ (#6530)
This was reported by @jainapurva from TorchAO (https://github.com/pytorch/ao/actions/runs/14461343872/job/40554407661) while trying to upgrade CUDA from 12.4 to 12.6 in pytorch/ao#1962. It turns out that the NVIDIA driver we are currently using, `550.54.15`, is too old. I grabbed the latest production driver from NVIDIA, which should satisfy not only CUDA 12.6 but 12.8 too:

* https://docs.nvidia.com/cuda/archive/12.8.0/cuda-toolkit-release-notes/index.html

This should help fix the issue on AO. The driver also needs to be updated in a couple of other places.

### Testing

* Manual. I installed the driver manually and could start the container without any issue: `docker run --gpus all -it pytorch/almalinux-builder:cuda12.6 /bin/bash`
* https://github.com/pytorch/test-infra/actions/runs/14481525016
* ~~Also test this out on AO https://github.com/pytorch/ao/actions/runs/14481627872/job/40619564626~~ I think this needs to be landed first, as AO jobs still point to `test-infra@main`
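To illustrate the "driver is too old" check described above, here is a minimal sketch of comparing an installed driver version against a required minimum. The helper `version_ge` and the variables `MIN_DRIVER` and `INSTALLED` are hypothetical and not part of `action.yml`; the `570.0` default is only a placeholder, and the real minimum should be taken from the CUDA release notes linked above. The sketch assumes GNU `sort` with `-V` support.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in action.yml): succeeds when version $1 >= version $2.
version_ge() {
  # sort -V orders dotted version strings; if the smaller of the two
  # is $2, then $1 is at least $2.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Placeholder values for illustration only; the minimum is an assumption.
MIN_DRIVER="${MIN_DRIVER:-570.0}"
INSTALLED="${INSTALLED:-550.54.15}"

if version_ge "$INSTALLED" "$MIN_DRIVER"; then
  echo "driver $INSTALLED satisfies minimum $MIN_DRIVER"
else
  echo "driver $INSTALLED is older than minimum $MIN_DRIVER"
fi
```

On a GPU host, `INSTALLED` could be filled from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.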
1 parent bdefd18 commit 5116146

1 file changed: +4 -1 lines changed

.github/actions/setup-nvidia/action.yml

Lines changed: 4 additions & 1 deletion
@@ -7,7 +7,7 @@ inputs:
     description: which driver version to install
     required: false
     type: string
-    default: "550.54.15" # https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-550-54-15/index.html
+    default: "570.133.07" # https://www.nvidia.com/en-us/drivers/details/242273
 
 runs:
   using: composite
@@ -85,6 +85,9 @@ runs:
     echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
   elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
     echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
+
+    # Turn off persistent mode so that the installation script can unload the kernel module
+    sudo killall nvidia-persistenced || true
   else
     HAS_NVIDIA_DRIVER=1
     echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
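The three branches in the hunk above can be sketched as a small standalone function. This is an illustration only: `driver_action` is a hypothetical name, and an empty first argument stands in for a failed version query, which is an assumption because the diff does not show the actual failure condition. The function only prints the decision and does not install or kill anything.

```shell
#!/usr/bin/env bash
# Hypothetical function mirroring the if/elif/else in the hunk above.
# $1: installed driver version (empty is ASSUMED to mean the query failed)
# $2: expected driver version
driver_action() {
  local installed="$1" expected="$2"
  if [ -z "$installed" ]; then
    # Could not determine the version; continue to installation.
    echo "install"
  elif [ "$installed" != "$expected" ]; then
    # A different driver is loaded. The persistence daemon would be
    # stopped first so the installer can unload the kernel module.
    echo "stop-persistenced-then-install"
  else
    # Expected version already present; nothing to do.
    echo "skip"
  fi
}

driver_action "550.54.15" "570.133.07"
driver_action "570.133.07" "570.133.07"
```

In the mismatch case the real step runs `sudo killall nvidia-persistenced || true`; the `|| true` keeps the step from failing when the daemon is not running.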
