{2023.06}[2023a] PyTorch v2.1.2 with CUDA/12.1.1 #973
base: 2023.06-software.eessi.io
Conversation
bot: help
bot: build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80
The build seems to have gone fine, but the test suite failed because I seem to have done something wrong when updating the reframe_config.
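One quick way to catch a broken ReFrame configuration before submitting a full run of the test suite is to let ReFrame parse the config and list the tests it would generate. A minimal sketch, assuming a config file named `reframe_config.py` and a checkout of the EESSI test suite; both paths are placeholders, not the actual files used here.

```bash
# Parse the (placeholder) ReFrame configuration and list the generated test
# cases without launching any jobs; errors in the system/partition definitions
# surface at this stage.
reframe -C reframe_config.py \
        -c /path/to/EESSI/test-suite/eessi/testsuite/tests \
        -l
```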
Running a test build on Snellius. Since this goes to a zen4 partition, the build may include a couple of additional packages...
Next attempt to build on Snellius after extending the walltime limit. Since this goes to a zen4 partition, the build may include a couple of additional packages...
Try building on NVIDIA Grace/Hopper.
Build again on NVIDIA Grace/Hopper after fixing the patch issue (the path to the library needs to take the CPU family into account)...
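A minimal sketch of what taking the CPU family into account could look like, assuming the EESSI initialisation scripts have set `$EESSI_CPU_FAMILY`; the path layout below is illustrative, not the exact location patched in this PR.

```bash
# Derive the library path from the CPU family instead of hard-coding x86_64,
# so the same patch also applies on aarch64 (e.g. NVIDIA Grace).
cpu_family="${EESSI_CPU_FAMILY:-$(uname -m)}"   # e.g. x86_64 or aarch64
lib_dir="/cvmfs/software.eessi.io/versions/2023.06/compat/linux/${cpu_family}/lib"
```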
New (final?) attempt to build PyTorch/2.1.2 with CUDA/12.1.1
This PR should replace previous attempts.

The PR is based on extensive testing / debugging / analysis on a VM with Haswell CPUs and NVIDIA L40S vGPUs (CUDA compute capability 8.9). It benefits from the recently rebuilt CUDA/12.1.1 modules (#919), which add a directory with the needed libraries to `$LIBRARY_PATH` in the module files, so that the RPATH wrappers used for building software in EESSI add the necessary arguments to the linker command.

After that, nearly 100 tests of the PyTorch test suite (which contains about 207k tests) still failed. Most of these ~100 tests failed with an error indicating that `libcudnn_cnn_train.so.8` could not be loaded, because that library is dynamically loaded by another library of the cuDNN SDK. Fixing this issue would normally be achieved by adding the directory containing `libcudnn_cnn_train.so.8` to `$LD_LIBRARY_PATH`. Because we don't want to do that in EESSI, and we also don't want to change the binary distribution of the cuDNN SDK (which would likely violate its license), we chose to work around this by adding a dependency on the above library to `libtorch_cuda.so`, which is built when installing PyTorch. This work-around is implemented in commit 4cc89fd; a sketch of the general approach is included at the end of this description.

After that fix, 9+1 tests still failed.
All failing tests were analysed individually:
- Some of the failures could be reproduced when running the exact same tests as run by the test suite, but not when running the essential code from separate Python scripts (see the sketch after this list). This might indicate that the issue lies with the test environment rather than with the actual code being tested.
- Some failures could be related to the specific environment used for building (a VM with vGPUs).
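For illustration, the two reproduction paths mentioned in the first item could look roughly as follows; the test module, test selection and script name are placeholders rather than the actual failing tests from this PR.

```bash
# (a) run the test exactly as the test suite does, via its test module
#     (placeholder selection, using pytest-style filtering):
python -m pytest test/test_cuda.py -k placeholder_failing_test -v

# (b) run only the essential code path from a separate, minimal Python script:
python minimal_repro.py
```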
Altogether, it seems reasonable to move forward with the changes suggested in this PR -- patching `libtorch_cuda.so` and accepting a few more failing tests. The building could be done in two steps: … (`test_cuda_expandable_segments`) to obtain some reference for building the package on build hosts with GPUs. The latter commit also adds a sanity check that verifies that `libtorch_cuda.so` depends on `libcudnn_cnn_train.so.8`; a sketch of the work-around and this check follows below.
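For reference, a minimal sketch of the kind of work-around and sanity check described above, assuming `patchelf` is available; the paths follow EasyBuild conventions (`$EBROOTPYTORCH`, Python 3.11 for the 2023a toolchain) but are assumptions, and the actual implementation in commit 4cc89fd may differ.

```bash
# Location of the library built by the PyTorch installation (assumed layout).
torch_lib="$EBROOTPYTORCH/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so"

# Before the fix, the cuDNN training library is not a direct dependency:
readelf -d "$torch_lib" | grep NEEDED | grep cudnn_cnn_train \
    || echo "libcudnn_cnn_train.so.8 is not (yet) a direct dependency"

# Work-around: record libcudnn_cnn_train.so.8 as a DT_NEEDED entry, so it is
# resolved through the RPATH already embedded by EESSI's wrappers and is in the
# process by the time another cuDNN library tries to load it at run time.
patchelf --add-needed libcudnn_cnn_train.so.8 "$torch_lib"

# Sanity check: the dependency is recorded and can actually be resolved.
readelf -d "$torch_lib" | grep -q 'NEEDED.*libcudnn_cnn_train\.so\.8'
ldd "$torch_lib" | grep libcudnn_cnn_train
```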