{2023.06}[2023a] PyTorch v2.1.2 with CUDA/12.1.1 #973
New (final?) attempt to build PyTorch/2.1.2 with CUDA/12.1.1
This PR should replace previous attempts:
The PR is based on extensive testing / debugging / analysis on a VM with Haswell CPUs and NVIDIA L40S vGPUs (CUDA compute capability 8.9). It benefits from the recently rebuilt CUDA/12.1.1 modules (#919), which add a directory containing the needed libraries to $LIBRARY_PATH in the module files, such that the RPATH wrappers used for building software in EESSI add the necessary arguments to the linker command. After that, nearly 100 tests of the PyTorch test suite (which contains about 207k tests) still failed. Most of these ~100 tests failed because libcudnn_cnn_train.so.8 could not be found when it was dynamically loaded by another library of the cuDNN SDK. Fixing this issue would normally be achieved by adding the directory containing libcudnn_cnn_train.so.8 to $LD_LIBRARY_PATH
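As an illustration (not part of the PR), the load failure can be reproduced outside of PyTorch by asking the dynamic loader for the library directly; the path in the comment is only a placeholder for wherever the cuDNN SDK is installed:

```python
import ctypes

# Ask the dynamic loader for the lazily loaded cuDNN sub-library.
# Without the cuDNN library directory on LD_LIBRARY_PATH (or a needed/RPATH
# entry pointing at it), this raises OSError ("cannot open shared object
# file: No such file or directory").
try:
    ctypes.CDLL("libcudnn_cnn_train.so.8")
    print("libcudnn_cnn_train.so.8 was found by the dynamic loader")
except OSError as err:
    print(f"load failed: {err}")
    # The usual (but unwanted in EESSI) workaround would be:
    #   export LD_LIBRARY_PATH=/path/to/cudnn/lib:$LD_LIBRARY_PATH
    # where /path/to/cudnn/lib is a placeholder for the actual install dir.
```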
. Because we don't want to do that in EESSI, and we also don't want to change the binary distribution of the cuDNN SDK (which would likely violate the cuDNN SDK's license), we chose to work around this by adding a dependency on the above library to libtorch_cuda.so, which is built when installing PyTorch. This work-around is implemented in commit 4cc89fd. After that fix, 9+1 tests still failed.
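Commit 4cc89fd implements the work-around in the context of the EESSI build; as a sketch of the general technique only (with a placeholder path, not the PR's actual code), an extra DT_NEEDED entry can be added to the already-built shared object with patchelf:

```python
import subprocess

# Sketch: record libcudnn_cnn_train.so.8 as a direct (DT_NEEDED) dependency
# of libtorch_cuda.so, so the dynamic loader resolves it via the existing
# RPATH entries instead of relying on LD_LIBRARY_PATH at dlopen time.
# The path is a placeholder; the real file lives under the PyTorch
# installation (e.g. .../site-packages/torch/lib/).
libtorch_cuda_path = "/path/to/torch/lib/libtorch_cuda.so"

subprocess.run(
    ["patchelf", "--add-needed", "libcudnn_cnn_train.so.8", libtorch_cuda_path],
    check=True,
)
```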
All failing tests were analysed individually:
- Some of the failures could be reproduced when running the exact same tests as run by the test suite, but not when running the essential code from separate Python scripts (see the sketch after this list). This might indicate that the issue lies with the test environment rather than with the actual code being tested.
- Some failures could be related to the specific environment used for building (a VM with vGPUs).
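For context (an illustration, not code from the PR), "running the essential code from a separate Python script" means exercising the same kind of operation outside the test harness, for example a standalone forward/backward convolution on the GPU, which goes through cuDNN and triggers the lazy loading of libcudnn_cnn_train.so.8:

```python
import torch

# Standalone reproduction attempt outside the PyTorch test suite:
# a forward+backward pass of a convolution on the GPU goes through cuDNN,
# which lazily loads its cnn_train sub-library for the backward kernels.
# If this runs cleanly, the failure seen in the test suite is more likely
# caused by the test environment than by the built library.
assert torch.cuda.is_available(), "this check needs a CUDA-capable GPU"

conv = torch.nn.Conv2d(3, 8, kernel_size=3).cuda()
x = torch.randn(4, 3, 32, 32, device="cuda", requires_grad=True)
y = conv(x)
y.sum().backward()
torch.cuda.synchronize()
print("conv forward/backward on CUDA completed")
```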
Altogether, it seems reasonable to move forward with the changes suggested in this PR -- patching libtorch_cuda.so and accepting a few more failing tests. The building could be done in two steps: the first runs the test suite (excluding test_cuda_expandable_segments) to obtain some reference for building the package on build hosts with GPUs. The latter commit also adds a sanity check that verifies that libtorch_cuda.so depends on libcudnn_cnn_train.so.8.
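As a rough illustration of what such a sanity check amounts to (the PR's actual check may be written differently, and the path below is a placeholder), one can inspect the DT_NEEDED entries of the patched shared object:

```python
import subprocess

# Verify that libcudnn_cnn_train.so.8 appears among the direct (DT_NEEDED)
# dependencies of libtorch_cuda.so after the patching step.
# The path is a placeholder for the file in the actual PyTorch installation.
libtorch_cuda_path = "/path/to/torch/lib/libtorch_cuda.so"

needed = subprocess.run(
    ["patchelf", "--print-needed", libtorch_cuda_path],
    capture_output=True, text=True, check=True,
).stdout.split()

assert "libcudnn_cnn_train.so.8" in needed, \
    "libtorch_cuda.so does not depend on libcudnn_cnn_train.so.8"
print("sanity check passed: libcudnn_cnn_train.so.8 found in DT_NEEDED")
```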