{2023.06}[2023a] PyTorch v2.1.2 with CUDA/12.1.1 #973

Open
wants to merge 3 commits into base: 2023.06-software.eessi.io

Conversation

@trz42 (Collaborator) commented Mar 20, 2025

New (final?) attempt to build PyTorch/2.1.2 with CUDA/12.1.1

This PR should replace previous attempts:

The PR is based on extensive testing, debugging, and analysis on a VM with Haswell CPUs and NVIDIA L40S vGPUs (CUDA compute capability 8.9). It benefits from the recently rebuilt CUDA/12.1.1 modules (#919), which add a directory containing the needed libraries to $LIBRARY_PATH in the module files, so that the RPATH wrappers used for building software in EESSI add the necessary arguments to the linker command. Even with that in place, nearly 100 tests of the PyTorch test suite (which contains about 207k tests) still failed. Most of these 100 tests failed with an error such as

Could not load library libcudnn_cnn_train.so.8. Error: libcudnn_cnn_train.so.8: cannot open shared object file: No such file or directory

because the library is loaded dynamically at runtime by another library of the cuDNN SDK. Normally this would be fixed by adding the directory containing libcudnn_cnn_train.so.8 to $LD_LIBRARY_PATH. Since we don't want to do that in EESSI, and we also don't want to modify the binary distribution of the cuDNN SDK (which would likely violate its license), we work around the issue by adding libcudnn_cnn_train.so.8 as a dependency of libtorch_cuda.so, which is built when installing PyTorch. This work-around is implemented in commit 4cc89fd.
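
The actual implementation is in commit 4cc89fd; purely as an illustration of the general technique (not the commit itself), the extra NEEDED entry could be recorded with patchelf after the installation, for instance as sketched below. The path to libtorch_cuda.so is an assumed example:

```python
# Hypothetical sketch: register libcudnn_cnn_train.so.8 as a NEEDED entry of
# libtorch_cuda.so with patchelf, so the dynamic linker resolves it via the
# RPATH of libtorch_cuda.so instead of relying on $LD_LIBRARY_PATH.
import subprocess

# assumed example path, not the actual EESSI installation prefix
libtorch_cuda = "/path/to/PyTorch/2.1.2/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so"

# add the dependency to the ELF dynamic section
subprocess.run(["patchelf", "--add-needed", "libcudnn_cnn_train.so.8", libtorch_cuda], check=True)

# print the resulting list of NEEDED libraries to confirm the entry was added
needed = subprocess.run(["patchelf", "--print-needed", libtorch_cuda],
                        capture_output=True, text=True, check=True).stdout
print(needed)
```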

After that fix, 9+1 tests still failed. These failing tests are

dynamo/test_functions 1/1 (1 failed, 167 passed, 2 rerun)
dynamo/test_dynamic_shapes 1/1 (2 failed, 2065 passed, 14 skipped, 33 xfailed, 4 rerun)
distributed/elastic/utils/distributed_test 1/1 (3 failed, 4 passed, 6 rerun)
distributed/test_c10d_common 1/1 (1 unit test(s) failed)
distributed/test_c10d_gloo 1/1 (1 unit test(s) failed)
distributed/test_c10d_nccl 1/1 (1 unit test(s) failed)
+ test_cuda_expandable_segments

All failing tests were analysed individually:

  • by trying to rerun the failing tests in the (EasyBuild) build environment, or
  • by running the essential code of the tests in the (EasyBuild) build environment (see the sketch after this list).
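
For illustration, a minimal standalone snippet of the kind used to exercise the cuDNN code path outside the test suite could look as follows. Assumptions: a CUDA-capable GPU is visible and the PyTorch installation is on the path; this is not the exact reproducer that was used:

```python
# Run a cuDNN convolution forward and backward; in cuDNN 8.x the backward pass
# triggers the dynamic load of libcudnn_cnn_train.so.8, i.e. the code path that
# failed with "Could not load library libcudnn_cnn_train.so.8" before the fix.
import torch

assert torch.cuda.is_available(), "requires a visible CUDA device"

conv = torch.nn.Conv2d(3, 8, kernel_size=3).cuda()
x = torch.randn(4, 3, 32, 32, device="cuda", requires_grad=True)

y = conv(x)          # forward convolution (cuDNN inference kernels)
y.sum().backward()   # backward convolution (cuDNN training kernels)

print("cuDNN available:", torch.backends.cudnn.is_available())
print("output shape:", tuple(y.shape))
```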

Some of the failures could be reproduced when running the exact same tests as run by the test suite, but not when running the essential code from separate Python scripts. This suggests that the issue lies with the test environment rather than with the actual code being tested.

Some failures could be related to the specific environment being used for building (a VM with vGPUs).

Altogether, it seems reasonable to move forward with the changes suggested in this PR -- patching libtorch_cuda.so and accepting a few more failing tests. The build could be done in two steps:

  1. Build while allowing only 2 failed tests (the value in the easyconfig that is available with EasyBuild/4.9.4) and without excluding the specific test test_cuda_expandable_segments, to obtain a reference for building the package on build hosts with GPUs.
  2. If the same tests (and no additional ones) fail, apply all suggested changes to accept the failed tests and build PyTorch. The changes needed are implemented in commit 59c99a3 (see the sketch after this list).
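
For illustration only, the easyconfig-level change involved in step 2 could look like the fragment below; whether commit 59c99a3 uses exactly this parameter and value is not restated here, so treat it as a sketch:

```python
# Hypothetical easyconfig fragment (easyconfigs use Python syntax): tolerate the
# known failing tests instead of aborting the installation. max_failed_tests is
# assumed to be the per-easyconfig knob of the PyTorch easyblock referred to
# above (available with EasyBuild/4.9.4); 10 corresponds to the 9+1 failures.
max_failed_tests = 10
```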

The latter commit also adds a sanity check that verifies that libtorch_cuda.so indeed depends on libcudnn_cnn_train.so.8.
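
The exact check is part of commit 59c99a3; a sketch of what such a sanity check could look like in the easyconfig follows. The path inside the installation is an assumption; %(installdir)s is a standard EasyBuild template:

```python
# Hypothetical easyconfig fragment: fail the sanity check step if libtorch_cuda.so
# does not list libcudnn_cnn_train.so.8 as a NEEDED shared library.
sanity_check_commands = [
    "readelf -d %(installdir)s/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so"
    " | grep -q 'NEEDED.*libcudnn_cnn_train.so.8'",
]
```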

@trz42 added the labels 2023.06-software.eessi.io (2023.06 version of software.eessi.io) and accel:nvidia on Mar 20, 2025

eessi-bot bot commented Mar 20, 2025

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/sapphirerapids, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat


eessi-bot bot commented Mar 20, 2025

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-compat, eessi.io-2023.06-software

@eessi-bot-trz42

Instance trz42-GH200-jr is configured to build for:

  • architectures: aarch64/nvidia/grace
  • repositories: eessi.io-2023.06-software

@eessi-bot-toprichard

Instance rt-Grace-jr is configured to build for:

  • architectures: aarch64/nvidia/grace
  • repositories: eessi.io-2023.06-software
