
{2023.06}[2023a] PyTorch v2.1.2 with CUDA/12.1.1 #973


Open
wants to merge 5 commits into base: 2023.06-software.eessi.io

Conversation

trz42 (Collaborator) commented Mar 20, 2025

New (final?) attempt to build PyTorch/2.1.2 with CUDA/12.1.1

This PR should replace previous attempts:

The PR is based on extensive testing / debugging / analysis on a VM with Haswell CPUs and NVIDIA L40S vGPUs (CUDA compute capability 8.9). It benefits from the recently rebuilt CUDA/12.1.1 modules (#919), whose module files add a directory with the needed libraries to $LIBRARY_PATH, so that the RPATH wrappers used for building software in EESSI pass the necessary arguments to the linker command. Even with that change, nearly 100 tests of the PyTorch test suite (which contains about 207k tests) still failed. Most of these tests failed with an error such as

Could not load library libcudnn_cnn_train.so.8. Error: libcudnn_cnn_train.so.8: cannot open shared object file: No such file or directory

because the library is loaded dynamically by another library of the cuDNN SDK. Normally this would be fixed by adding the directory containing libcudnn_cnn_train.so.8 to $LD_LIBRARY_PATH. Since we don't want to do that in EESSI, and we also don't want to modify the binary distribution of the cuDNN SDK (which would likely violate its license), we work around the issue by adding a dependency on that library to libtorch_cuda.so, which is built when installing PyTorch. This workaround is implemented in commit 4cc89fd.
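
For context, the workaround amounts to recording libcudnn_cnn_train.so.8 as a direct (NEEDED) dependency of libtorch_cuda.so, so the dynamic linker can resolve it at load time through the RPATH entries embedded in libtorch_cuda.so (assuming those already cover the cuDNN library directory, as they must for libcudnn.so.8) instead of via $LD_LIBRARY_PATH. Below is a minimal sketch of the idea using patchelf; the path is only an example, and commit 4cc89fd may achieve the same effect differently (e.g. during the build step):

    # record libcudnn_cnn_train.so.8 as a direct (NEEDED) dependency of libtorch_cuda.so,
    # so it is resolved via the RPATH already embedded in that library instead of
    # requiring the cuDNN directory in $LD_LIBRARY_PATH at run time
    patchelf --add-needed libcudnn_cnn_train.so.8 \
        <installdir>/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so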

After that fix, 9+1 tests still failed. The failing tests are:

dynamo/test_functions 1/1 (1 failed, 167 passed, 2 rerun)
dynamo/test_dynamic_shapes 1/1 (2 failed, 2065 passed, 14 skipped, 33 xfailed, 4 rerun)
distributed/elastic/utils/distributed_test 1/1 (3 failed, 4 passed, 6 rerun)
distributed/test_c10d_common 1/1 (1 unit test(s) failed)
distributed/test_c10d_gloo 1/1 (1 unit test(s) failed)
distributed/test_c10d_nccl 1/1 (1 unit test(s) failed)
+ test_cuda_expandable_segments

All failing tests were analysed individually:

  • by trying to rerun the tests in the (EasyBuild) build environment, or
  • by running essential code of the tests in the (EasyBuild) build environment.

Some of the failures could be reproduced when running the exact same tests as run by the test suite, but not when running the essential code from separate Python scripts. This might indicate that the issue lies with the test environment rather than with the actual code being tested.

Some failures could be related to the specific environment being used for building (a VM with vGPUs).

Altogether, it seems reasonable to move forward with the changes suggested in this PR -- patching libtorch_cuda.so and accepting a few more failing tests. The build could be done in two steps:

  1. Build while allowing only 2 failed tests (the value in the easyconfig that is available with EasyBuild/4.9.4) and without excluding one specific test (test_cuda_expandable_segments), to obtain a reference for building the package on build hosts with GPUs.
  2. If the same tests (and no additional ones) fail, apply all suggested changes to accept the failed tests and build PyTorch. The changes needed are implemented in commit 59c99a3.

The latter commit also adds a sanity check that verifies that libtorch_cuda.so depends on libcudnn_cnn_train.so.8.
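
For reference, that sanity check boils down to a command along these lines (a sketch; the exact location of libtorch_cuda.so within the installation depends on the Python/PyTorch layout):

    # verify that libtorch_cuda.so lists libcudnn_cnn_train.so.8 as a NEEDED dependency;
    # a non-zero exit code makes the sanity check, and hence the installation, fail
    readelf -d <installdir>/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so \
        | grep 'NEEDED.*libcudnn_cnn_train\.so\.8'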

trz42 added the labels 2023.06-software.eessi.io and accel:nvidia on Mar 20, 2025

eessi-bot bot commented Mar 20, 2025

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/sapphirerapids, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat


eessi-bot bot commented Mar 20, 2025

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-compat, eessi.io-2023.06-software

@eessi-bot-trz42

Instance trz42-GH200-jr is configured to build for:

  • architectures: aarch64/nvidia/grace
  • repositories: eessi.io-2023.06-software

@eessi-bot-toprichard

Instance rt-Grace-jr is configured to build for:

  • architectures: aarch64/nvidia/grace
  • repositories: eessi.io-2023.06-software


laraPPr commented Apr 8, 2025

bot: help


eessi-bot bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command help from laraPPr

    • expanded format: help
  • handling command help resulted in:
    How to send commands to bot instances

    • Commands must be sent with a new comment (edits of existing comments are ignored).
    • A comment may contain multiple commands, one per line.
    • Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
    • Currently supported COMMANDs are: help, build, show_config, status

    For more information, see https://www.eessi.io/docs/bot


eessi-bot-surf bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-surf (click for details)
  • account laraPPr has NO permission to send commands to the bot


eessi-bot bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command help from laraPPr

    • expanded format: help
  • handling command help resulted in:
    How to send commands to bot instances

    • Commands must be sent with a new comment (edits of existing comments are ignored).
    • A comment may contain multiple commands, one per line.
    • Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
    • Currently supported COMMANDs are: help, build, show_config, status

    For more information, see https://www.eessi.io/docs/bot

@eessi-bot-trz42

Updates by the bot instance trz42-GH200-jr (click for details)
  • account laraPPr has NO permission to send commands to the bot

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account laraPPr has NO permission to send commands to the bot


gpu-bot-ugent bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command help from laraPPr

    • expanded format: help
  • handling command help resulted in:
    How to send commands to bot instances

    • Commands must be sent with a new comment (edits of existing comments are ignored).
    • A comment may contain multiple commands, one per line.
    • Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
    • Currently supported COMMANDs are: help, build, show_config, status

    For more information, see https://www.eessi.io/docs/bot


laraPPr commented Apr 8, 2025

bot: build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80


eessi-bot bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted


eessi-bot bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted


gpu-bot-ugent bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

@eessi-bot-trz42

Updates by the bot instance trz42-GH200-jr (click for details)
  • account laraPPr has NO permission to send commands to the bot


eessi-bot-surf bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-surf (click for details)
  • account laraPPr has NO permission to send commands to the bot

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account laraPPr has NO permission to send commands to the bot


gpu-bot-ugent bot commented Apr 8, 2025

The build seemed to have gone fine, and the test suite failed because I seem to have done something wrong when updating the reframe_config.

New job on instance eessi-bot-vsc-ugent for CPU micro-architecture x86_64-amd-zen3 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /scratch/gent/vo/002/gvo00211/SHARED/jobs/2025.04/pr_973/15457347

date job status comment
Apr 08 09:52:22 UTC 2025 submitted job id 15457347 awaits release by job manager
Apr 08 09:52:52 UTC 2025 released job awaits launch by Slurm scheduler
Apr 08 22:45:35 UTC 2025 running job 15457347 is running
Apr 09 06:24:26 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-15457347.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen3-1744179019.tar.gz
size: 508 MiB (533708474 bytes)
entries: 12854
modules under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/modules/all
magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
PyTorch/2.1.2-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/software
magma/2.7.2-foss-2023a-CUDA-12.1.1
PyTorch/2.1.2-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80
2023.06/init/easybuild/eb_hooks.py
Apr 09 06:24:27 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-15457347.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case


trz42 commented Apr 8, 2025

Running a test build on Snellius. Since this goes to a zen4 partition, the build may include a couple of additional packages...
bot: build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90


eessi-bot bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from trz42

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted


eessi-bot-surf bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-surf (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from trz42

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account trz42 has NO permission to send commands to the bot


eessi-bot-trz42 bot commented Apr 8, 2025

Updates by the bot instance trz42-GH200-jr (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from trz42

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted


eessi-bot bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from trz42

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted


eessi-bot bot commented Apr 9, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account laraPPr has NO permission to send commands to the bot


eessi-bot-surf bot commented Apr 9, 2025

Updates by the bot instance eessi-bot-surf (click for details)
  • account laraPPr has NO permission to send commands to the bot

@eessi-bot-trz42

Updates by the bot instance trz42-GH200-jr (click for details)
  • account laraPPr has NO permission to send commands to the bot


gpu-bot-ugent bot commented Apr 9, 2025

New job on instance eessi-bot-vsc-ugent for CPU micro-architecture x86_64-amd-zen3 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /scratch/gent/vo/002/gvo00211/SHARED/jobs/2025.04/pr_973/15457509

date job status comment
Apr 09 08:14:35 UTC 2025 submitted job id 15457509 awaits release by job manager
Apr 09 08:16:37 UTC 2025 released job awaits launch by Slurm scheduler
Apr 09 08:44:42 UTC 2025 running job 15457509 is running
Apr 09 16:18:40 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-15457509.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen3-1744214532.tar.gz
size: 508 MiB (533693476 bytes)
entries: 12854
modules under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/modules/all
magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
PyTorch/2.1.2-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/software
magma/2.7.2-foss-2023a-CUDA-12.1.1
PyTorch/2.1.2-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80
2023.06/init/easybuild/eb_hooks.py
Apr 09 16:18:40 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/9) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/9) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/9) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/9) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/9) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/9) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/9) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/9) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ OK ] (9/9) EESSI_LAMMPS_lj %device_type=gpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1 %scale=1_4_node /497af4b1 @BotBuildTests:x86_64_amd_zen3_nvidia_cc80+default
P: perf: 4374.671 timesteps/s (r:0, l:None, u:None)
[ PASSED ] Ran 1/9 test case(s) from 9 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-15457509.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case


trz42 commented Apr 9, 2025

Next attempt to build on Snellius after extending the walltime limit. Since this goes to a zen4 partition, the build may include a couple of additional packages...
bot: build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90


eessi-bot bot commented Apr 9, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from trz42

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted


eessi-bot bot commented Apr 9, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from trz42

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted


eessi-bot-surf bot commented Apr 9, 2025

Updates by the bot instance eessi-bot-surf (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from trz42

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:


eessi-bot-trz42 bot commented Apr 9, 2025

Updates by the bot instance trz42-GH200-jr (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from trz42

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted


gpu-bot-ugent bot commented Apr 9, 2025

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from trz42

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account trz42 has NO permission to send commands to the bot


eessi-bot-surf bot commented Apr 9, 2025

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.04/pr_973/11095064

date job status comment
Apr 09 08:47:02 UTC 2025 submitted job id 11095064 will be eligible to start in about 20 seconds
Apr 09 08:47:16 UTC 2025 received job awaits launch by Slurm scheduler
Apr 09 08:47:30 UTC 2025 running job 11095064 is running
Apr 10 08:53:19 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-11095064.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-1744273357.tar.gz
size: 1088 MiB (1140893818 bytes)
entries: 307
modules under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all
LightGBM/4.5.0-foss-2023a-CUDA-12.1.1.lua
NCCL/2.18.3-GCCcore-12.3.0-CUDA-12.1.1.lua
UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua
cuDNN/8.9.2.26-CUDA-12.1.1.lua
magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software
LightGBM/4.5.0-foss-2023a-CUDA-12.1.1
NCCL/2.18.3-GCCcore-12.3.0-CUDA-12.1.1
UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1
cuDNN/8.9.2.26-CUDA-12.1.1
magma/2.7.2-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90
2023.06/init/easybuild/eb_hooks.py
Apr 10 08:53:19 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 0/0 test case(s) from 0 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-11095064.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case


trz42 commented Apr 10, 2025

Try building on NVIDIA Grace/Hopper
bot: build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes


eessi-bot bot commented Apr 10, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • parsing the bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes, received from sender trz42, failed


eessi-bot bot commented Apr 10, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • parsing the bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes, received from sender trz42, failed


gpu-bot-ugent bot commented Apr 10, 2025

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • parsing the bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes, received from sender trz42, failed

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account trz42 has NO permission to send commands to the bot


eessi-bot-surf bot commented Apr 10, 2025

Updates by the bot instance eessi-bot-surf (click for details)
  • received bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes from trz42

    • expanded format: build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes
  • handling command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes resulted in:

    • no jobs were submitted


eessi-bot-trz42 bot commented Apr 10, 2025

Updates by the bot instance trz42-GH200-jr (click for details)
  • received bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes from trz42

    • expanded format: build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes
  • handling command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes resulted in:


eessi-bot-trz42 bot commented Apr 10, 2025

New job on instance trz42-GH200-jr for CPU micro-architecture aarch64-nvidia-grace and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /p/project1/ceasybuilders/bot-trz42/jobs/2025.04/pr_973/13569211

date job status comment
Apr 10 07:17:26 UTC 2025 submitted job id 13569211 awaits release by job manager
Apr 10 07:18:30 UTC 2025 released job awaits launch by Slurm scheduler
Apr 10 07:19:34 UTC 2025 running job 13569211 is running
Apr 10 08:36:39 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-13569211.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-nvidia-grace-1744273694.tar.gz
size: 301 MiB (316387868 bytes)
entries: 114
modules under 2023.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/modules/all
magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/software
magma/2.7.2-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90
2023.06/init/easybuild/eb_hooks.py
Apr 10 08:36:39 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-13569211.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case


trz42 commented Apr 11, 2025

Build again on NVIDIA Grace/Hopper after fixing the patch issue (the path to the library needs to take the CPU family into account)...
bot: build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes


eessi-bot bot commented Apr 11, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • parsing the bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes, received from sender trz42, failed


eessi-bot-surf bot commented Apr 11, 2025

Updates by the bot instance eessi-bot-surf (click for details)
  • received bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes from trz42

    • expanded format: build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes
  • handling command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes resulted in:

    • no jobs were submitted


eessi-bot bot commented Apr 11, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • parsing the bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes, received from sender trz42, failed

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account trz42 has NO permission to send commands to the bot


eessi-bot-trz42 bot commented Apr 11, 2025

Updates by the bot instance trz42-GH200-jr (click for details)
  • received bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes from trz42

    • expanded format: build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes
  • handling command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes resulted in:


eessi-bot-trz42 bot commented Apr 11, 2025

New job on instance trz42-GH200-jr for CPU micro-architecture aarch64-nvidia-grace and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /p/project1/ceasybuilders/bot-trz42/jobs/2025.04/pr_973/13573956

  • this time patching worked
    == 2025-04-11 07:32:51,607 run.py:700 INFO cmd "readelf -d /tmp/USER/easybuild/build/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/build/lib.linux-aarch64-cpython-311/torch/lib/libtorch_cuda.so" exited with exit code 0 and output:
    
    Dynamic section at offset 0xa359178 contains 46 entries:
      Tag        Type                         Name/Value
     0x0000000000000001 (NEEDED)             Shared library: [libcudnn_cnn_train.so.8]
     0x0000000000000001 (NEEDED)             Shared library: [libc10_cuda.so]
     0x0000000000000001 (NEEDED)             Shared library: [libcudart.so.12]
     0x0000000000000001 (NEEDED)             Shared library: [libcusparse.so.12]
     0x0000000000000001 (NEEDED)             Shared library: [libcurand.so.10]
     0x0000000000000001 (NEEDED)             Shared library: [libcufft.so.11]
     0x0000000000000001 (NEEDED)             Shared library: [libnvToolsExt.so.1]
     0x0000000000000001 (NEEDED)             Shared library: [libcudnn.so.8]
     0x0000000000000001 (NEEDED)             Shared library: [libnccl.so.2]
     0x0000000000000001 (NEEDED)             Shared library: [libibverbs.so.1]
     0x0000000000000001 (NEEDED)             Shared library: [libmpi.so.40]
     0x0000000000000001 (NEEDED)             Shared library: [libc10.so]
     0x0000000000000001 (NEEDED)             Shared library: [libtorch_cpu.so]
     0x0000000000000001 (NEEDED)             Shared library: [libcublas.so.12]
     0x0000000000000001 (NEEDED)             Shared library: [libcublasLt.so.12]
     0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
     0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
     0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
     0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
     0x000000000000000e (SONAME)             Library soname: [libtorch_cuda.so]
    
  • the build job kept running the PyTorch unit tests for a while, but eventually failed because too many tests failed:
    dynamo/test_functions 1/1 (1 failed, 167 passed, 2 rerun)
    dynamo/test_dynamic_shapes 1/1 (2 failed, 2065 passed, 14 skipped, 33 xfailed, 4 rerun)
    test_model_dump 1/1 (2 failed, 6 passed, 1 skipped, 4 rerun)
    test_ops 1/1 (2 failed, 20693 passed, 8497 skipped, 324 xfailed, 4 rerun)
    test_optim 1/1 (2 failed, 182 passed, 2 skipped, 4 rerun)
    test_scatter_gather_ops 1/1 (1 failed, 80 passed, 2 rerun)
    test_cuda 1/1 (5 failed, 132 passed, 12 skipped, 2 xfailed, 10 rerun)
    distributed/rpc/cuda/test_tensorpipe_agent 1/1 (1 unit test(s) failed)
    distributed/rpc/test_faulty_agent 1/1 (1 unit test(s) failed)
    distributed/rpc/test_share_memory 1/1 (1 unit test(s) failed)
    distributed/test_store 1/1 (1 unit test(s) failed)
    
  • test_model_dump failed with
    RuntimeError: Didn't find engine for operation quantized::linear_prepack NoQEngine
    
  • test_ops failed for
    FAILED [0.0573s] test_ops.py::TestCommonCPU::test_python_ref__refs_square_cpu_complex64
    FAILED [0.0539s] test_ops.py::TestCommonCPU::test_python_ref_torch_fallback__refs_square_cpu_complex64
    
    with
    AssertionError: tensor(False) is not true : Reference result was farther (1.3385259018293323) from the precise computation than the torch result was (1.338523206207303)!
    AssertionError: tensor(False) is not true : Reference result was farther (1.3385259018293323) from the precise computation than the torch result was (1.338523206207303)!
    
  • test_scatter_gather_ops failed for
    FAILED [0.0092s] test_scatter_gather_ops.py::TestScatterGatherCPU::test_scatter_reduce_prod_cpu_complex128
    
    with
    AssertionError: Tensor-likes are not equal!
    
    Mismatched elements: 132 / 1870 (7.1%)
    Greatest absolute difference: 1.2710574864626038e-13 at index (0, 9, 2)
    Greatest relative difference: 1.9521439324923405e-16 at index (9, 5, 3)
    
  • test_cuda failed for
    FAILED [0.0012s] test_cuda.py::TestCudaMallocAsync::test_cycles - RuntimeErro...
    FAILED [0.0015s] test_cuda.py::TestCudaMallocAsync::test_direct_traceback - R...
    FAILED [0.0012s] test_cuda.py::TestCudaMallocAsync::test_memory_plots - Runti...
    FAILED [0.0011s] test_cuda.py::TestCudaMallocAsync::test_memory_plots_free_stack
    FAILED [0.0012s] test_cuda.py::TestCudaMallocAsync::test_memory_snapshot_with_cpp
    
  • distributed/rpc/cuda/test_tensorpipe_agent failed with
    RuntimeError: In getBar1SizeOfGpu at tensorpipe/channel/cuda_gdr/context_impl.cc:242 "": No such file or directory
    
  • distributed/rpc/test_faulty_agent failed with
    RuntimeError: In getBar1SizeOfGpu at tensorpipe/channel/cuda_gdr/context_impl.cc:242 "": No such file or directory
    
  • distributed/rpc/test_share_memory failed with
    distributed/rpc/test_share_memory.py::TestRPCPickler::test_case [W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
    [W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
    [W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
    [W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
    [W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
    [W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
    [W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
    /tmp/eb-0b0w_j0o/eb-1ozjj0ui/tmpgw_4g3yy/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:604: UserWarning: You are using a Backend <class 'torch.distributed.distributed_c10d.ProcessGroupGloo'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
      warnings.warn(
    
    
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/threading.py:320: KeyboardInterrupt
    (to show a full traceback on KeyboardInterrupt use --full-trace)
    ====================== no tests ran in 903.24s (0:15:03) =======================
    
  • distributed/test_store failed for
    FAILED [0.0005s] distributed/test_store.py::FileStoreTest::test_init_pg_and_rpc_with_same_file
    
    with
    RuntimeError: RPC is already initialized
    
date job status comment
Apr 11 04:44:54 UTC 2025 submitted job id 13573956 awaits release by job manager
Apr 11 04:45:20 UTC 2025 released job awaits launch by Slurm scheduler
Apr 11 04:46:24 UTC 2025 running job 13573956 is running
Apr 11 11:00:41 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-13573956.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-nvidia-grace-1744368529.tar.gz
size: 301 MiB (316399772 bytes)
entries: 114
modules under 2023.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/modules/all
magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/software
magma/2.7.2-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90
2023.06/init/easybuild/eb_hooks.py
Apr 11 11:00:41 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-13573956.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
