
{2023.06}[2023a] PyTorch v2.1.2 with CUDA/12.1.1 #973


Open
wants to merge 5 commits into base: 2023.06-software.eessi.io

Conversation

trz42 (Collaborator) commented Mar 20, 2025

New (final?) attempt to build PyTorch/2.1.2 with CUDA/12.1.1

This PR should replace previous attempts:

The PR is based on extensive testing / debugging / analysis on a VM with Haswell CPUs and NVIDIA L40S vGPUs (CUDA compute capability 8.9). It benefits from the recently rebuilt CUDA/12.1.1 modules (#919), whose module files add a directory with the needed libraries to $LIBRARY_PATH, so that the RPATH wrappers used for building software in EESSI pass the necessary arguments to the linker command. Even with that change, nearly 100 tests of the PyTorch test suite (which contains about 207k tests) still failed. Most of these tests failed with an error such as

Could not load library libcudnn_cnn_train.so.8. Error: libcudnn_cnn_train.so.8: cannot open shared object file: No such file or directory

because the library is loaded dynamically by another library of the cuDNN SDK. Normally this would be fixed by adding the directory containing libcudnn_cnn_train.so.8 to $LD_LIBRARY_PATH. Since we don't want to do that in EESSI, and we also don't want to modify the binary distribution of the cuDNN SDK (which would likely violate its license), we work around the issue by adding a dependency on that library to libtorch_cuda.so, which is built when installing PyTorch. This workaround is implemented in commit 4cc89fd.
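
For context, the workaround amounts to recording libcudnn_cnn_train.so.8 as a direct (NEEDED) dependency of libtorch_cuda.so, so the dynamic linker can resolve it at load time through the RPATH entries embedded in libtorch_cuda.so (assuming those already cover the cuDNN library directory, as they must for libcudnn.so.8) instead of via $LD_LIBRARY_PATH. Below is a minimal sketch of the idea using patchelf; the path is only an example, and commit 4cc89fd may achieve the same effect differently (e.g. during the build step):

    # record libcudnn_cnn_train.so.8 as a direct (NEEDED) dependency of libtorch_cuda.so,
    # so it is resolved via the RPATH already embedded in that library instead of
    # requiring the cuDNN directory in $LD_LIBRARY_PATH at run time
    patchelf --add-needed libcudnn_cnn_train.so.8 \
        <installdir>/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so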

After that fix, 9+1 tests still failed. The failing tests are:

dynamo/test_functions 1/1 (1 failed, 167 passed, 2 rerun)
dynamo/test_dynamic_shapes 1/1 (2 failed, 2065 passed, 14 skipped, 33 xfailed, 4 rerun)
distributed/elastic/utils/distributed_test 1/1 (3 failed, 4 passed, 6 rerun)
distributed/test_c10d_common 1/1 (1 unit test(s) failed)
distributed/test_c10d_gloo 1/1 (1 unit test(s) failed)
distributed/test_c10d_nccl 1/1 (1 unit test(s) failed)
+ test_cuda_expandable_segments

All failing tests were analysed individually:

  • by trying to rerun the tests in the (EasyBuild) build environment, or
  • by running essential code of the tests in the (EasyBuild) build environment.

Some of the failures could be reproduced when running the exact same tests as run by the test suite, but not when running the essential code from separate Python scripts. This might indicate that the issue lies with the test environment rather than with the actual code being tested.

Some failures could be related to the specific environment being used for building (a VM with vGPUs).

Altogether, it seems reasonable to move forward with the changes suggested in this PR -- patching libtorch_cuda.so and accepting a few more failing tests. The build could be done in two steps:

  1. Build while allowing only 2 failed tests (the value in the easyconfig that is available with EasyBuild/4.9.4) and without excluding one specific test (test_cuda_expandable_segments), to obtain a reference for building the package on build hosts with GPUs.
  2. If the same tests (and no additional ones) fail, apply all suggested changes to accept the failed tests and build PyTorch. The changes needed are implemented in commit 59c99a3.

The latter commit also adds a sanity check that verifies that libtorch_cuda.so depends on libcudnn_cnn_train.so.8.
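
For reference, that sanity check boils down to a command along these lines (a sketch; the exact location of libtorch_cuda.so within the installation depends on the Python/PyTorch layout):

    # verify that libtorch_cuda.so lists libcudnn_cnn_train.so.8 as a NEEDED dependency;
    # a non-zero exit code makes the sanity check, and hence the installation, fail
    readelf -d <installdir>/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so \
        | grep 'NEEDED.*libcudnn_cnn_train\.so\.8'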

trz42 added the labels 2023.06-software.eessi.io and accel:nvidia on Mar 20, 2025

eessi-bot bot commented Mar 20, 2025

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/sapphirerapids, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat


eessi-bot bot commented Mar 20, 2025

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-compat, eessi.io-2023.06-software

@eessi-bot-trz42

Instance trz42-GH200-jr is configured to build for:

  • architectures: aarch64/nvidia/grace
  • repositories: eessi.io-2023.06-software

@eessi-bot-toprichard

Instance rt-Grace-jr is configured to build for:

  • architectures: aarch64/nvidia/grace
  • repositories: eessi.io-2023.06-software


laraPPr commented Apr 8, 2025

bot: help


eessi-bot bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command help from laraPPr

    • expanded format: help
  • handling command help resulted in:
    How to send commands to bot instances

    • Commands must be sent with a new comment (edits of existing comments are ignored).
    • A comment may contain multiple commands, one per line.
    • Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
    • Currently supported COMMANDs are: help, build, show_config, status

    For more information, see https://www.eessi.io/docs/bot


eessi-bot-surf bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-surf (click for details)
  • account laraPPr has NO permission to send commands to the bot


eessi-bot bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command help from laraPPr

    • expanded format: help
  • handling command help resulted in:
    How to send commands to bot instances

    • Commands must be sent with a new comment (edits of existing comments are ignored).
    • A comment may contain multiple commands, one per line.
    • Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
    • Currently supported COMMANDs are: help, build, show_config, status

    For more information, see https://www.eessi.io/docs/bot

@eessi-bot-trz42

Updates by the bot instance trz42-GH200-jr (click for details)
  • account laraPPr has NO permission to send commands to the bot

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account laraPPr has NO permission to send commands to the bot


gpu-bot-ugent bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command help from laraPPr

    • expanded format: help
  • handling command help resulted in:
    How to send commands to bot instances

    • Commands must be sent with a new comment (edits of existing comments are ignored).
    • A comment may contain multiple commands, one per line.
    • Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
    • Currently supported COMMANDs are: help, build, show_config, status

    For more information, see https://www.eessi.io/docs/bot


laraPPr commented Apr 8, 2025

bot: build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80


eessi-bot bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted


eessi-bot bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted


gpu-bot-ugent bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

@eessi-bot-trz42

Updates by the bot instance trz42-GH200-jr (click for details)
  • account laraPPr has NO permission to send commands to the bot


eessi-bot-surf bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-surf (click for details)
  • account laraPPr has NO permission to send commands to the bot

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account laraPPr has NO permission to send commands to the bot


gpu-bot-ugent bot commented Apr 8, 2025

The build seemed to have gone fine, and the test suite failed because I seem to have done something wrong when updating the reframe_config.

New job on instance eessi-bot-vsc-ugent for CPU micro-architecture x86_64-amd-zen3 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /scratch/gent/vo/002/gvo00211/SHARED/jobs/2025.04/pr_973/15457347

date job status comment
Apr 08 09:52:22 UTC 2025 submitted job id 15457347 awaits release by job manager
Apr 08 09:52:52 UTC 2025 released job awaits launch by Slurm scheduler
Apr 08 22:45:35 UTC 2025 running job 15457347 is running
Apr 09 06:24:26 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-15457347.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen3-1744179019.tar.gz
size: 508 MiB (533708474 bytes)
entries: 12854
modules under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/modules/all
magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
PyTorch/2.1.2-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/software
magma/2.7.2-foss-2023a-CUDA-12.1.1
PyTorch/2.1.2-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80
2023.06/init/easybuild/eb_hooks.py
Apr 09 06:24:27 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-15457347.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case


trz42 commented Apr 8, 2025

Running a test build on Snellius. Since this goes to a zen4 partition, the build may include a couple of additional packages...
bot: build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90


eessi-bot bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from trz42

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted


eessi-bot-surf bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-surf (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from trz42

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account trz42 has NO permission to send commands to the bot


eessi-bot-trz42 bot commented Apr 8, 2025

Updates by the bot instance trz42-GH200-jr (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from trz42

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted


eessi-bot bot commented Apr 8, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from trz42

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted


eessi-bot bot commented Apr 9, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account laraPPr has NO permission to send commands to the bot


eessi-bot-surf bot commented Apr 9, 2025

Updates by the bot instance eessi-bot-surf (click for details)
  • account laraPPr has NO permission to send commands to the bot

@eessi-bot-trz42

Updates by the bot instance trz42-GH200-jr (click for details)
  • account laraPPr has NO permission to send commands to the bot


gpu-bot-ugent bot commented Apr 9, 2025

New job on instance eessi-bot-vsc-ugent for CPU micro-architecture x86_64-amd-zen3 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /scratch/gent/vo/002/gvo00211/SHARED/jobs/2025.04/pr_973/15457509

date job status comment
Apr 09 08:14:35 UTC 2025 submitted job id 15457509 awaits release by job manager
Apr 09 08:16:37 UTC 2025 released job awaits launch by Slurm scheduler
Apr 09 08:44:42 UTC 2025 running job 15457509 is running
Apr 09 16:18:40 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-15457509.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen3-1744214532.tar.gz
size: 508 MiB (533693476 bytes)
entries: 12854
modules under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/modules/all
magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
PyTorch/2.1.2-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/software
magma/2.7.2-foss-2023a-CUDA-12.1.1
PyTorch/2.1.2-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80
2023.06/init/easybuild/eb_hooks.py
Apr 09 16:18:40 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/9) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/9) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/9) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/9) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/9) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/9) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/9) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/9) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ OK ] (9/9) EESSI_LAMMPS_lj %device_type=gpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1 %scale=1_4_node /497af4b1 @BotBuildTests:x86_64_amd_zen3_nvidia_cc80+default
P: perf: 4374.671 timesteps/s (r:0, l:None, u:None)
[ PASSED ] Ran 1/9 test case(s) from 9 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-15457509.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case


trz42 commented Apr 9, 2025

Next attempt to build on Snellius after extending the walltime limit. Since this goes to a zen4 partition, the build may include a couple of additional packages...
bot: build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90


eessi-bot bot commented Apr 9, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from trz42

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted


eessi-bot bot commented Apr 9, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from trz42

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted


eessi-bot-surf bot commented Apr 9, 2025

Updates by the bot instance eessi-bot-surf (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from trz42

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:


eessi-bot-trz42 bot commented Apr 9, 2025

Updates by the bot instance trz42-GH200-jr (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from trz42

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted


gpu-bot-ugent bot commented Apr 9, 2025

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from trz42

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account trz42 has NO permission to send commands to the bot


eessi-bot-surf bot commented Apr 9, 2025

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.04/pr_973/11095064

date job status comment
Apr 09 08:47:02 UTC 2025 submitted job id 11095064 will be eligible to start in about 20 seconds
Apr 09 08:47:16 UTC 2025 received job awaits launch by Slurm scheduler
Apr 09 08:47:30 UTC 2025 running job 11095064 is running
Apr 10 08:53:19 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-11095064.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-1744273357.tar.gz
size: 1088 MiB (1140893818 bytes)
entries: 307
modules under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all
LightGBM/4.5.0-foss-2023a-CUDA-12.1.1.lua
NCCL/2.18.3-GCCcore-12.3.0-CUDA-12.1.1.lua
UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua
cuDNN/8.9.2.26-CUDA-12.1.1.lua
magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software
LightGBM/4.5.0-foss-2023a-CUDA-12.1.1
NCCL/2.18.3-GCCcore-12.3.0-CUDA-12.1.1
UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1
cuDNN/8.9.2.26-CUDA-12.1.1
magma/2.7.2-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90
2023.06/init/easybuild/eb_hooks.py
Apr 10 08:53:19 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 0/0 test case(s) from 0 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-11095064.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case


trz42 commented Apr 10, 2025

Try building on NVIDIA Grace/Hopper
bot: build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes


eessi-bot bot commented Apr 10, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • parsing the bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes, received from sender trz42, failed


eessi-bot bot commented Apr 10, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • parsing the bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes, received from sender trz42, failed


gpu-bot-ugent bot commented Apr 10, 2025

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • parsing the bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes, received from sender trz42, failed

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account trz42 has NO permission to send commands to the bot


eessi-bot-surf bot commented Apr 10, 2025

Updates by the bot instance eessi-bot-surf (click for details)
  • received bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes from trz42

    • expanded format: build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes
  • handling command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes resulted in:

    • no jobs were submitted


eessi-bot-trz42 bot commented Apr 10, 2025

Updates by the bot instance trz42-GH200-jr (click for details)
  • received bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes from trz42

    • expanded format: build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes
  • handling command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes resulted in:


eessi-bot-trz42 bot commented Apr 10, 2025

New job on instance trz42-GH200-jr for CPU micro-architecture aarch64-nvidia-grace and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /p/project1/ceasybuilders/bot-trz42/jobs/2025.04/pr_973/13569211

date job status comment
Apr 10 07:17:26 UTC 2025 submitted job id 13569211 awaits release by job manager
Apr 10 07:18:30 UTC 2025 released job awaits launch by Slurm scheduler
Apr 10 07:19:34 UTC 2025 running job 13569211 is running
Apr 10 08:36:39 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-13569211.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-nvidia-grace-1744273694.tar.gz
size: 301 MiB (316387868 bytes)
entries: 114
modules under 2023.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/modules/all
magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/software
magma/2.7.2-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90
2023.06/init/easybuild/eb_hooks.py
Apr 10 08:36:39 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-13569211.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case


trz42 commented Apr 11, 2025

Build again on NVIDIA Grace/Hopper after fixing the patch issue (the path to the library needs to take the CPU family into account)...
bot: build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes


eessi-bot bot commented Apr 11, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • parsing the bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes, received from sender trz42, failed


eessi-bot-surf bot commented Apr 11, 2025

Updates by the bot instance eessi-bot-surf (click for details)
  • received bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes from trz42

    • expanded format: build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes
  • handling command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes resulted in:

    • no jobs were submitted


eessi-bot bot commented Apr 11, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • parsing the bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes, received from sender trz42, failed

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account trz42 has NO permission to send commands to the bot


eessi-bot-trz42 bot commented Apr 11, 2025

Updates by the bot instance trz42-GH200-jr (click for details)
  • received bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes from trz42

    • expanded format: build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes
  • handling command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace accelerator:nvidia/cc90 exportvariable:SKIP_TESTS=yes resulted in:


eessi-bot-trz42 bot commented Apr 11, 2025

New job on instance trz42-GH200-jr for CPU micro-architecture aarch64-nvidia-grace and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /p/project1/ceasybuilders/bot-trz42/jobs/2025.04/pr_973/13573956

  • this time patching worked
    == 2025-04-11 07:32:51,607 run.py:700 INFO cmd "readelf -d /tmp/USER/easybuild/build/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/build/lib.linux-aarch64-cpython-311/torch/lib/libtorch_cuda.so" exited with exit code 0 and output:
    
    Dynamic section at offset 0xa359178 contains 46 entries:
      Tag        Type                         Name/Value
     0x0000000000000001 (NEEDED)             Shared library: [libcudnn_cnn_train.so.8]
     0x0000000000000001 (NEEDED)             Shared library: [libc10_cuda.so]
     0x0000000000000001 (NEEDED)             Shared library: [libcudart.so.12]
     0x0000000000000001 (NEEDED)             Shared library: [libcusparse.so.12]
     0x0000000000000001 (NEEDED)             Shared library: [libcurand.so.10]
     0x0000000000000001 (NEEDED)             Shared library: [libcufft.so.11]
     0x0000000000000001 (NEEDED)             Shared library: [libnvToolsExt.so.1]
     0x0000000000000001 (NEEDED)             Shared library: [libcudnn.so.8]
     0x0000000000000001 (NEEDED)             Shared library: [libnccl.so.2]
     0x0000000000000001 (NEEDED)             Shared library: [libibverbs.so.1]
     0x0000000000000001 (NEEDED)             Shared library: [libmpi.so.40]
     0x0000000000000001 (NEEDED)             Shared library: [libc10.so]
     0x0000000000000001 (NEEDED)             Shared library: [libtorch_cpu.so]
     0x0000000000000001 (NEEDED)             Shared library: [libcublas.so.12]
     0x0000000000000001 (NEEDED)             Shared library: [libcublasLt.so.12]
     0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
     0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
     0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
     0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
     0x000000000000000e (SONAME)             Library soname: [libtorch_cuda.so]
    
  • the build job kept running the PyTorch unit tests for a while, but eventually failed because too many tests failed:
    dynamo/test_functions 1/1 (1 failed, 167 passed, 2 rerun)
    dynamo/test_dynamic_shapes 1/1 (2 failed, 2065 passed, 14 skipped, 33 xfailed, 4 rerun)
    test_model_dump 1/1 (2 failed, 6 passed, 1 skipped, 4 rerun)
    test_ops 1/1 (2 failed, 20693 passed, 8497 skipped, 324 xfailed, 4 rerun)
    test_optim 1/1 (2 failed, 182 passed, 2 skipped, 4 rerun)
    test_scatter_gather_ops 1/1 (1 failed, 80 passed, 2 rerun)
    test_cuda 1/1 (5 failed, 132 passed, 12 skipped, 2 xfailed, 10 rerun)
    distributed/rpc/cuda/test_tensorpipe_agent 1/1 (1 unit test(s) failed)
    distributed/rpc/test_faulty_agent 1/1 (1 unit test(s) failed)
    distributed/rpc/test_share_memory 1/1 (1 unit test(s) failed)
    distributed/test_store 1/1 (1 unit test(s) failed)
    
  • test_model_dump failed with
    RuntimeError: Didn't find engine for operation quantized::linear_prepack NoQEngine
    
  • test_ops failed for
    FAILED [0.0573s] test_ops.py::TestCommonCPU::test_python_ref__refs_square_cpu_complex64
    FAILED [0.0539s] test_ops.py::TestCommonCPU::test_python_ref_torch_fallback__refs_square_cpu_complex64
    
    with
    AssertionError: tensor(False) is not true : Reference result was farther (1.3385259018293323) from the precise computation than the torch result was (1.338523206207303)!
    AssertionError: tensor(False) is not true : Reference result was farther (1.3385259018293323) from the precise computation than the torch result was (1.338523206207303)!
    
  • test_scatter_gather_ops failed for
    FAILED [0.0092s] test_scatter_gather_ops.py::TestScatterGatherCPU::test_scatter_reduce_prod_cpu_complex128
    
    with
    AssertionError: Tensor-likes are not equal!
    
    Mismatched elements: 132 / 1870 (7.1%)
    Greatest absolute difference: 1.2710574864626038e-13 at index (0, 9, 2)
    Greatest relative difference: 1.9521439324923405e-16 at index (9, 5, 3)
    
  • test_cuda failed for
    FAILED [0.0012s] test_cuda.py::TestCudaMallocAsync::test_cycles - RuntimeErro...
    FAILED [0.0015s] test_cuda.py::TestCudaMallocAsync::test_direct_traceback - R...
    FAILED [0.0012s] test_cuda.py::TestCudaMallocAsync::test_memory_plots - Runti...
    FAILED [0.0011s] test_cuda.py::TestCudaMallocAsync::test_memory_plots_free_stack
    FAILED [0.0012s] test_cuda.py::TestCudaMallocAsync::test_memory_snapshot_with_cpp
    
  • distributed/rpc/cuda/test_tensorpipe_agent failed with
    RuntimeError: In getBar1SizeOfGpu at tensorpipe/channel/cuda_gdr/context_impl.cc:242 "": No such file or directory
    
  • distributed/rpc/test_faulty_agent failed with
    RuntimeError: In getBar1SizeOfGpu at tensorpipe/channel/cuda_gdr/context_impl.cc:242 "": No such file or directory
    
  • distributed/rpc/test_share_memory failed with
    distributed/rpc/test_share_memory.py::TestRPCPickler::test_case [W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
    [W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
    [W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
    [W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
    [W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
    [W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
    [W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
    /tmp/eb-0b0w_j0o/eb-1ozjj0ui/tmpgw_4g3yy/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:604: UserWarning: You are using a Backend <class 'torch.distributed.distributed_c10d.ProcessGroupGloo'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
      warnings.warn(
    
    
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/threading.py:320: KeyboardInterrupt
    (to show a full traceback on KeyboardInterrupt use --full-trace)
    ====================== no tests ran in 903.24s (0:15:03) =======================
    
  • distributed/test_store failed for
    FAILED [0.0005s] distributed/test_store.py::FileStoreTest::test_init_pg_and_rpc_with_same_file
    
    with
    RuntimeError: RPC is already initialized
    
date job status comment
Apr 11 04:44:54 UTC 2025 submitted job id 13573956 awaits release by job manager
Apr 11 04:45:20 UTC 2025 released job awaits launch by Slurm scheduler
Apr 11 04:46:24 UTC 2025 running job 13573956 is running
Apr 11 11:00:41 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-13573956.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-nvidia-grace-1744368529.tar.gz
size: 301 MiB (316399772 bytes)
entries: 114
modules under 2023.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/modules/all
magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/software
magma/2.7.2-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90
2023.06/init/easybuild/eb_hooks.py
Apr 11 11:00:41 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-13573956.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
