{2023.06}[system,grace] EasyBuild 4.8.2, 4.9.0, 4.9.1, 4.9.2, 4.9.3, 4.9.4 #968

Merged: bedroge merged 1 commit into EESSI:2023.06-software.eessi.io on Mar 21, 2025

trz42
Collaborator

@trz42 trz42 commented Mar 13, 2025

First PR to start the software stack for NVIDIA Grace. See #967 for notes and coordination.

@trz42 trz42 added the labels 2023.06-software.eessi.io (2023.06 version of software.eessi.io) and grace (NVIDIA Grace CPU) on Mar 13, 2025

eessi-bot bot commented Mar 13, 2025

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/sapphirerapids, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat

eessi-bot bot commented Mar 13, 2025

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-compat, eessi.io-2023.06-software

@eessi-bot-trz42

Instance trz42-GH200-jr is configured to build for:

  • architectures: aarch64/nvidia/grace
  • repositories: eessi.io-2023.06-software

@trz42
Collaborator Author

trz42 commented Mar 13, 2025

First attempt: verify that it actually builds, and then that the upload with signing works.
bot: build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace
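
As an illustration of how such a command line could be interpreted, here is a minimal sketch. It is not the bot's actual implementation: the function names `parse_bot_command` and `instance_matches` are hypothetical, and the matching rule (an instance acts only when the `instance:` filter is absent or names it) is an assumption made for this example.

```python
def parse_bot_command(line):
    """Parse 'bot: <action> key:value ...' into (action, filters), or None."""
    prefix = "bot:"
    if not line.startswith(prefix):
        return None
    tokens = line[len(prefix):].split()
    if not tokens:
        return None
    action = tokens[0]
    filters = {}
    for token in tokens[1:]:
        # Split on the first ':' only, so values may contain '/' or '.'
        key, _, value = token.partition(":")
        filters[key] = value
    return action, filters


def instance_matches(filters, instance_name):
    """Hypothetical rule: act only if no instance filter is given or it names us."""
    wanted = filters.get("instance")
    return wanted is None or wanted == instance_name


command = ("bot: build instance:trz42-GH200-jr "
           "repository:eessi.io-2023.06-software "
           "architecture:aarch64/nvidia/grace")
action, filters = parse_bot_command(command)
for name in ("eessi-bot-mc-aws", "eessi-bot-mc-azure", "trz42-GH200-jr"):
    verdict = "would build" if instance_matches(filters, name) else "would skip"
    print(f"{name}: {verdict}")
```

Under this assumed rule, only the instance named in the `instance:` filter would submit a job; a real bot would presumably also compare the repository and architecture filters against its own configuration.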

eessi-bot bot commented Mar 13, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace from trz42

    • expanded format: build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace
  • handling command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace resulted in:

    • no jobs were submitted

eessi-bot bot commented Mar 13, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace from trz42

    • expanded format: build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace
  • handling command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace resulted in:

    • no jobs were submitted

@eessi-bot-trz42

eessi-bot-trz42 bot commented Mar 13, 2025

Updates by the bot instance trz42-GH200-jr (click for details)
  • received bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace from trz42

    • expanded format: build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace
  • handling command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace resulted in:

@eessi-bot-trz42

eessi-bot-trz42 bot commented Mar 13, 2025

New job on instance trz42-GH200-jr for CPU micro-architecture aarch64-nvidia-grace for repository eessi.io-2023.06-software in job dir /p/project1/ceasybuilders/bot-trz42/jobs/2025.03/pr_968/13510000

  • test step below failed because ReFrame is not available in the stack for Grace yet
date | job status | comment
Mar 13 19:08:08 UTC 2025 | submitted | job id 13510000 awaits release by job manager
Mar 13 19:08:38 UTC 2025 | released | job awaits launch by Slurm scheduler
Mar 14 12:00:44 UTC 2025 | running | job 13510000 is running
Mar 14 12:14:14 UTC 2025 | finished | 😁 SUCCESS (click triangle for details)
    Details
      ✅ job output file slurm-13510000.out
      ✅ no message matching FATAL:
      ✅ no message matching ERROR:
      ✅ no message matching FAILED:
      ✅ no message matching required modules missing:
      ✅ found message(s) matching No missing installations
      ✅ found message matching .tar.gz created!
    Artefacts
      eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz
        size: 118 MiB (124619768 bytes)
        entries: 196847
        modules under 2023.06/software/linux/aarch64/nvidia/grace/modules/all
          EasyBuild/4.8.2.lua
          EasyBuild/4.9.0.lua
          EasyBuild/4.9.1.lua
          EasyBuild/4.9.2.lua
          EasyBuild/4.9.3.lua
          EasyBuild/4.9.4.lua
          EESSI-extend/2023.06-easybuild.lua
        software under 2023.06/software/linux/aarch64/nvidia/grace/software
          EasyBuild/4.8.2
          EasyBuild/4.9.0
          EasyBuild/4.9.1
          EasyBuild/4.9.2
          EasyBuild/4.9.3
          EasyBuild/4.9.4
          EESSI-extend/2023.06-easybuild
        other under 2023.06/software/linux/aarch64/nvidia/grace
          .lmod/lmodrc.lua
          .lmod/SitePackage.lua
Mar 14 12:14:14 UTC 2025 | test result | 😢 FAILURE (click triangle for details)
    Reason
      EESSI test suite was not run, test step itself failed to execute.
    Details
      ✅ job output file slurm-13510000.out
      ❌ found message matching ERROR:
      ✅ no message matching [\s*FAILED\s*].*Ran .* test case
Mar 14 13:37:38 UTC 2025 | not uploaded | transfer of eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz to S3 bucket failed
Mar 14 14:18:06 UTC 2025 | not uploaded | transfer of eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz to S3 bucket failed
Mar 14 14:28:39 UTC 2025 | uploaded | transfer of eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz to S3 bucket succeeded
Mar 14 21:46:51 UTC 2025 | uploaded | transfer of eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz to S3 bucket succeeded
Mar 14 21:50:58 UTC 2025 | uploaded | transfer of eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz to S3 bucket succeeded
Mar 18 05:25:43 UTC 2025 | uploaded | transfer of eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz to S3 bucket succeeded
Mar 18 05:48:19 UTC 2025 | uploaded | transfer of eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz to S3 bucket succeeded
Mar 18 09:43:24 UTC 2025 | uploaded | transfer of eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz to S3 bucket succeeded
Mar 19 11:43:16 UTC 2025 | uploaded | transfer of eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz to S3 bucket succeeded
Mar 21 22:03:12 UTC 2025 | uploaded | transfer of eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz to S3 bucket succeeded
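
The build verdict in the report above appears to come from scanning the job output for a set of messages. The following is a rough sketch of that kind of check, assuming the patterns are exactly the strings shown in the report; the function name `check_build_output`, the sample log text, and the rule that every check must pass are assumptions for illustration, not the bot's actual check script.

```python
import re

# Patterns copied from the report; a real check script may use different ones.
MUST_NOT_MATCH = ["FATAL:", "ERROR:", "FAILED:", "required modules missing:"]
MUST_MATCH = ["No missing installations", r"\.tar\.gz created!"]


def check_build_output(text):
    """Return a list of (passed, description) tuples for a job output text."""
    results = []
    for pattern in MUST_NOT_MATCH:
        passed = re.search(pattern, text) is None
        results.append((passed, f"no message matching {pattern}"))
    for pattern in MUST_MATCH:
        passed = re.search(pattern, text) is not None
        results.append((passed, f"found message matching {pattern}"))
    return results


def verdict(results):
    """Assumed rule: the build counts as a success only if every check passed."""
    return "SUCCESS" if all(passed for passed, _ in results) else "FAILURE"


# Made-up log excerpts, only to exercise the checks.
good_log = "No missing installations\n/tmp/example.tar.gz created!\n"
bad_log = good_log + "ERROR: something went wrong\n"

for passed, description in check_build_output(good_log):
    print("✅" if passed else "❌", description)
print(verdict(check_build_output(good_log)), verdict(check_build_output(bad_log)))
```

The test step is reported separately and uses its own patterns, which is how the same job can show a successful build next to a failed test result.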

@trz42
Collaborator Author

trz42 commented Mar 14, 2025

Build looks OK. Testing signing and upload to a different S3 bucket that is used for development.

@trz42 trz42 added the bot:deploy label (Ask bot to deploy missing software installations to EESSI) on Mar 14, 2025
@eessi-bot-trz42

Label bot:deploy has been set by user trz42, but this person does not have permission to trigger deployments

@eessi-bot-toprichard

Label bot:deploy has been set by user trz42, but this person does not have permission to trigger deployments

@trz42 trz42 added and removed the bot:deploy label (Ask bot to deploy missing software installations to EESSI) on Mar 14, 2025
@eessi-bot-toprichard

Label bot:deploy has been set by user trz42, but this person does not have permission to trigger deployments

@trz42
Collaborator Author

trz42 commented Mar 14, 2025

Uploading failed because some Lua initialisation scripts weren't available inside the container. Checking whether the --contain argument helps by not running these initialisation scripts.

@trz42 trz42 added and removed the bot:deploy label (Ask bot to deploy missing software installations to EESSI) on Mar 14, 2025
@eessi-bot-toprichard

Label bot:deploy has been set by user trz42, but this person does not have permission to trigger deployments

@trz42
Collaborator Author

trz42 commented Mar 14, 2025

--contain seems to help. Now trying to fix a locale issue and to provide the missing S3 access credentials needed for the actual uploads.

@trz42 trz42 added and removed the bot:deploy label (Ask bot to deploy missing software installations to EESSI) on Mar 14, 2025
@eessi-bot-toprichard

Label bot:deploy has been set by user trz42, but this person does not have permission to trigger deployments

@trz42
Collaborator Author

trz42 commented Mar 14, 2025

Uploading again with the updated sign script (using namespace).

@trz42 trz42 added and removed the bot:deploy label (Ask bot to deploy missing software installations to EESSI) on Mar 14, 2025
@trz42
Collaborator Author

trz42 commented Mar 18, 2025

Updated the bot instance with code from EESSI/eessi-bot-software-layer#308 and reconfigured it to upload to an S3 bucket on a MinIO server (used for testing). Resetting the deploy label to verify that the updated bot code still works.

@trz42 trz42 added and removed the bot:deploy label (Ask bot to deploy missing software installations to EESSI) on Mar 18, 2025
@eessi-bot-toprichard

Label bot:deploy has been set by user trz42, but this person does not have permission to trigger deployments

@trz42
Collaborator Author

trz42 commented Mar 18, 2025

Signature already existed. Recreating it.

@trz42 trz42 added and removed the bot:deploy label (Ask bot to deploy missing software installations to EESSI) on Mar 18, 2025
@eessi-bot-toprichard

Label bot:deploy has been set by user trz42, but this person does not have permission to trigger deployments

@boegel
Contributor

boegel commented Mar 18, 2025

> Signature already existed. Recreating it.

How?

@trz42
Collaborator Author

trz42 commented Mar 18, 2025

Redeploying to S3 test bucket.

@trz42 trz42 added and removed the bot:deploy label (Ask bot to deploy missing software installations to EESSI) on Mar 18, 2025
@eessi-bot-toprichard

Label bot:deploy has been set by user trz42, but this person does not have permission to trigger deployments

@trz42
Collaborator Author

trz42 commented Mar 19, 2025

Verifying whether the updated upload script handles pre-existing signature files (by deleting them before running the sign script). For reference, see EESSI/eessi-bot-software-layer#309.

Re-setting deploy label
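
The behaviour described in this comment (delete any pre-existing signature file, then run the sign script) could look roughly like the sketch below. This is not the code from EESSI/eessi-bot-software-layer#309: the function name `sign_with_cleanup` and the `run_sign_script` callable are hypothetical stand-ins, and only the `.sig` suffix and the wording of the log message are taken from this thread.

```python
import logging
from pathlib import Path


def sign_with_cleanup(artefact, run_sign_script):
    """Delete a stale '<artefact>.sig' if present, then create a fresh signature.

    run_sign_script is a hypothetical callable (artefact_path, sig_path) that
    stands in for whatever sign script the upload step really invokes.
    """
    artefact = Path(artefact)
    sig_file = artefact.with_name(artefact.name + ".sig")
    if sig_file.exists():
        sig_file.unlink()
        logging.info("removed existing signature file (%s)", sig_file)
    run_sign_script(artefact, sig_file)
    return sig_file
```

A caller would pass in whatever actually produces the signature, for example a small wrapper around the site's sign script, so that the clean-up logic stays independent of the signing tool.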

@trz42 trz42 added and removed the bot:deploy label (Ask bot to deploy missing software installations to EESSI) on Mar 19, 2025
@eessi-bot-toprichard

Label bot:deploy has been set by user trz42, but this person does not have permission to trigger deployments

@trz42
Collaborator Author

trz42 commented Mar 19, 2025

Seems it works...

  • bot log (pyghee.log) contains a line with INFO: removed existing signature file (/.../2025.03/pr_968/13510000/eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz.sig)
  • The above signature was manually changed before the test to only include the string foo.
  • The uploaded files (including the signature for the tarball) have the following metadata (note time and size)
    2025-03-19 12:43:11  124619768 eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz
    2025-03-19 12:43:16        663 eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz.meta.txt
    2025-03-19 12:43:15        878 eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz.meta.txt.sig
    2025-03-19 12:43:11        878 eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz.sig
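
The reasoning behind "note time and size" can be spelled out with a small sketch: if the stale signature containing only the string foo had been uploaded, the `.sig` entry would be a few bytes long rather than 878. The listing text below is copied from the comment above; the parsing helper `parse_listing` is hypothetical, and whether the placeholder file ended with a newline is unknown, so the comparison only assumes it was at most a handful of bytes.

```python
LISTING = """\
2025-03-19 12:43:11  124619768 eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz
2025-03-19 12:43:16        663 eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz.meta.txt
2025-03-19 12:43:15        878 eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz.meta.txt.sig
2025-03-19 12:43:11        878 eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz.sig
"""


def parse_listing(text):
    """Map file name -> (timestamp string, size in bytes) for 4-column lines."""
    entries = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or unexpected lines
        date, time, size, name = parts
        entries[name] = (f"{date} {time}", int(size))
    return entries


entries = parse_listing(LISTING)
tarball = "eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz"
sig_time, sig_size = entries[tarball + ".sig"]

# The placeholder signature held only "foo" (3 bytes, 4 with a newline).
placeholder_max_size = len(b"foo\n")
print("signature was recreated:", sig_size > placeholder_max_size)
# The tarball size should match the size reported by the build job.
print("tarball size matches build report:", entries[tarball][1] == 124619768)
```

Both signature files having the same size is consistent with them being produced by the same signing step, although the listing alone does not prove that.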
    

@trz42
Collaborator Author

trz42 commented Mar 21, 2025

Ok, let's deploy this to the default S3 to get it ingested into EESSI...

@trz42 trz42 added and removed the bot:deploy label (Ask bot to deploy missing software installations to EESSI) on Mar 21, 2025
@eessi-bot-toprichard

Label bot:deploy has been set by user trz42, but this person does not have permission to trigger deployments

@bedroge
Collaborator

bedroge commented Mar 21, 2025

Staging PR merged.

@bedroge bedroge merged commit 7250444 into EESSI:2023.06-software.eessi.io Mar 21, 2025
59 checks passed

eessi-bot bot commented Mar 21, 2025

PR merged! Moved [] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2025.03.21

eessi-bot bot commented Mar 21, 2025

PR merged! Moved [] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2025.03.21

@eessi-bot-trz42

PR merged! Moved ['/p/project1/ceasybuilders/bot-trz42/jobs/2025.03/pr_968/13510000'] to /p/project1/ceasybuilders/bot-trz42/trash_bin/EESSI/software-layer/2025.03.21

Labels
  • 2023.06-software.eessi.io: 2023.06 version of software.eessi.io
  • bot:deploy: Ask bot to deploy missing software installations to EESSI
  • grace: NVIDIA Grace CPU
4 participants