Test building on Snellius: Zen4/H100 #903

Open
wants to merge 32 commits into
base: 2023.06-software.eessi.io
32 commits
6674bc3
Add pmt for H100 to test eessi bot on Snellius
Jan 31, 2025
cfcadfd
Add CUDA explicitly, since we also need the runtime part to be insta…
Jan 31, 2025
705db14
Need to strip single quotes
Feb 4, 2025
d039321
Need to strip single quotes
Feb 4, 2025
afddc24
It still didn't get expanded over multiple entries. Better make an ex…
Feb 4, 2025
63efcdc
Use bash array when looping over loadable modules in test.sh
Feb 5, 2025
0e17f32
Extra echo's for debugging
Feb 5, 2025
30c305b
print TMPDIR as word
Feb 5, 2025
190406d
Remove debugging echo's, bind-mount the TMPDIR if it is set
Feb 5, 2025
79bdc9b
Add debugging output
Feb 5, 2025
06edeb3
Fix typo
Feb 5, 2025
16c748f
Check tmpdir early on
Feb 5, 2025
6960cfd
Add more debugging output
Feb 5, 2025
8344047
More debugging output
Feb 5, 2025
71c2dc7
Make sure we unconditionally set TMPDIR, and make sure we also set --…
Feb 5, 2025
ed650a0
Debugging cleanup
Feb 5, 2025
be2fd57
Add something fast to build..
Feb 5, 2025
86b5608
Make sure STORAGE gets bind-mounted
Feb 5, 2025
a6d963e
Add BCFtools to the easyconfig with latest eb version
Feb 5, 2025
e444af8
Fix issue that pops up if the nvidia-smi command is present on non-GP…
Feb 6, 2025
8c3ad1d
Temporarily disable set -e, because we know and accept that nvidia-sm…
Feb 6, 2025
aaf01de
Fix the check in the if-statement
Feb 6, 2025
7571be6
Make warning more clear
Feb 6, 2025
6d164b9
Debugging output
Feb 6, 2025
682fed1
Remove BCFtool as that was only a test build on CPU with short build …
Feb 6, 2025
4719468
Remove debugging prints
Feb 10, 2025
6fcaf89
Build CUDA 12.1.1 and CUDA 12.4.0
Feb 12, 2025
4b36207
Fix mistake
Feb 13, 2025
f241881
Add pmt
Feb 13, 2025
7523f94
Merge branch '2023.06-software.eessi.io' into test_H100_build
Feb 13, 2025
383ed27
Add support for compression with zstd, which is typically much faster…
Apr 2, 2025
95d7d56
Add -T0 and correct if-elif-then statements
Apr 2, 2025
7 changes: 7 additions & 0 deletions bot/build.sh
@@ -103,6 +103,10 @@ echo "bot/build.sh: STORAGE='${STORAGE}'"
 # make sure ${STORAGE} exists
 mkdir -p ${STORAGE}
 
+# Make sure ${STORAGE} gets bind-mounted
+# This will make sure that any subsequent jobs that create dirs or files under STORAGE have access to it in the container
+export SINGULARITY_BIND="${SINGULARITY_BIND},${STORAGE}"
+
 # make sure the base tmp storage is unique
 JOB_STORAGE=$(mktemp --directory --tmpdir=${STORAGE} bot_job_tmp_XXX)
 echo "bot/build.sh: created unique base tmp storage directory at ${JOB_STORAGE}"
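If `SINGULARITY_BIND` is unset or empty when the export runs, the resulting value starts with a comma. Whether the container runtime tolerates that empty entry is not confirmed here. A minimal sketch of an append that avoids it, using a hypothetical helper `append_bind` that is not part of `bot/build.sh` (the example path is also an assumption):

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in bot/build.sh): append one path to a
# comma-separated bind list without producing a leading comma.
append_bind() {
    local current="$1"
    local new_path="$2"
    # ${current:+...} expands to "<current>," only when current is non-empty
    printf '%s\n' "${current:+${current},}${new_path}"
}

# Example value; /scratch/eessi is an assumed path for illustration only
STORAGE="/scratch/eessi"
SINGULARITY_BIND="$(append_bind "${SINGULARITY_BIND:-}" "${STORAGE}")"
export SINGULARITY_BIND
echo "SINGULARITY_BIND='${SINGULARITY_BIND}'"
```

With an empty starting value the helper returns just the new path; with an existing list it appends `,<path>`.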
@@ -306,6 +310,9 @@ else
     TARBALL_STEP_ARGS+=("--resume" "${REMOVAL_TMPDIR}")
 fi
 
+# Make sure we define storage, so that the TMPDIR is set to this in eessi_container.sh
+TARBALL_STEP_ARGS+=("--storage" "${STORAGE}")
+
 timestamp=$(date +%s)
 # to set EESSI_VERSION we need to source init/eessi_defaults now
 source $software_layer_dir/init/eessi_defaults
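The `--storage` option is appended to the `TARBALL_STEP_ARGS` array rather than to a plain string. A short sketch of why that matters, using made-up values: each array element stays a single argument, even when it contains spaces, as long as the array is expanded as `"${ARGS[@]}"`. The `count_args` function is only a stand-in for the script that would receive the arguments.

```shell
#!/usr/bin/env bash
# Stand-in for the called script: report how many arguments arrived
count_args() {
    echo "$#"
}

# Assumed example value containing a space, to make the difference visible
STORAGE="/tmp/my storage"

ARGS=()
ARGS+=("--storage" "${STORAGE}")

# Quoted array expansion keeps each element as one argument
quoted=$(count_args "${ARGS[@]}")
# Unquoted expansion is word-split, so the path breaks into two arguments
unquoted=$(count_args ${ARGS[@]})

echo "quoted: ${quoted}, unquoted: ${unquoted}"   # quoted: 2, unquoted: 3
```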
@@ -2,3 +2,4 @@ easyconfigs:
   - CUDA-12.1.1.eb:
       options:
         accept-eula-for: CUDA
+  - pmt-1.2.0-GCCcore-12.3.0-CUDA-12.1.1.eb
90 changes: 56 additions & 34 deletions eessi_container.sh
@@ -363,47 +363,55 @@ fi
 # 2. set up host storage/tmp if necessary
 # if session to be resumed from a previous one (--resume ARG) and ARG is a directory
 #   just reuse ARG, define environment variables accordingly and skip creating a new
-#   tmp storage
+#   eessi.XXXXXXXXXXX tempdir within TMPDIR
+
+# But before we call mktemp, we need to potentially set or create TMPDIR
+# as location for temporary data use in the following order
+#   a. command line argument -l|--host-storage
+#   b. env var TMPDIR
+#   c. /tmp
+# note, we ensure that (a) takes precedence by setting TMPDIR to STORAGE
+#     if STORAGE is not empty
+# note, (b) & (c) are automatically ensured by using 'mktemp -d --tmpdir' to
+#     create a temporary directory
+if [[ ! -z ${STORAGE} ]]; then
+  export TMPDIR=${STORAGE}
+  # mktemp fails if TMPDIR does not exist, so let's create it
+  mkdir -p ${TMPDIR}
+fi
+if [[ ! -z ${TMPDIR} ]]; then
+  # TODO check if TMPDIR already exists
+  # mktemp fails if TMPDIR does not exist, so let's create it
+  mkdir -p ${TMPDIR}
+fi
+if [[ -z ${TMPDIR} ]]; then
+  # mktemp falls back to using /tmp if TMPDIR is empty
+  # TODO check if /tmp is writable, large enough and usable (different
+  #      features for ro-access and rw-access)
+  [[ ${VERBOSE} -eq 1 ]] && echo "skipping sanity checks for /tmp"
+fi
+
+# Now, set the EESSI_HOST_STORAGE either based on the resumed directory, or create a new one with mktemp
 if [[ ! -z ${RESUME} && -d ${RESUME} ]]; then
   # resume from directory ${RESUME}
   #   skip creating a new tmp directory, just set environment variables
   echo "Resuming from previous run using temporary storage at ${RESUME}"
   EESSI_HOST_STORAGE=${RESUME}
 else
-  # we need a tmp location (and possibly init it with ${RESUME} if it was not
-  #   a directory
-
-  # as location for temporary data use in the following order
-  #   a. command line argument -l|--host-storage
-  #   b. env var TMPDIR
-  #   c. /tmp
-  # note, we ensure that (a) takes precedence by setting TMPDIR to STORAGE
-  #     if STORAGE is not empty
-  # note, (b) & (c) are automatically ensured by using 'mktemp -d --tmpdir' to
-  #     create a temporary directory
-  if [[ ! -z ${STORAGE} ]]; then
-    export TMPDIR=${STORAGE}
-    # mktemp fails if TMPDIR does not exist, so let's create it
-    mkdir -p ${TMPDIR}
-  fi
-  if [[ ! -z ${TMPDIR} ]]; then
-    # TODO check if TMPDIR already exists
-    # mktemp fails if TMPDIR does not exist, so let's create it
-    mkdir -p ${TMPDIR}
-  fi
-  if [[ -z ${TMPDIR} ]]; then
-    # mktemp falls back to using /tmp if TMPDIR is empty
-    # TODO check if /tmp is writable, large enough and usable (different
-    #      features for ro-access and rw-access)
-    [[ ${VERBOSE} -eq 1 ]] && echo "skipping sanity checks for /tmp"
-  fi
   EESSI_HOST_STORAGE=$(mktemp -d --tmpdir eessi.XXXXXXXXXX)
   echo "Using ${EESSI_HOST_STORAGE} as tmp directory (to resume session add '--resume ${EESSI_HOST_STORAGE}')."
 fi
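As the comments describe, the base directory for `mktemp` is chosen in the order: explicit storage argument, then `TMPDIR`, then `/tmp`. A minimal sketch of that precedence, written as a hypothetical function `pick_tmp_base` that is not part of `eessi_container.sh`. It uses `mktemp -p DIR`, the short form of `--tmpdir=DIR` in GNU mktemp, and a throwaway directory stands in for the storage argument.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the base directory mktemp should use,
# following the order: explicit storage > TMPDIR > /tmp
pick_tmp_base() {
    local storage="$1"
    local tmpdir="$2"
    if [ -n "${storage}" ]; then
        echo "${storage}"
    elif [ -n "${tmpdir}" ]; then
        echo "${tmpdir}"
    else
        echo "/tmp"
    fi
}

# Illustration: a throwaway directory plays the role of the storage argument
demo_storage="$(mktemp -d)"
base="$(pick_tmp_base "${demo_storage}" "${TMPDIR:-}")"
mkdir -p "${base}"    # mktemp fails if the base directory does not exist
workdir="$(mktemp -d -p "${base}" eessi.XXXXXXXXXX)"
echo "created ${workdir}"
```

Deciding the base once, before the resume check, means both branches of the `if` see the same `TMPDIR`.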

-# if ${RESUME} is a file (assume a tgz), unpack it into ${EESSI_HOST_STORAGE}
+# if ${RESUME} is a file, unpack it into ${EESSI_HOST_STORAGE}
 if [[ ! -z ${RESUME} && -f ${RESUME} ]]; then
-  tar xf ${RESUME} -C ${EESSI_HOST_STORAGE}
+  if [[ "${RESUME}" == *.tgz ]]; then
+    tar xf ${RESUME} -C ${EESSI_HOST_STORAGE}
+  # Add support for resuming from zstd-compressed tarballs
+  elif [[ "${RESUME}" == *.zst && -x "$(command -v zstd)" ]]; then
+    zstd -dc ${RESUME} | tar -xf - -C ${EESSI_HOST_STORAGE}
+  elif [[ "${RESUME}" == *.zst && ! -x "$(command -v zstd)" ]]; then
+    fatal_error "Trying to resume from tarball ${RESUME} which was compressed using zstd, but zstd command not found"
+  fi
   echo "Resuming from previous run using temporary storage ${RESUME} unpacked into ${EESSI_HOST_STORAGE}"
 fi
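The extension checks in this hunk leave one case uncovered: a file that is neither `*.tgz` nor `*.zst` is not unpacked, yet the "Resuming from previous run" message is still printed. A sketch of the same dispatch as a `case` statement with an explicit fallback, using a hypothetical function `unpack_resume_tarball`. The error handling is a placeholder that returns non-zero; it is not the script's `fatal_error`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: unpack a resume tarball into dest based on its extension
unpack_resume_tarball() {
    local tarball="$1"
    local dest="$2"
    case "${tarball}" in
        *.tgz)
            tar -xf "${tarball}" -C "${dest}"
            ;;
        *.zst)
            if ! command -v zstd >/dev/null 2>&1; then
                echo "ERROR: ${tarball} is zstd-compressed, but zstd was not found" >&2
                return 1
            fi
            zstd -dc "${tarball}" | tar -xf - -C "${dest}"
            ;;
        *)
            echo "ERROR: unsupported tarball extension: ${tarball}" >&2
            return 1
            ;;
    esac
}

# Round trip with a throwaway gzip-compressed tarball
src="$(mktemp -d)"
dest="$(mktemp -d)"
echo "hello" > "${src}/marker.txt"
tar -czf "${src}.tgz" -C "${src}" .
unpack_resume_tarball "${src}.tgz" "${dest}" && cat "${dest}/marker.txt"
rm -rf "${src}" "${src}.tgz" "${dest}"
```

Whether unknown extensions should be an error or fall back to plain `tar xf` is a design choice for the PR author; the sketch only makes the case explicit.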

@@ -853,17 +861,31 @@ if [[ ! -z ${SAVE} ]]; then
   # ARCH which might have been used internally, eg, when software packages
   # were built ... we rather keep the script here "stupid" and leave the handling
   # of these aspects to where the script is used
+
+  # Compression with gzip may be quite slow. On some systems, the pipeline takes ~20 mins for a 2 min build because of this.
+  # Check if zstd is present for faster compression and decompression
   if [[ -d ${SAVE} ]]; then
     # assume SAVE is name of a directory to which tarball shall be written to
     #   name format: tmp_storage-{TIMESTAMP}.tgz
     ts=$(date +%s)
-    TGZ=${SAVE}/tmp_storage-${ts}.tgz
+    if [[ -x "$(command -v zstd)" ]]; then
+      TARBALL=${SAVE}/tmp_storage-${ts}.zst
+      tar -cf - -C ${EESSI_TMPDIR} . | zstd -T0 > ${TARBALL}
+    else
+      TARBALL=${SAVE}/tmp_storage-${ts}.tgz
+      tar czf ${TARBALL} -C ${EESSI_TMPDIR} .
+    fi
   else
     # assume SAVE is the full path to a tarball's name
-    TGZ=${SAVE}
+    TARBALL=${SAVE}
+    # if zstd is present and a .zst extension is asked for, use it
+    if [[ "${SAVE}" == *.zst && -x "$(command -v zstd)" ]]; then
+      tar -cf - -C ${EESSI_TMPDIR} . | zstd -T0 > ${TARBALL}
+    else
+      tar czf ${TARBALL} -C ${EESSI_TMPDIR} .
+    fi
   fi
-  tar czf ${TGZ} -C ${EESSI_TMPDIR} .
-  echo "Saved contents of tmp directory '${EESSI_TMPDIR}' to tarball '${TGZ}' (to resume session add '--resume ${TGZ}')"
+  echo "Saved contents of tmp directory '${EESSI_TMPDIR}' to tarball '${TARBALL}' (to resume session add '--resume ${TARBALL}')"
 fi
 
 # TODO clean up tmp by default? only retain if another option provided (--retain-tmp)
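Both branches of the save step make the same decision: use zstd if it is available, otherwise gzip. One way to keep the `tar` invocations in a single place is a small helper. `save_tmp_tarball` below is hypothetical and only sketches the idea. Note the trailing `.` after `-C <dir>`: without a member argument, tar has nothing to archive.

```shell
#!/usr/bin/env bash
# Hypothetical helper: archive the contents of src_dir into tarball,
# using zstd only when a .zst name is requested and zstd is installed.
save_tmp_tarball() {
    local src_dir="$1"
    local tarball="$2"
    case "${tarball}" in
        *.zst)
            if command -v zstd >/dev/null 2>&1; then
                # -T0 lets zstd use all available cores
                tar -cf - -C "${src_dir}" . | zstd -T0 > "${tarball}"
                return
            fi
            ;;
    esac
    # Fallback: gzip-compressed tarball
    tar -czf "${tarball}" -C "${src_dir}" .
}

# Usage with throwaway directories
src="$(mktemp -d)"
out="$(mktemp -d)"
echo "data" > "${src}/file.txt"
save_tmp_tarball "${src}" "${out}/tmp_storage-$(date +%s).tgz"
ls "${out}"
rm -rf "${src}" "${out}"
```

Like the diff, this sketch falls back to gzip even when the requested name ends in `.zst` and zstd is missing; a stricter variant could fail instead so the extension never misstates the format.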
2 changes: 1 addition & 1 deletion test_suite.sh
@@ -203,7 +203,7 @@ else
     fatal_error "Failed to extract names of tests to run: ${REFRAME_NAME_ARGS}"
     exit ${test_selection_exit_code}
 fi
-# Allow people deploying the bot to overrwide this
+# Allow people deploying the bot to override this
 if [ -z "$REFRAME_SCALE_TAG" ]; then
     REFRAME_SCALE_TAG="--tag 1_node"
 fi
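The override pattern here (use the deployer's value if one is set, otherwise a default) can also be written with shell parameter expansion. A minimal sketch; the function name `effective_scale_tag` and the alternative tag value in the second call are made up for illustration.

```shell
#!/usr/bin/env bash
# Print the effective scale tag: the caller's value if non-empty,
# otherwise the default used in test_suite.sh
effective_scale_tag() {
    local tag="${1:-}"
    # ${var:-default} substitutes the default when var is unset or empty
    printf '%s\n' "${tag:---tag 1_node}"
}

effective_scale_tag ""                 # prints: --tag 1_node
effective_scale_tag "--tag 2_nodes"    # prints: --tag 2_nodes
```

The explicit `if [ -z ... ]` form in the diff behaves the same way; the expansion is just a more compact spelling.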