
Update modules on alcf polaris #6985

Open
wants to merge 13 commits into master
Conversation

@amametjanov (Member) commented Feb 7, 2025

Update modules on ALCF Polaris. Also:

  • update queues
  • use cray wrappers for serial-gnu
  • update cmake for gnugpu builds
  • add eamxx cmake machine file
  • run small eam and mpas-o cases on 1 polaris node
  • add MOAB_ROOT env-var (see the sketch below)

Fixes #6422

[BFB]
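For reference, a minimal sketch of how the MOAB_ROOT env-var entry could look in config_machines.xml; the install path below is a placeholder, not the actual Polaris location:

<environment_variables>
  <!-- placeholder path; point this at the MOAB install used on Polaris -->
  <env name="MOAB_ROOT">/path/to/moab/install</env>
</environment_variables>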

@amametjanov added the Machine Files and BFB (PR leaves answers BFB) labels Feb 7, 2025
@amametjanov self-assigned this Feb 7, 2025
@amametjanov (Member, Author) commented Feb 7, 2025

Testing:

@amametjanov marked this pull request as ready for review February 11, 2025 21:50
amametjanov added a commit that referenced this pull request Feb 12, 2025
Also do not archive old test data
amametjanov added a commit that referenced this pull request Feb 13, 2025
To avoid OOM errors. Also add path to Switch.pm perl5 lib.
amametjanov added a commit that referenced this pull request Feb 14, 2025
@rljacob (Member) commented Feb 26, 2025

Adding @bartgol since this touches some code in eamxx, so testing needs to run at SNL.

@bartgol (Contributor) left a comment


The code in eamxx is a mach file, so testing on the SNL machine, albeit triggered, is irrelevant. Still, I'll approve and retrigger, so we get more shiny green check marks...

@gsever commented Feb 28, 2025

I would like to note a few points regarding this Polaris update.

  1. Is the default build setting “-O2” for CUDA instead of “-O3”? The “F2010-SCREAMv1” compset used to build in ~400s, vs. ~600s now with the current settings on a 32-core compute node.

  2. Even though the “gnugpu” compiler option works, shouldn’t it be specified more explicitly in config_machines.xml, instead of:

<modules compiler="gnugpu">
  <command name="load">nvhpc-mixed</command>
</modules>
<modules compiler="gnugpu">
  <command name="load">PrgEnv-gnu/8.5.0</command>
  <command name="load">cudatoolkit-standalone/12.2.2</command>
</modules>

Likewise, listing the other modules in the file with specific versions would ease testing and reproduction.

  3. The queue specs in config_batch.xml, with a max allowed runtime of 3h, don’t seem to target longer production runs - https://docs.alcf.anl.gov/polaris/running-jobs/ (see the sketch after this list).

  4. While there is merit, for testing purposes, to executing CPU tests on a GPU-specific machine, it would be much more useful if a configuration were deployed on ALCF’s CRUX for practical CPU-based production runs. It may be more helpful to see other GPU tests deployed on Polaris instead (e.g., E3SM_EAMXX builds).
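(To illustrate point 3: a hypothetical config_batch.xml queue block with longer walltime limits is sketched below. The queue names, node limits, and walltime values are placeholders following the usual CIME attribute names, not the actual Polaris settings.)

<queues>
  <!-- placeholder entries; actual Polaris queue names and limits differ -->
  <queue walltimemax="01:00:00" nodemin="1" nodemax="2" default="true">debug</queue>
  <queue walltimemax="06:00:00" nodemin="10" nodemax="496">prod</queue>
</queues>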

@gsever commented Mar 4, 2025

Related to the Polaris setup, I am getting a runtime error with high node counts (e.g., 32-128 nodes). The problem usually goes away if I re-submit the case, though some runs also succeed on the first attempt.

Please see the error message from a recent attempt below:

cat e3sm.log.3623155.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov.250304-140655

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
x3015c0s31b0n0.hsn.cm.polaris.alcf.anl.gov 502: #0  0x1538a5e0c2e2 in ???
#1  0x1538a5e0b475 in ???
x3015c0s31b0n0.hsn.cm.polaris.alcf.anl.gov 502: #2  0x1538a4a53dbf in ???
    at /usr/src/debug/glibc-2.31-150300.63.1.x86_64/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
x3015c0s31b0n0.hsn.cm.polaris.alcf.anl.gov 502: #3  0x1538a4b1aa77 in munmap
    at ../sysdeps/unix/syscall-template.S:78
x3015c0s31b0n0.hsn.cm.polaris.alcf.anl.gov 502: #4  0x1538a4aa1eaa in new_heap
    at /usr/src/debug/glibc-2.31-150300.63.1.x86_64/malloc/arena.c:495
#5  0x1538a4aa2a6a in _int_new_arena
    at /usr/src/debug/glibc-2.31-150300.63.1.x86_64/malloc/arena.c:693
#6  0x1538a4aa2a6a in arena_get2
    at /usr/src/debug/glibc-2.31-150300.63.1.x86_64/malloc/arena.c:912
#7  0x1538a4aa5828 in arena_get2
    at /usr/src/debug/glibc-2.31-150300.63.1.x86_64/malloc/arena.c:880
#8  0x1538a4aa5828 in tcache_init
    at /usr/src/debug/glibc-2.31-150300.63.1.x86_64/malloc/malloc.c:2981
#9  0x1538a4aa669d in tcache_init
    at /usr/src/debug/glibc-2.31-150300.63.1.x86_64/malloc/malloc.c:2978
#10  0x1538a4aa669d in __GI___libc_malloc
    at /usr/src/debug/glibc-2.31-150300.63.1.x86_64/malloc/malloc.c:3044
#11  0x1538abf9f45d in ???
#12  0x1538ac06208e in ???
#13  0x1538abf9a44e in ???
x3015c0s31b0n0.hsn.cm.polaris.alcf.anl.gov 502: #14  0x1538a5d896e9 in start_thread
    at /usr/src/debug/glibc-2.31-150300.63.1.x86_64/nptl/pthread_create.c:477
#15  0x1538a4b2150e in clone
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
#16  0xffffffffffffffff in ???
x3015c0s31b0n0.hsn.cm.polaris.alcf.anl.gov: rank 502 died from signal 11 and dumped core
x3015c0s31b0n0.hsn.cm.polaris.alcf.anl.gov: rank 501 died from signal 15

Do you have any suggestions on what the root cause of this issue may be? Unlike Perlmutter, Polaris uses a helper script to assign GPUs to MPI ranks (see https://docs.alcf.anl.gov/running-jobs/example-job-scripts/?h=affinity#setting-gpu-affinity-for-each-mpi-rank), and the error appears right after the rank assignment.

Thanks,

@bartgol (Contributor) commented Mar 5, 2025

I do not have any idea what could cause that error, sorry.

@gsever commented Mar 5, 2025

@bartgol do you have any recommendations for debugging that failure further? Maybe there is a better way to capture the state of the system/executable to compare successful and segfaulting runs. There is a core file output, but I am not sure it is useful without the default env/compilation settings.

@bartgol (Contributor) commented Mar 5, 2025

I don't. But maybe @jgfouca has some CIME trick up his sleeve?

@jgfouca (Member) commented Mar 5, 2025

I would at least try a DEBUG build so you can get a readable stacktrace.

Labels
BFB (PR leaves answers BFB), Machine Files
Development

Successfully merging this pull request may close these issues.

(GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE on polaris
6 participants