
Update modules on alcf polaris #6985

Open
wants to merge 13 commits into master
Conversation

@amametjanov (Member) commented Feb 7, 2025

Update modules on ALCF Polaris. Also:

  • update queues
  • use cray wrappers for serial-gnu
  • update cmake for gnugpu builds
  • add eamxx cmake machine file
  • run small eam and mpas-o cases on 1 polaris node
  • add MOAB_ROOT env-var (see the sketch below)

Fixes #6422

[BFB]
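For reference, a minimal sketch of how the MOAB_ROOT env-var entry could look in config_machines.xml; the install path below is a placeholder, not the actual Polaris location:

<environment_variables>
  <!-- placeholder path; point this at the MOAB install used on Polaris -->
  <env name="MOAB_ROOT">/path/to/moab/install</env>
</environment_variables>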

@amametjanov added the Machine Files and BFB (PR leaves answers BFB) labels Feb 7, 2025
@amametjanov self-assigned this Feb 7, 2025
@amametjanov (Member, Author) commented Feb 7, 2025

Testing:

@amametjanov marked this pull request as ready for review February 11, 2025 21:50
amametjanov added a commit that referenced this pull request Feb 12, 2025
Also do not archive old test data
amametjanov added a commit that referenced this pull request Feb 13, 2025
To avoid OOM errors. Also add path to Switch.pm perl5 lib.
amametjanov added a commit that referenced this pull request Feb 14, 2025
@rljacob (Member) commented Feb 26, 2025

Adding @bartgol since this touches some code in eamxx, so testing needs to run at SNL.

@bartgol (Contributor) left a comment


The code in eamxx is a mach file, so testing on the SNL machine, albeit triggered, is irrelevant. Still, I'll approve and retrigger, so we get more shiny green check marks...

@gsever commented Feb 28, 2025

I would like to note a few points regarding this Polaris update.

  1. Is the default build setting “-O2” for CUDA instead of “-O3”? The “F2010-SCREAMv1” compset used to build in ~400s, vs. ~600s now with the current settings on a 32-core compute node.

  2. Even though the “gnugpu” compiler option works, shouldn’t it be specified more explicitly in config_machines.xml, instead of:

<modules compiler="gnugpu">
  <command name="load">nvhpc-mixed</command>
</modules>
<modules compiler="gnugpu">
  <command name="load">PrgEnv-gnu/8.5.0</command>
  <command name="load">cudatoolkit-standalone/12.2.2</command>
</modules>

Likewise, listing the other modules in the file with specific versions would ease testing and reproduction.

  3. The queue specs in config_batch.xml, with a max allowed runtime of 3h, don’t seem to target longer production runs - https://docs.alcf.anl.gov/polaris/running-jobs/ (see the sketch after this list).

  4. While there is merit, for testing purposes, to executing CPU tests on a GPU-specific machine, it would be much more useful if a configuration were deployed on ALCF’s CRUX for practical CPU-based production runs. It may be more helpful to see other GPU tests deployed on Polaris instead (e.g., E3SM_EAMXX builds).
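(To illustrate point 3: a hypothetical config_batch.xml queue block with longer walltime limits is sketched below. The queue names, node limits, and walltime values are placeholders following the usual CIME attribute names, not the actual Polaris settings.)

<queues>
  <!-- placeholder entries; actual Polaris queue names and limits differ -->
  <queue walltimemax="01:00:00" nodemin="1" nodemax="2" default="true">debug</queue>
  <queue walltimemax="06:00:00" nodemin="10" nodemax="496">prod</queue>
</queues>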

@gsever commented Mar 4, 2025

Related to the Polaris setup, I am getting a runtime error with high node counts (e.g., 32-128 nodes). The problem usually goes away if I re-submit the case, though some runs also succeed on the first attempt.

Please see the error message from a recent attempt below:

cat e3sm.log.3623155.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov.250304-140655

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
x3015c0s31b0n0.hsn.cm.polaris.alcf.anl.gov 502: #0  0x1538a5e0c2e2 in ???
#1  0x1538a5e0b475 in ???
x3015c0s31b0n0.hsn.cm.polaris.alcf.anl.gov 502: #2  0x1538a4a53dbf in ???
    at /usr/src/debug/glibc-2.31-150300.63.1.x86_64/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
x3015c0s31b0n0.hsn.cm.polaris.alcf.anl.gov 502: #3  0x1538a4b1aa77 in munmap
    at ../sysdeps/unix/syscall-template.S:78
x3015c0s31b0n0.hsn.cm.polaris.alcf.anl.gov 502: #4  0x1538a4aa1eaa in new_heap
    at /usr/src/debug/glibc-2.31-150300.63.1.x86_64/malloc/arena.c:495
#5  0x1538a4aa2a6a in _int_new_arena
    at /usr/src/debug/glibc-2.31-150300.63.1.x86_64/malloc/arena.c:693
#6  0x1538a4aa2a6a in arena_get2
    at /usr/src/debug/glibc-2.31-150300.63.1.x86_64/malloc/arena.c:912
#7  0x1538a4aa5828 in arena_get2
    at /usr/src/debug/glibc-2.31-150300.63.1.x86_64/malloc/arena.c:880
#8  0x1538a4aa5828 in tcache_init
    at /usr/src/debug/glibc-2.31-150300.63.1.x86_64/malloc/malloc.c:2981
#9  0x1538a4aa669d in tcache_init
    at /usr/src/debug/glibc-2.31-150300.63.1.x86_64/malloc/malloc.c:2978
#10  0x1538a4aa669d in __GI___libc_malloc
    at /usr/src/debug/glibc-2.31-150300.63.1.x86_64/malloc/malloc.c:3044
#11  0x1538abf9f45d in ???
#12  0x1538ac06208e in ???
#13  0x1538abf9a44e in ???
x3015c0s31b0n0.hsn.cm.polaris.alcf.anl.gov 502: #14  0x1538a5d896e9 in start_thread
    at /usr/src/debug/glibc-2.31-150300.63.1.x86_64/nptl/pthread_create.c:477
#15  0x1538a4b2150e in clone
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
#16  0xffffffffffffffff in ???
x3015c0s31b0n0.hsn.cm.polaris.alcf.anl.gov: rank 502 died from signal 11 and dumped core
x3015c0s31b0n0.hsn.cm.polaris.alcf.anl.gov: rank 501 died from signal 15

Do you have any suggestions on what the root cause of this issue may be? Unlike Perlmutter, Polaris uses a helper script to assign GPUs to MPI ranks (see https://docs.alcf.anl.gov/running-jobs/example-job-scripts/?h=affinity#setting-gpu-affinity-for-each-mpi-rank), and the error appears right after the rank assignment.

Thanks,

@bartgol (Contributor) commented Mar 5, 2025

I do not have any idea what could cause that error, sorry.

@gsever commented Mar 5, 2025

@bartgol do you have any recommendations for debugging that failure further? Maybe there is a better way to capture the state of the system/executable to compare successful and segfaulting runs. There is a core file output, but I am not sure it is useful without the default env/compilation settings.

@bartgol (Contributor) commented Mar 5, 2025

I don't. But maybe @jgfouca has some CIME trick up his sleeve?

@jgfouca (Member) commented Mar 5, 2025

I would at least try a DEBUG build so you can get a readable stacktrace.

Labels
BFB (PR leaves answers BFB), Machine Files
Development

Successfully merging this pull request may close these issues.

(GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE on polaris
6 participants