Update modules on ALCF Polaris #6985
base: master
Conversation
Also, fix build issues (a sketch of these settings follows below):
- NVCC_WRAPPER_DEFAULT_COMPILER=CC: g++ v7.5.0 is too old
- CRAYPE_LINK_TYPE=dynamic
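A minimal sketch of how these fixes might be applied in a Polaris environment script. Both variables are standard (NVCC_WRAPPER_DEFAULT_COMPILER is read by Kokkos' nvcc_wrapper, CRAYPE_LINK_TYPE by the Cray programming environment), but the snippet itself is illustrative, not the PR's actual machine config:

```bash
# Use the Cray compiler wrapper CC as the host compiler behind Kokkos'
# nvcc_wrapper, since the default g++ 7.5.0 is too old.
export NVCC_WRAPPER_DEFAULT_COMPILER=CC

# Link dynamically in the Cray programming environment instead of the
# static default.
export CRAYPE_LINK_TYPE=dynamic
```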
Testing:
Kokkos has issues with CMake v3.28.4 and later. Also, reset modules prior to loading a new environment.
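A minimal sketch of the reset-then-load pattern, assuming Lmod-style modules; the module names and versions below are illustrative, not taken from this PR:

```bash
# Return to the system default module set so stale modules from an earlier
# session cannot leak into the new environment.
module reset

# Build the new environment on top of the clean defaults (versions illustrative).
module load PrgEnv-gnu
module load cmake/3.27.9   # pinned below 3.28.4 to avoid the Kokkos issue above
```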
Update modules on ALCF Polaris. Also,
- update queues
- use Cray wrappers for serial-gnu
- update CMake for gnugpu builds
- add an EAMxx CMake machine file
- run small EAM and MPAS-O cases on 1 Polaris node
- add the MOAB_ROOT env var (see the sketch below)

Fixes #6422

[BFB]
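A minimal sketch of the MOAB_ROOT addition; the install path is a placeholder, and on a real machine this setting would live in the machine's environment config rather than a loose export:

```bash
# Hypothetical: point the build at a MOAB installation (path is a placeholder).
export MOAB_ROOT=/path/to/moab/install
```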
Also, do not archive old test data.
Re-merge to next to bring in a new commit.
To avoid OOM errors. Also, add the path to the Switch.pm perl5 library.
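A minimal sketch of the Switch.pm fix, assuming the library lives in a site-installed perl5 tree; the directory is a placeholder:

```bash
# Hypothetical: prepend the directory containing Switch.pm so CIME's Perl
# tooling can find it (the path is a placeholder).
export PERL5LIB=/path/to/perl5/lib:${PERL5LIB}
```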
Re-merge to next to bring in another commit.
Adding @bartgol, since this touches some code in EAMxx and so testing needs to run at SNL.
The code in EAMxx is a machine file, so testing on an SNL machine, albeit triggered, is irrelevant. Still, I'll approve and retrigger, so we get more shiny green check marks...
I would like to note a few points regarding this Polaris update.
Likewise, listing other modules in the file with specific versions would ease testing and reproduction.
Related to the Polaris setup, I am getting a runtime error with high node counts (e.g., 32-128 nodes). The problem usually resolves if I re-submit the case, but I also have runs that succeeded on the first attempt. The error message from a recent try is below:
Do you have any suggestions for what may be the root cause of this issue? Unlike Perlmutter, Polaris uses a helper script to assign GPUs to MPI ranks (see https://docs.alcf.anl.gov/running-jobs/example-job-scripts/?h=affinity#setting-gpu-affinity-for-each-mpi-rank), and the error is raised right after the rank assignment. Thanks.
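For context, a sketch of the kind of per-rank GPU affinity wrapper the linked ALCF page describes; the script actually used on Polaris may differ in its details:

```bash
#!/bin/bash -l
# Bind each local MPI rank to one of the node's 4 A100 GPUs, in the style
# of the ALCF example. PMI_LOCAL_RANK is set by Cray MPICH.
num_gpus=4
gpu=$(( num_gpus - 1 - PMI_LOCAL_RANK % num_gpus ))
export CUDA_VISIBLE_DEVICES=${gpu}
echo "RANK=${PMI_RANK} LOCAL_RANK=${PMI_LOCAL_RANK} gpu=${gpu}"
# Replace the wrapper process with the real application command.
exec "$@"
```

The wrapper is passed on the mpiexec command line ahead of the executable, so each rank runs it once before the application starts.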
I do not have any idea what could cause that error, sorry.
@bartgol, do you have any recommendation for debugging that failure further? Maybe there is a better way to capture the state of the system/executable to compare between successful and segfaulting runs. There is a core file output, but I am not sure whether it is useful given the default env/compilation settings.
I don't. But maybe @jgfouca has some CIME trick up his sleeve?
I would at least try a DEBUG build so you can get a readable stack trace.
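A sketch of that workflow using standard CIME commands from the case directory; the executable and core file paths are illustrative:

```bash
# Rebuild the case with debug flags so the stack trace has symbols.
./xmlchange DEBUG=TRUE
./case.build --clean-all
./case.build
./case.submit

# If the run still dumps a core file, print a readable backtrace from it
# (paths are illustrative).
gdb -batch -ex bt /path/to/bld/e3sm.exe /path/to/run/core
```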