ROCm/6.2.4 causes occasional segmentation faults on Frontier during MPI_Init (see OLCFDEV-1655) #7075
Comments
@dqwu, @rljacob, I have noticed that two SCREAM-specific arguments are being used in the Frontier machine configuration (SCREAM_SYSTEM_WORKAROUND, SCREAM_SYSTEM_WORKAROUND_P3_PART2). I am a little concerned that these SCREAM-specific arguments are being used in the machine file, which should be shared by all other models. Is there a way to specify the SCREAM-specific arguments within the EAMxx settings?
SCREAM_SYSTEM_WORKAROUND_P3_PART2 was added by @trey-ornl
@grnydawn They are temporary workarounds which might be removed/disabled later (e.g. with newer versions of ROCm).
Those changes were recommended to me by @ambrad. Instead of keying off of macro names, like these, we could key off of the build environment for Kokkos.
What exactly is the workaround doing? Use that for the root of the name instead of "SCREAM".
It is suggested by OLCFDEV-1655
In driver-mct/main/cime_comp_mod.F90, this macro is required to enable this workaround:
@rljacob FYI, this macro name was suggested by @ambrad in PR E3SM-Project/scream#2918
We are trying to get rid of indiscriminate use of the word "SCREAM" so this would be better named "MPINIT_WORKAROUND"
What should be the new name for "SCREAM_SYSTEM_WORKAROUND_P3_PART2"?
I'd call that CLANGOPT_WORKAROUND.
Agreed. Don't worry about the old naming scheme. In retrospect, it wasn't a good system.
If there's a gh issue, I would recommend adding _ISSUE_XYZ in the macro name. It would greatly help users to quickly track its origin. If no gh issue exists, I would create one, document as much as possible, then go back to my suggestion above.
* set MPINIT_WORKAROUND to "craygnu-hipcc.cmake" and "craygnu-mphipcc.cmake"
* set CLANGOPT_WORKAROUND to "craygnu-hipcc.cmake"
* discard "SCREAM_SYSTEM_WORKAROUND" from "craycray-mphipcc"
[BFB]
Fixes #7075
Please don't put issue number in cpp names. In general, we don't want github things in the source code. Source code is forever while github could go away tomorrow if Microsoft wanted to.
On Frontier, the craygnu-hipcc and craygnu-mphipcc compilers currently use ROCm/6.2.4. However, according to OLCFDEV-1655, this version may cause occasional segmentation faults during MPI_Init:
This issue has been confirmed by some of the latest ne1024 SCREAM decadal runs using craygnu-hipcc and craygnu-mphipcc.
Possible Workarounds
Use an older ROCm version
ROCm/5.5.1, 5.6.0, and 5.7.1 are confirmed to avoid this issue according to OLCFDEV-1655.
However, the newest ROCm versions are preferred, and these older versions are known to cause build errors.
Restore "-DSCREAM_SYSTEM_WORKAROUND=1" for craygnu-hipcc and craygnu-mphipcc
This workaround (suggested by OLCFDEV-1655) was previously applied in scream#2923 ("Frontier: Additional post-maintenance updates") for crayclang-scream_frontier-scream-gpu.cmake and later disabled in scream#2943 ("Frontier: disable hipInit before MPI_Init").
Proposed Fix
Until this issue is fixed in a future ROCm version, reintroduce -DSCREAM_SYSTEM_WORKAROUND=1 for craygnu-hipcc and craygnu-mphipcc in the following files to avoid the segmentation faults:
craygnu-hipcc.cmake
craygnu-mphipcc.cmake
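A minimal sketch of what the change could look like in each of the two files. It assumes these macro files add preprocessor definitions by appending to a CPPDEFS variable; that variable name is an assumption and is not confirmed by this issue. The define itself is the one named above.

```cmake
# Hypothetical fragment for craygnu-hipcc.cmake and craygnu-mphipcc.cmake.
# Assumption: these macro files collect preprocessor defines in CPPDEFS.
# The define re-enables the workaround suggested by OLCFDEV-1655.
string(APPEND CPPDEFS " -DSCREAM_SYSTEM_WORKAROUND=1")
```

If the macro is renamed as discussed in the comments, the same line would presumably carry the new name (for example -DMPINIT_WORKAROUND) instead, and the matching guard in driver-mct/main/cime_comp_mod.F90 would need the same rename.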