Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ROCm/6.2.4 causes occasional segmentation faults on Frontier during MPI_Init (see OLCFDEV-1655) #7075

Open
dqwu opened this issue Mar 3, 2025 · 14 comments · May be fixed by #7082
Open

ROCm/6.2.4 causes occasional segmentation faults on Frontier during MPI_Init (see OLCFDEV-1655) #7075

dqwu opened this issue Mar 3, 2025 · 14 comments · May be fixed by #7082

Comments

@dqwu
Copy link
Contributor

dqwu commented Mar 3, 2025

On Frontier, the craygnu-hipcc and craygnu-mphipcc compilers currently use ROCm/6.2.4. However, according to OLCFDEV-1655, this version may cause occasional segmentation faults during MPI_Init:

Resolution: This issue is resolved in ROCm/5.5.1, 5.6.0, and 5.7.1, but has re-appeared in >= ROCm/6.1.x.

This issue has been confirmed by some latest ne1024 SCREAM decadal runs using craygnu-hipcc and craygnu-mphipcc.

Possible Workarounds

  1. Use an older ROCm version
    ROCm/5.5.1, 5.6.0, and 5.7.1 are confirmed to avoid this issue according to OLCFDEV-1655.
    However, the newest ROCm versions are preferred, and these older versions are known to cause build errors.

  2. Restore "-DSCREAM_SYSTEM_WORKAROUND=1" for craygnu-hipcc and craygnu-mphipcc
    This workaround (suggested by OLCFDEV-1655) was previously applied in Frontier: Additional post-maintenance updates scream#2923 for crayclang-scream_frontier-scream-gpu.cmake and later disabled in Frontier: disable hipInit before MPI_Init scream#2943.

Proposed Fix

Before this issue is fixed in a future ROCm version, reintroduce -DSCREAM_SYSTEM_WORKAROUND=1 for craygnu-hipcc and craygnu-mphipcc in the following files to avoid the segmentation fault issue:
craygnu-hipcc.cmake

-string(APPEND CPPDEFS " -DLINUX -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU -DSCREAM_SYSTEM_WORKAROUND_P3_PART2")
+string(APPEND CPPDEFS " -DLINUX -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU -DSCREAM_SYSTEM_WORKAROUND_P3_PART2 -DSCREAM_SYSTEM_WORKAROUND=1")

craygnu-mphipcc.cmake

-string(APPEND CPPDEFS " -DLINUX -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU")
+string(APPEND CPPDEFS " -DLINUX -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU -DSCREAM_SYSTEM_WORKAROUND=1")
@grnydawn
Copy link
Contributor

grnydawn commented Mar 3, 2025

@dqwu , @rljacob , I have noticed that two SCREAM-specific arguments are being used in the Frontier machine configuration (SCREAM_SYSTEM_WORKAROUND, SCREAM_SYSTEM_WORKAROUND_P3_PART2). I am a little concerned that these SCREAM-specific arguments are being used in the machine file, which should be shared by all other models. Is there a way to specify the SCREAM-specific arguments within the EAMxx settings?

@dqwu
Copy link
Contributor Author

dqwu commented Mar 3, 2025

SCREAM_SYSTEM_WORKAROUND_P3_PART2

SCREAM_SYSTEM_WORKAROUND_P3_PART2 was added by @trey-ornl

@dqwu
Copy link
Contributor Author

dqwu commented Mar 3, 2025

@grnydawn They are temporary workarounds which might be removed/disabled later (e.g. with newer versions of Rocm).

@trey-ornl
Copy link
Contributor

SCREAM_SYSTEM_WORKAROUND_P3_PART2

SCREAM_SYSTEM_WORKAROUND_P3_PART2 was added by @trey-ornl

Those changes were recommended to me by @ambrad.
E3SM-Project/scream#2969 (comment)

Instead of keying off of macro names, like these, we could key off of the build environment for Kokkos.

#ifdef KOKKOS_ENABLE_HIP

@rljacob
Copy link
Member

rljacob commented Mar 3, 2025

What exactly is the workaround doing? Use that for the root of the name instead of "SCREAM".

@dqwu
Copy link
Contributor Author

dqwu commented Mar 3, 2025

What exactly is the workaround doing? Use that for the root of the name instead of "SCREAM".

It is suggested by OLCFDEV-1655

Workarounds:
Run hipInit(0); at any point prior to MPI_Init(&argc, &argv);. You may need to add #include <hip/hip_runtime_api.h>.

In driver-mct/main/cime_comp_mod.F90, this macro is required to enable this workaround:

if defined(SCREAM_SYSTEM_WORKAROUND) && (SCREAM_SYSTEM_WORKAROUND == 1)
    call atm_init_hip_mct()
#endif
    call mpi_init(ierr)

@dqwu
Copy link
Contributor Author

dqwu commented Mar 3, 2025

@rljacob FYI, this macro name was suggested by @ambrad in PR E3SM-Project/scream#2918

What do you think of adding a more generic flag that we can use in all future instances for this same purpose? I'm thinking something like SCREAM_SYSTEM_WORKAROUND. To make this approach maximally flexible (example in a moment), we'd want to permit setting integer values. Then any place in general E3SM code that requires a workaround would have something like

#if !defined SCREAM_SYSTEM_WORKAROUND || SCREAM_SYSTEM_WORKAROUND != 1
... this code should run in general ...
#endif

Once that flag is available, we would be able to use it as needed without messing with the build system again.

The purpose for the integer complication is as follows. In the future one can imagine needing separate workarounds for Aurora and Frontier. We might then say that Frontier is assigned 1 and Aurora is assigned 2.

@rljacob
Copy link
Member

rljacob commented Mar 3, 2025

We are trying to get rid of indiscriminate use of the word "SCREAM" so this would be better named "MPINIT_WORKAROUND"

@dqwu
Copy link
Contributor Author

dqwu commented Mar 3, 2025

We are trying to get rid of indiscriminate use of the word "SCREAM" so this would be better named "MPINIT_WORKAROUND"

What should be the new name for "SCREAM_SYSTEM_WORKAROUND_P3_PART2"?

@rljacob
Copy link
Member

rljacob commented Mar 3, 2025

I'd call that CLANGOPT_WORKAROUND.

@ambrad
Copy link
Member

ambrad commented Mar 4, 2025

Agreed. Don't worry about the old naming scheme. In retrospect, it wasn't a good system.

@bartgol
Copy link
Contributor

bartgol commented Mar 4, 2025

If there's a gh issue, I would recommend adding _ISSUE_XYZ in the macro name. It would greatly help users to quickly track its origin.

If no gh issue exist, I would create one, document as much as possible, then go back to my suggestion above.

grnydawn added a commit that referenced this issue Mar 4, 2025
* set MPINIT_WORKAROUND to "craygnu-hipcc.cmake" and "craygnu-mphipcc.cmake"
* set CLANGOPT_WORKAROUND to "craygnu-hipcc.cmake"
* discard "SCREAM_SYSTEM_WORKAROUND from "craycray-mphipcc"

[BFB]
Fixes #7075
@rljacob
Copy link
Member

rljacob commented Mar 4, 2025

Please don't put issue number in cpp names. In general, we don't want github things in the source code. Source code is forever while github could go away tomorrow if Microsoft wanted to.

grnydawn added a commit that referenced this issue Mar 4, 2025
* set MPINIT_WORKAROUND to "craygnu-hipcc.cmake" and "craygnu-mphipcc.cmake"
* set CLANGOPT_WORKAROUND to "craygnu-hipcc.cmake"
* discard "SCREAM_SYSTEM_WORKAROUND from "craycray-mphipcc"

[BFB]
Fixes #7075
@grnydawn grnydawn linked a pull request Mar 4, 2025 that will close this issue
@grnydawn
Copy link
Contributor

grnydawn commented Mar 4, 2025

I created PR #7082 to resolve this issue by applying the renaming suggestions. @dqwu, please check if this PR fixes the issue you encountered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants