
March 31, 2020 - Jules Kouatchou #27

Open
JulesKouatchou wants to merge 7 commits into main

Conversation

JulesKouatchou
Contributor

ctm_setup:

  • New setting of environment variables (SETENV.commands) for the use of MPT.

ctm_run.j:

  • Moved the @SETENVS tag so that it is expanded before the MPI run command is issued.
  • Checked the exit status of the executable through the existence of the EGRESS file (see the sketch below).
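A minimal sketch of the EGRESS check (the echo text and the use of the run directory are illustrative; the actual ctm_run.j logic may differ):

    # The executable touches an EGRESS file in the run directory when it finishes cleanly,
    # so its absence after the MPI command returns is treated as a failed run.
    set rc = 0
    if ( ! -e EGRESS ) then
       echo "GEOSctm did not exit gracefully: no EGRESS file found"
       set rc = 1
    endif
    exit $rc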

@mmanyin
Contributor

mmanyin commented Mar 31, 2020

@mathomp4 There was a build error: "No configuration was found in your project. Please refer to https://circleci.com/docs/2.0/ to get started with your configuration." Is the CI stuff working properly for CTM? Thanks

@mmanyin
Contributor

mmanyin commented Mar 31, 2020

@JulesKouatchou Do I understand correctly, you have changed the run-time environment variables to be appropriate for using MPT? How do we compile w/ MPT? I thought that the default was to compile with Intel-MPI.

@mathomp4
Member

> @mathomp4 There was a build error: "No configuration was found in your project. Please refer to https://circleci.com/docs/2.0/ to get started with your configuration." Is the CI stuff working properly for CTM? Thanks

@mmanyin Until #23 is merged in, there is no way for CircleCI to find a configuration since it only exists on a branch, not master. I had set up CircleCI to follow GEOSctm thinking the config file would get it. Since it might be a while, would you like me to turn off CircleCI following GEOSctm?

@mmanyin
Contributor

mmanyin commented Apr 1, 2020

> > @mathomp4 There was a build error: "No configuration was found in your project. Please refer to https://circleci.com/docs/2.0/ to get started with your configuration." Is the CI stuff working properly for CTM? Thanks
>
> @mmanyin Until #23 is merged in, there is no way for CircleCI to find a configuration since it only exists on a branch, not master. I had set up CircleCI to follow GEOSctm thinking the config file would get it. Since it might be a while, would you like me to turn off CircleCI following GEOSctm?

Actually I will go ahead with #23 . Sorry for the confusion!

@mathomp4
Member

mathomp4 commented Apr 3, 2020

Well, that was unexpected.

@JulesKouatchou When you have a chance can you do a fresh clone of GEOSctm and then a fresh checkout of your branch, and then try running with MPT?

I just did a "resolve conflict" for your branch (so it could merge in) and, weirdly, Git seems to say now that the ctm_setup now isn't "new". I mean, it seems to have all the right bits for MAPL 2 on MPT, but...weird.

On the plus side, @mmanyin, it looks like that "resolve conflict" is letting CircleCI run!

@JulesKouatchou
Contributor Author

@mathomp4 I will and let you know.

@JulesKouatchou
Contributor Author

@mathomp4 When I do:

    git clone [email protected]:GEOS-ESM/GEOSctm.git
    cd GEOSctm/
    git checkout -b jkGEOSctm_on_SLESS12
    checkout_externals
    source @env/g5_modules

Intel MPI gets loaded. I need MPT.

@mathomp4
Member

mathomp4 commented Apr 3, 2020

> @mathomp4 When I do:
>
>     git clone [email protected]:GEOS-ESM/GEOSctm.git
>     cd GEOSctm/
>     git checkout -b jkGEOSctm_on_SLESS12
>     checkout_externals
>     source @env/g5_modules
>
> Intel MPI gets loaded. I need MPT.

Jules,

You'll need to:

cp /gpfsm/dhome/mathomp4/GitG5Modules/SLES12/6.0.4/g5_modules.intel1805.mpt217 @env/g5_modules

to get MPT as the MPI stack.

@JulesKouatchou
Contributor Author

@mathomp4 Here are my steps:

    git clone [email protected]:GEOS-ESM/GEOSctm.git
    cd GEOSctm
    git checkout jkGEOSctm_on_SLESS12
    checkout_externals
    cp /gpfsm/dhome/mathomp4/GitG5Modules/SLES12/6.0.4/g5_modules.intel1805.mpt217 @env/g5_modules
    source @env/g5_modules

Things appear to be fine. I am currently doing a long run to make sure that the code does not crash.

Thanks.

@mathomp4
Member

mathomp4 commented Apr 3, 2020

Sounds good! If all works, you can set the appropriate "required label". I'm guessing 0-diff is good enough since your changes can't change results, right?

@JulesKouatchou
Contributor Author

@mathomp4 This is the first step. I want the code to be able to compile and run. Ideally, I want the same code to compile and run on the SLES11 nodes too (though they will disappear soon). I will then be able to do the comparison.

@JulesKouatchou
Contributor Author

@mathomp4 My long run did not have any issues.
You asked me to copy the file g5_modules.intel1805.mpt217. Is it possible to make it part of the repository? I want the MPT module to be the default for the CTM.

@mathomp4
Member

mathomp4 commented Apr 4, 2020

Jules,

We can do that for sure, but then, when the hundreds of Skylake nodes go online for general users, CTM users will not be able to use them (MPT will not run there). Intel MPI allows users to use every node at NCCS.

Before we do that, ctm_run.j should be altered so that if anyone ever tries to run on the Skylakes at NCCS with MPT, the CTM immediately errors out with a non-zero status code, and perhaps a note saying what's happening so that the user doesn't try to contact NCCS or the SI Team. The job will crash anyway, but I think it would be an obscure-looking loader error.
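Something along these lines could serve as that guard; detecting Skylake via the CPU model string and MPT via the modules list are assumptions here, not the final implementation:

    # Hypothetical guard for ctm_run.j: abort early if MPT is loaded on a Skylake node.
    # Matching "6148" (assumed to identify the NCCS Skylake nodes) and checking
    # $LOADEDMODULES for an mpt module are both illustrative tests.
    set cpu = `grep -m1 'model name' /proc/cpuinfo`
    if ( "$cpu" =~ *6148* || "$cpu" =~ *Skylake* ) then
       if ( $?LOADEDMODULES ) then
          if ( "$LOADEDMODULES" =~ *mpt* ) then
             echo "ERROR: MPT is not supported on the Skylake nodes; build and run with Intel MPI."
             exit 1
          endif
       endif
    endif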

@JulesKouatchou
Contributor Author

@mathomp4 Sorry that I am only coming back to this now. I am wondering if there could be (for now) a flag that sets MPT as the first option and Intel MPI as the second option. I am willing to modify the ctm_run.j file if I know what options are available in g5_modules.

@mathomp4
Member

mathomp4 commented Apr 8, 2020

> @mathomp4 Sorry that I am only coming back to this now. I am wondering if there could be (for now) a flag that sets MPT as the first option and Intel MPI as the second option. I am willing to modify the ctm_run.j file if I know what options are available in g5_modules.

@JulesKouatchou I don't think so, not as long as GEOS uses g5_modules. The issue is that it is a script that is run and a file that is sourced. This severely limits its flexibility because you can break it very easily (for example, you can not do source g5_modules -option).

If you require MPT, I can create a special branch/tag of ESMA_env for you.

You should also contact NCCS and let them know that Intel MPI does not work for your code. They will be interested in this and would probably want to contact Intel regarding the fault.

@mmanyin
Contributor

mmanyin commented Apr 8, 2020

> > @mathomp4 Sorry that I am only coming back to this now. I am wondering if there could be (for now) a flag that sets MPT as the first option and Intel MPI as the second option. I am willing to modify the ctm_run.j file if I know what options are available in g5_modules.
>
> @JulesKouatchou I don't think so, not as long as GEOS uses g5_modules. The issue is that it is a script that is run and a file that is sourced. This severely limits its flexibility because you can break it very easily (for example, you can not do source g5_modules -option).
>
> If you require MPT, I can create a special branch/tag of ESMA_env for you.
>
> You should also contact NCCS and let them know that Intel MPI does not work for your code. They will be interested in this and would probably want to contact Intel regarding the fault.

I have seen Intel MPI crash during Finalize, when running the GCM under SLES12. @JulesKouatchou please CC me when you contact NCCS about this problem; I will open a case as well, and CC you.

@JulesKouatchou
Contributor Author

@mathomp4 @mmanyin I have tried to build the simplest test case possible (using Intel MPI on SLES12 nodes) where the code does not exit gracefully. So far I have not reproduced the problem with a pure MPI program or an ESMF program. I now want to try a code that uses MAPL.

@mathomp4
Member

@JulesKouatchou We might have a workaround for the MPI_Finalize issue. I found an MPI command which essentially "turns off error output" and @bena-nasa seemed to be able to show it helped.

We are looking at adding it into MAPL with some good protections so we don't turn off all MPI errors.

@JulesKouatchou
Contributor Author

@mathomp4 Great! Let me know when the workaround is ready so that I can test it.

@mathomp4
Member

> @mathomp4 Great! Let me know when the workaround is ready so that I can test it.

Jules, try out MAPL v2.0.6 (aka git checkout v2.0.6 in MAPL).

Note, you're behind on a lot of things in the CTM's mepo/externals bits, but v2.0.0 and v2.0.6 are still similar.

@JulesKouatchou
Contributor Author

@mathomp4 Here is a summary of what happened when I used MAPL v2.0.6.

  • I used the modules comp/intel/18.0.5.274 and mpi/impi/19.1.0.166.
  • GEOS CTM exited gracefully during short runs (a few days).
  • GEOS CTM crashed abruptly after about 15 days of integration. The error message is:

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 25 PID 3877 RUNNING AT borgo007
= KILLED BY SIGNAL: 9 (Killed)

It seems that MPT might be the option (for now) for the CTM.

@mathomp4
Member

@JulesKouatchou Well that's annoying. Can you point me to the output so I can look at the errors?

Also, if you can, can you try one more test? It would be interesting to see if MAPL 2.1 helps at all. Plus you can be the first to try the CTM with it.

For that, you'll want to clone a new CTM somewhere rather than re-use the current one. Then after cloning and doing the mepo/checkout_externals update:

@env to v3.0.0
@mapl to v2.1.0
@cmake to v2.1.0

@JulesKouatchou
Contributor Author

@mathomp4 Will do and let you know.

@JulesKouatchou
Contributor Author

@mathomp4 I could not checkout @env v3.0.0:

error: pathspec 'v3.0.0' did not match any file(s) known to git

I am currently in v2.0.2.

@mathomp4
Member

Sigh. I’m an idiot. @env is v2.1.0 and @cmake is v3.0.0.

Sorry about that. MAPL is, of course, v2.1.0

@JulesKouatchou
Contributor Author

@mathomp4 Here is another issue:

-- Found MKL: /usr/local/intel/2018/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64/libmkl_intel_lp64.so;/usr/local/intel/2018/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64/libmkl_sequential.so;/usr/local/intel/2018/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64/libmkl_core.so;-pthread
-- Found Python: /usr/bin/python3.4 (found version "3.4.6") found components: Interpreter
-- [GEOSctm] (1.0) [f278c74]
-- [MAPL] (2.1.0) [e23f20a]
-- Found Perl: /usr/bin/perl (found version "5.18.2")
CMake Error at src/Shared/@MAPL/GMAO_pFIO/tests/CMakeLists.txt:68 (string):
  string sub-command REPLACE requires at least four arguments.
-- Found PythonInterp: /usr/local/other/python/GEOSpyD/2019.10_py2.7/2020-01-15/bin/python (found version "2.7.16")
-- Configuring incomplete, errors occurred!
See also "/discover/nobackup/jkouatch/GEOS_CTM/GitRepos/MAPL2.1/GEOSctm/build/CMakeFiles/CMakeOutput.log".
See also "/discover/nobackup/jkouatch/GEOS_CTM/GitRepos/MAPL2.1/GEOSctm/build/CMakeFiles/CMakeError.log".

@mathomp4
Member

@JulesKouatchou I think your @cmake is at v2.1.0. That's the one that needs to be at v3.0.0.

@JulesKouatchou
Contributor Author

@mathomp4 I used the following:

@env v2.1.0
@cmake v3.0.0
@mapl v2.1.0

and got the same error message after about 15 days of integration.

My code is at:
/gpfsm/dnb32/jkouatch/GEOS_CTM/GitRepos/MAPL2.1/GEOSctm

and my experiment directory at:
/gpfsm/dnb32/jkouatch/GEOS_CTM/GitRepos/MAPL2.1/IdealPT

@mathomp4
Member

@JulesKouatchou Actually, I forgot you were splitting errors. The real error was in the .e file.

I might have a different thing for you to try. You seem to have hit an error that others sometimes see on the Haswells. Intel provided some other advice:

> Please try to tune the maximal virtual size of the "shm-heap" via I_MPI_SHM_HEAP_VSIZE (https://software.intel.com/en-us/mpi-developer-reference-linux-other-environment-variables).
>
> For example, try setting I_MPI_SHM_HEAP_VSIZE=4096 (this sets 4096 MB per rank for the virtual size of the "shm-heap"). If it works fine, please try decreasing the size, for example to I_MPI_SHM_HEAP_VSIZE=2048, and so on (1024, 512, 256, ...).
>
> Please find and tell us the minimum size of I_MPI_SHM_HEAP_VSIZE at which the program works fine. We can increase the default value of I_MPI_SHM_HEAP_VSIZE in a future Intel MPI release.

@mathomp4
Member

Note, if you don't have time to run these tests, let me know and I can work with Ben or someone, and we can quickly try them all out.

@JulesKouatchou
Contributor Author

@mathomp4 I will run the tests and let you know.

@mathomp4
Member

Thanks. Note I found a bug with MAPL and MPT today so even moving to MPT might take a fix. Go me!

@JulesKouatchou
Contributor Author

@mathomp4 I am conducting one 4-month run (I_MPI_SHM_HEAP_VSIZE=4096). So far it is at the end of the first month and still going. That is great news, as I was not able to get past 15 days of integration before.

@mathomp4
Member

> @mathomp4 I am conducting one 4-month run (I_MPI_SHM_HEAP_VSIZE=4096). So far it is at the end of the first month and still going. That is great news, as I was not able to get past 15 days of integration before.

Good to hear!

As Intel said, could you try lowering that in halves? The larger it is, the more memory Intel MPI reserves per process, so we want the smallest value that works for you.

@JulesKouatchou
Contributor Author

@mathomp4 So far the settings of I_MPI_SHM_HEAP_VSIZE with 4096, 2048, 1024, 512 and 256 are working. I will soon start testing with 128.

@mathomp4
Member

@JulesKouatchou Thanks for doing this. Now my fear is that it'll work even with I_MPI_SHM_HEAP_VSIZE=1, which would point to something a bit more fundamental.

But you've already lowered it a lot, which is nice.

@JulesKouatchou
Contributor Author

@mathomp4 Unfortunately, the lowest setting might be I_MPI_SHM_HEAP_VSIZE=512. The run with 256 crashed (same error message as before) after 2 months and 27 days of integration.

@mathomp4
Member

> @mathomp4 Unfortunately, the lowest setting might be I_MPI_SHM_HEAP_VSIZE=512. The run with 256 crashed (same error message as before) after 2 months and 27 days of integration.

Still, that is good to know. I'll pass it on to Scott to test and to Intel.

@mathomp4
Member

I suppose you could integrate that into ctm_setup or run or wherever. That way it's on by default for you. I might do the same in GCM.

- Modified the ctm_run.j file to allow the transition from the 2010-2019 decade into 2020-2029 when dealing with MERRA2 forcing data (a sketch of the idea follows).
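A minimal sketch of that idea, assuming the run script carries the current simulation date in a yyyymmdd variable (here called nymd; both the variable name and the decade token are illustrative, not the actual diff):

    # Derive the MERRA-2 decade token from the simulation year so that the forcing
    # templates resolve for 2020-2029 as well as 2010-2019.
    set yyyy = `echo $nymd | cut -c1-4`
    if ( $yyyy <= 2019 ) then
       set M2_DECADE = "2010-2019"
    else
       set M2_DECADE = "2020-2029"
    endif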

**Executable Exit Status**

-  Modified the ctm_run.j file to check the exit status of the code through the existence of the EGRESS file.

**Intel MPI Environment Setting**

- Modified the ctm_setup file to set I_MPI_SHM_HEAP_VSIZE to 512 when Intel MPI is used; it is then picked up by ctm_run.j (see the sketch after this list).
- The environment variable is required to prevent the code from crashing.
- The value of 512 might be increased and/or other environment variables might be set later.
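A minimal sketch of the intended setting, assuming the presence of I_MPI_ROOT is a reasonable way to tell that Intel MPI is the loaded stack (the actual ctm_setup/@SETENVS logic may test this differently):

    # Only set the shared-memory heap size when Intel MPI is in use.
    # 512 MB per rank was the smallest value found to keep long CTM runs from crashing;
    # it may need to be raised, and other I_MPI_* variables may be added later.
    if ( $?I_MPI_ROOT ) then
       setenv I_MPI_SHM_HEAP_VSIZE 512
    endif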

**Convection Refactoring**

Refactored the Convection component:

- RAS calculations are now done in the CTM CC, which now provides convective mass fluxes (read in or calculated) to any component that needs them.
- AK, BK, LONS and LATS are obtained in the CTM CC to carry out the RAS calculations.
- Convection is always turned on regardless of the Chemistry configuration. However, if no tracer is FRIENDLY to MOIST, then Convection will be automatically turned off.
- Removed the files CTM_ConvectionStubCompMod.F90 and GenericConvectionMethod_mod.F90 that are no longer needed.
- The file GmiConvectionMethod_mod.F90 will remain until we figure out how to handle GMI convective updrafts (feeding the calculation back to GMI Deposition).
- For now, Convection only does convective transport, for any Chemistry configuration.

**Refactoring of CTM CC**

- Introduced an option (flag read_advCoreFields) to import CX/CY/MX/MY instead of computing them. They are then passed to AdvCore.
- Removed the import of PLE.
- Removed references to SGH.
- Changed the settings so that we read PS and compute PLE from it, using AK and BK (see the note after this list).
- Renamed the output fields and made sure that they are the same as the ones used for the calculations.
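For reference, the relationship being used is the standard hybrid-coordinate definition of the edge pressures (stated here for clarity, not quoted from the PR): PLE(k) = AK(k) + BK(k) * PS, with PS the surface pressure and AK/BK the hybrid coefficients.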

**MERRA2 Template File**

- Removed references to the variables SGH, PLE, DELP.
- Added PS0 and PS1 (to compute PLE0 and PLE1 inside the code).
- PS now comes from files with instantaneous values (not time-averaged ones).
- Changed the climatology file (the old one did not have proper records).
@JulesKouatchou JulesKouatchou requested a review from a team as a code owner April 27, 2020 14:29
@tclune
Collaborator

tclune previously approved these changes Apr 27, 2020


CMake much improved.

@JulesKouatchou
Contributor Author

@mathomp4 I included the I_MPI_SHM_HEAP_VSIZE=512 setting on my CTM branch jkGEOSctm_on_SLESS12. I did several "long" tests with Intel MPI to confirm that the code no longer crashes and exits gracefully.

@mathomp4
Member

> @mathomp4 I included the I_MPI_SHM_HEAP_VSIZE=512 setting on my CTM branch jkGEOSctm_on_SLESS12. I did several "long" tests with Intel MPI to confirm that the code no longer crashes and exits gracefully.

@JulesKouatchou Thanks for moving that @SETENVS as it was in the wrong place. If you can, you might want to add two more that the GCM is now running with by default:

setenv I_MPI_ADJUST_ALLREDUCE 12
setenv I_MPI_ADJUST_GATHERV 3

I think these are more important on the Skylakes, but GCM will be running with them for Intel MPI everywhere. The first fixes an issue at high-resolution for Bill, so you might never see it in a CTM, but the second one fixes an issue Ben was able to trigger at C180 at 8x48 which isn't that enormous.

I know the GCM (for all our testing) is zero-diff with them. I have to imagine the CTM would be as well, but I don't know how to test.

But that can also be a second PR, if you like, that I can make after you get this in?

@JulesKouatchou
Contributor Author

@mathomp4 Thank you for the new settings. I want to have something that works on SLES12 first before doing internal CTM tests.

@weiyuan-jiang

weiyuan-jiang commented Apr 27, 2020 via email

@mathomp4
Member

@weiyuan-jiang I think the I_MPI_SHM_HEAP_VSIZE variable helps with the "unexpected failures" in the runs. The MPI_Finalize issues should be taken care of with newer MAPL with the workaround we did in MAPL_Cap.

Note that Scott is currently testing the GCM with I_MPI_SHM_HEAP_VSIZE. For him it's looking like anything other than zero is what's needed, but we might go with I_MPI_SHM_HEAP_VSIZE=512 since @JulesKouatchou found actual proof it's a useful number.

I've asked NCCS about their thoughts on it (note: this value is probably only needed on Haswell, so I'll probably code up the GCM's scripts to apply it only if Intel MPI + Haswell).

@mathomp4
Member

Also, Bill Putman has, I think, four other variables he uses for his nightly runs. I think three of them might be considered "generally useful", but I'm waiting for NCCS to respond before I add them to the GCM. If they are, I'll pass them along here as well.

    - Introduced the flag do_ctmAdvection that is by default set to true.
      When it is set to false, the Advection run method is not called.
    - Introduced an Internal State in the CTM parent gridded components.
      All the flags and other variables that were local module variables are
      now part of the internal state. This makes the code thread safe.
  - do_ctmAdvection is set to TRUE by default, even before it is read in.
  - Added the calls to A2D2C that were not initially captured.
  - Added a printout of the settings during initialization.
  - Reorganized the section where the Courant numbers and mass fluxes
    are computed so that variables are allocated only when necessary.
@mmanyin
Contributor

mmanyin commented May 27, 2020

@JulesKouatchou Could you please update the components.yaml and Externals.cfg to reflect the versions of the repos that you are satisfied with? (See your comment from April 18 above.) Also, do we still need to use MPT to prevent crashing?

@JulesKouatchou
Contributor Author

@mmanyin The last experiments that I did were about two weeks ago. I did several long runs and noticed that the code crashed after about 165 days of integration (in one job segment), even after increasing the value of I_MPI_SHM_HEAP_VSIZE. @mathomp4 mentioned that Bill is using other settings that we need to include too.
On another matter, the code is still not exiting gracefully when I use Intel MPI.

Do you want me to add the versions below as defaults for the CTM?

@env v2.1.0
@cmake v3.0.0
@mapl v2.1.0

@mathomp4
Member

> @mmanyin The last experiments that I did were about two weeks ago. I did several long runs and noticed that the code crashed after about 165 days of integration (in one job segment), even after increasing the value of I_MPI_SHM_HEAP_VSIZE. @mathomp4 mentioned that Bill is using other settings that we need to include too.
> On another matter, the code is still not exiting gracefully when I use Intel MPI.
>
> Do you want me to add the versions below as defaults for the CTM?
>
> @env v2.1.0
> @cmake v3.0.0
> @mapl v2.1.0

I think the lack of a graceful exit is probably due to not having a new enough MAPL. We think we fixed that in 2.1.3. The GCM is currently using (in master, not yet in a release):

  • ESMA_env v2.1.5
  • ESMA_cmake v3.0.3
  • MAPL v2.1.4

The other Bill flags probably won't help much. He has some that I think only affect high-res runs. The important ones are, we think, I_MPI_ADJUST_ALLREDUCE, I_MPI_ADJUST_GATHERV, and I_MPI_SHM_HEAP_VSIZE.

@JulesKouatchou
Contributor Author

@mathomp4 and @mmanyin I used:

  • ESMA_env v2.1.5
  • ESMA_cmake v3.0.3
  • MAPL v2.1.4

and also the settings I_MPI_ADJUST_ALLREDUCE, I_MPI_ADJUST_GATHERV, and I_MPI_SHM_HEAP_VSIZE (512, 1024, 2048). The code exited gracefully but still crashed at the same integration date regardless of the value of I_MPI_SHM_HEAP_VSIZE.

@mathomp4 mathomp4 changed the base branch from master to main June 22, 2020 18:18