March 31, 2020 - Jules Kouatchou #27
base: main
Conversation
ctm_setup:
- New setting of environment variables (SETENV.commands) for the use of MPT.

ctm_run.j:
- The @SETENVS tag was placed before the MPI run command is issued.
- Checked the exit status of the executable using the EGRESS file.
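As an illustration of the second ctm_run.j change, here is a minimal csh sketch of an EGRESS-based exit-status check, assuming the executable writes an EGRESS file into the run directory only when it finishes cleanly. $EXPDIR and the messages are placeholders, not the actual ctm_run.j code.

```csh
# Minimal sketch, not the actual ctm_run.j logic: report success only if the
# executable left an EGRESS file behind, otherwise exit with a non-zero status.
set rc = 0
if ( -e $EXPDIR/EGRESS ) then
   echo "GEOSctm run finished successfully"
else
   echo "GEOSctm run FAILED (no EGRESS file found)"
   set rc = 1
endif
exit $rc
```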
@mathomp4 There was a build error: "No configuration was found in your project. Please refer to https://circleci.com/docs/2.0/ to get started with your configuration." Is the CI stuff working properly for CTM? Thanks.
@JulesKouatchou Do I understand correctly that you have changed the run-time environment variables to be appropriate for using MPT? How do we compile with MPT? I thought the default was to compile with Intel MPI.
@mmanyin Until #23 is merged in, there is no way for CircleCI to find a configuration, since it only exists on a branch, not master. I had set up CircleCI to follow GEOSctm, thinking it would get the config file. Since it might be a while, would you like me to turn off CircleCI following GEOSctm?
Actually I will go ahead with #23. Sorry for the confusion!
Well, that was unexpected. @JulesKouatchou When you have a chance, can you do a fresh clone of GEOSctm, then a fresh checkout of your branch, and then try running with MPT? I just did a "resolve conflict" for your branch (so it could merge in) and, weirdly, Git now seems to say that …

On the plus side, @mmanyin, it looks like that "resolve conflict" is letting CircleCI run!
@mathomp4 I will and let you know.
@mathomp4 When I do:
Intel MPI gets loaded. I need MPT.
Jules, You'll need to:
to get MPT as an MPI stack.
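The actual commands did not survive in this thread. As a purely hypothetical illustration of the usual pattern (source g5_modules, then swap the MPI module so the build links against MPT), where the paths and module names are assumptions:

```csh
# Hypothetical illustration only -- the concrete steps are missing above.
cd GEOSctm                 # clone location is an assumption
source ./@env/g5_modules   # g5_modules location is an assumption
module unload mpi/impi     # module names vary by system and are assumptions
module load mpi/hpe-mpt    # then re-run cmake/make so the build links MPT
```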
@mathomp4 Here are my steps:
cd GEOSctm
…

Things appear to be fine. I am currently doing a long run to make sure that the code does not crash. Thanks.
Sounds good! If all works, you can set the appropriate "required label". I'm guessing …
@mathomp4 This is the first step. I want the code to be able to compile and run. Ideally, I want the same code to compile and run on SLES11 nodes too (though they will disappear soon). I will then be able to do the comparison.
@mathomp4 My long run did not have any issues.
Jules, we can do that for sure, but then when the hundreds of Skylake nodes go online for the general user, they will not be able to use them. Intel MPI allows users to use every node at NCCS. Before we issue that, ctm_run.j should be altered so that if anyone ever tries to run on the Skylakes at NCCS with MPT, the CTM immediately errors out with a non-zero status code, and maybe a note saying what's happening so that the user doesn't try to contact NCCS or the SI Team. I mean, the job will crash anyway, but it will be an obscure-looking loader error, I think.
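A rough csh sketch of the guard described here. How the MPI stack and node type would be detected is not specified in the thread, so $MPI_STACK and $NODE_TYPE are placeholder variables assumed to be set earlier (e.g. by ctm_setup or the batch system):

```csh
# Hypothetical guard for ctm_run.j: refuse to run an MPT build on Skylake nodes.
# $MPI_STACK and $NODE_TYPE are placeholders, not actual GEOSctm variables.
if ( "$MPI_STACK" == "MPT" && "$NODE_TYPE" == "sky" ) then
   echo "ERROR: this MPT build of GEOSctm cannot run on the NCCS Skylake nodes."
   echo "       Please resubmit on Haswell nodes (no need to contact NCCS or the SI Team)."
   exit 1
endif
```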
@mathomp4 Sorry that I am coming back to this only now. I am wondering if there could be (for now) a flag that sets MPT as the first option and Intel MPI as the second option. I am willing to modify the ctm_run.j file if I know what options are available in g5_modules.
@JulesKouatchou I don't think so, not as long as GEOS uses …

If you require MPT, I can create a special branch/tag of ESMA_env for you. You should also contact NCCS and let them know that Intel MPI does not work for your code. They will be interested in this and would probably want to contact Intel regarding the fault.
I have seen Intel MPI crash during Finalize, when running the GCM under SLES12. @JulesKouatchou please CC me when you contact NCCS about this problem; I will open a case as well, and CC you.
@JulesKouatchou We might have a workaround for the MPI_Finalize issue. I found an MPI command which essentially "turns off error output" and @bena-nasa seemed to be able to show it helped. We are looking at adding it into MAPL with some good protections so we don't turn off all MPI errors.
@mathomp4 Great! Let me know when the workaround is ready so that I can test it.
Jules, try out MAPL v2.0.6 (aka …). Note, you're behind on a lot of things in CTM in its mepo/externals bits, but v2.0.0 and v2.0.6 are still similar.
@mathomp4 Here is a summary of what happened when I used MAPL v2.0.6.
@JulesKouatchou Well, that's annoying. Can you point me to the output so I can look at the errors? Also, if you can, can you try one more test? It would be interesting to see if MAPL 2.1 helps at all. Plus, you can be the first to try the CTM with it. For that, you'll want to clone a new CTM somewhere rather than re-use the current one. Then, after cloning and doing the mepo/checkout_externals update:
@mathomp4 Will do and let you know.
@mathomp4 Here is another issue:

-- Found MKL: /usr/local/intel/2018/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64/libmkl_intel
-- [GEOSctm] (1.0) [f278c74]
-- [MAPL] (2.1.0) [e23f20a]
-- Found PythonInterp: /usr/local/other/python/GEOSpyD/2019.10_py2.7/2020-01-15/bin/python (found version
@JulesKouatchou I think your @cmake is at v2.1.0. That's the one that needs to be at v3.0.0.
@JulesKouatchou Actually, I forgot you were splitting errors. The real error was in the .e file. I might have a different thing for you to try. You seem to have hit an error others sometimes hit on the Haswells. Intel provided some other advice:
Note, if you don't have time to run these tests, let me know and I can work with Ben or someone, and we can quickly try them all out.
@mathomp4 I will run the tests and let you know.
Thanks. Note I found a bug with MAPL and MPT today so even moving to MPT might take a fix. Go me!
@mathomp4 I am conducting one 4-month run (I_MPI_SHM_HEAP_VSIZE=4096). So far it is at the end of the first month and still going. That is great news, as I was not able to get past 15 days of integration before.
Good to hear! As Intel said, can you try lowering that in halves? The larger that is, the more memory Intel MPI reserves per process, so we want the smallest value that works for you.
@mathomp4 So far, the settings of I_MPI_SHM_HEAP_VSIZE with 4096, 2048, 1024, 512 and 256 are all working. I will soon start testing with 128.
@JulesKouatchou Thanks for doing this. Now my fear is that it'll work with …

But you've already lowered it a lot, which is nice.
@mathomp4 Unfortunately, the lowest setting might be I_MPI_SHM_HEAP_VSIZE=512. The run with 256 crashed (same error message as before) after 2 months and 27 days of integration.
Still, that is good to know. I'll pass it on to Scott to test and to Intel.
I suppose you could integrate that into …
- Modified the ctm_run.j file to allow the transition from 2010-2019 into 2020-2029 when dealing with MERRA2 forcing data.

**Executable Exit Status**
- Modified the ctm_run.j file to check the exit status of the code through the existence of the EGRESS file.

**Intel MPI Environment Setting**
- Modified the ctm_setup file to set I_MPI_SHM_HEAP_VSIZE to 512 when Intel MPI is used.
- It will be used in ctm_run.j.
- The environment variable is required to prevent the code from crashing.
- The value of 512 might be increased and/or other environment variables might be set.

**Convection Refactoring**
Refactored the Convection component:
- RAS calculations are now done in the CTM CC, which now provides convective mass fluxes (read in or calculated) to any component that needs them.
- AK, BK, LONS and LATS are obtained in the CTM CC to carry out RAS calculations.
- Convection is always turned on regardless of the Chemistry configuration. However, if no tracer is FRIENDLY to MOIST, then Convection will be automatically turned off.
- Removed the files CTM_ConvectionStubCompMod.F90 and GenericConvectionMethod_mod.F90 that are no longer needed.
- The file GmiConvectionMethod_mod.F90 will remain until we figure out how to handle GMI convective updrafts (feeding the calculation back to GMI Deposition).
- For now, Convection only does convective transport for any Chemistry configuration.

**Refactoring of CTM CC**
- Introduced an option (flag read_advCoreFields) to import CX/CY/MX/MY instead of computing them. They are then passed to AdvCore.
- Removed the import of PLE.
- Removed references to SGH.
- Changed the settings so that we read PS to compute PLE (using AK and BK).
- Renamed fields that are outputs and made sure that output fields are the same as the ones used for calculations.

**MERRA2 Template File**
- Removed references to the variables SGH, PLE, DELP.
- Added PS0, PS1 (to compute PLE0 and PLE1 inside the code).
- PS is now coming from files with instantaneous values (not time-averaged ones).
- Changed the climatology file (the old one did not have proper records).
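A minimal sketch of the Intel MPI setting described above, assuming ctm_setup emits SETENV.commands-style lines that ctm_run.j later applies. The $MPI_STACK test is a placeholder for however ctm_setup actually distinguishes Intel MPI from MPT:

```csh
# Sketch only: set the heap-size workaround when the build uses Intel MPI.
if ( "$MPI_STACK" == "intelmpi" ) then
   setenv I_MPI_SHM_HEAP_VSIZE 512   # prevents the crashes seen in long Intel MPI runs
endif
```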
Cmake much improved.
@mathomp4 I included the I_MPI_SHM_HEAP_VSIZE=512 setting on my CTM branch jkGEOSctm_on_SLESS12. I did several "long" tests with Intel MPI to confirm that the code no longer crashes and exits gracefully.
@JulesKouatchou Thanks for moving that …
I think these are more important on the Skylakes, but the GCM will be running with them for Intel MPI everywhere. The first fixes an issue at high resolution for Bill, so you might never see it in a CTM, but the second one fixes an issue Ben was able to trigger at C180 at 8x48, which isn't that enormous. I know the GCM (for all our testing) is zero-diff with them. I have to imagine the CTM would be as well, but I don't know how to test. But if you like, that can also be a second PR that I can make after you get this in?
@mathomp4 Thank you for the new settings. I want to have something that works on SLES12 first before doing internal CTM tests.
@weiyuan-jiang I think the …

Note that Scott is currently testing the GCM with …

I've asked NCCS about their thoughts on it (note: this value is probably only needed on Haswell, so I'll probably code up the GCM's scripts to apply it only if Intel MPI + Haswell).
Also, Bill Putman has, I think, four other variables he uses at night for his runs. I think three of them might be considered "generally useful", but I'm waiting for NCCS to respond before I add them to the GCM. If they are, I'll pass them along here as well.
- Introduced the flag do_ctmAdvection that is by default set to true. When it is set to false, the Advection run method is not called.
- Introduced an Internal State in the CTM parent gridded components. All the flags and other variables that were local module variables are now part of the internal state. This makes the code thread safe.
- do_ctmAdvection is by default set to TRUE, even before it is read in.
- Added the calls to A2D2C that were not initially captured.
- Added a printout of the settings during initialization.
- Reorganized the section where the Courant numbers and mass fluxes are computed, to allocate variables only when necessary.
@JulesKouatchou Could you please update the components.yaml and Externals.cfg to reflect the versions of the repos that you are satisfied with? (See your comment from April 18 above.) Also, do we still need to use MPT to prevent crashing?
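For illustration only, hedged fragments of what such version pins might look like. The paths, URLs, and tags below are placeholders, not the versions Jules settled on:

```yaml
# Hypothetical components.yaml (mepo) entry -- the tag is a placeholder
MAPL:
  local: ./src/Shared/@MAPL
  remote: ../MAPL.git
  tag: v2.1.3
```

```
# Hypothetical Externals.cfg (checkout_externals) entry -- the tag is a placeholder
[MAPL]
local_path = src/Shared/MAPL
protocol = git
repo_url = https://github.com/GEOS-ESM/MAPL.git
tag = v2.1.3
required = True
```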
@mmanyin The last experiments that I did were about two weeks ago. I did several long runs and noticed that the code crashed after about 165 days of integration (in one job segment), even after increasing the value of I_MPI_SHM_HEAP_VSIZE. @mathomp4 mentioned that Bill is using other settings that we need to include too. Do you want me to add the versions below as default for the CTM?
I think the graceful exit is probably due to MAPL not being new enough; we think we fixed that in 2.1.3. The GCM is currently using (in …
The other Bill flags probably won't help much. He has some that I think only affect high-res runs. The important ones are I_MPI_ADJUST_ALLREDUCE, I_MPI_ADJUST_GATHERV, and I_MPI_SHM_HEAP_VSIZE, we think.
@mathomp4 and @mmanyin I used:
and also the settings I_MPI_ADJUST_ALLREDUCE, I_MPI_ADJUST_GATHERV, and I_MPI_SHM_HEAP_VSIZE (512, 1024, 2048). The code exited gracefully but still crashed at the same integration date regardless of the value of I_MPI_SHM_HEAP_VSIZE.
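For reference, a sketch of the kind of environment block being tested in these runs (e.g. in ctm_setup / SETENV.commands). The I_MPI_ADJUST_* algorithm numbers are placeholders; the actual values used in these experiments are not recorded in this thread:

```csh
# Sketch only -- the algorithm numbers below are placeholders, not the tested values.
setenv I_MPI_SHM_HEAP_VSIZE   512   # also tried 1024 and 2048 above
setenv I_MPI_ADJUST_ALLREDUCE 12    # pin a specific allreduce algorithm
setenv I_MPI_ADJUST_GATHERV   3     # pin a specific gatherv algorithm
```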
ctm_setup:
ctm_run.j: