Alternative to mpi-serial lib - how to run CLM/cesm.exe directly #7

Open · serbinsh opened this issue Jun 19, 2019 · 6 comments
Labels: enhancement (New feature or request), high priority

@serbinsh
Owner

See: ESCOMP/CTSM#614

@serbinsh serbinsh self-assigned this Jun 19, 2019
@serbinsh serbinsh added enhancement New feature or request high priority labels Jun 19, 2019
@serbinsh
Owner Author

Here is the issue:

We need to run the exe directly, or invoke it with something other than mpirun, since we cannot nest mpirun calls.

case.submit runs like this, even when the libraries are compiled in serial:

WARNING: CLM is starting up from a cold state
   Calling /ctsm/cime/src/components/stub_comps/sice/cime_config/buildnml
   Calling /ctsm/cime/src/components/stub_comps/socn/cime_config/buildnml
   Calling /ctsm/components/mosart//cime_config/buildnml
   Calling /ctsm/cime/src/components/stub_comps/sglc/cime_config/buildnml
   Calling /ctsm/cime/src/components/stub_comps/swav/cime_config/buildnml
   Calling /ctsm/cime/src/components/stub_comps/sesp/cime_config/buildnml
   Calling /ctsm/cime/src/drivers/mct/cime_config/buildnml
Finished creating component namelists
-------------------------------------------------------------------------
 - Prestage required restarts into /ctsm_output/CLM5_1560968877/run
 - Case input data directory (DIN_LOC_ROOT) is /data/
 - Checking for required input datasets in DIN_LOC_ROOT
-------------------------------------------------------------------------
2019-06-19 18:53:24 MODEL EXECUTION BEGINS HERE
run command is mpirun -np 1 -npernode 4 /ctsm_output/CLM5_1560968877/bld/cesm.exe  >> cesm.log.$LID 2>&1

and we can no longer run it directly like this:

130-199-9-235:release-clm5.0.15_serial sserbin$ docker run -t -i --hostname=modex --user clmuser -v ~/Data/cesm_input_data:/data -v ~/scratch:/ctsm_output serbinsh/ctsm_containers:ctsm-release-clm5.0.15 /bin/sh -c 'cd /ctsm_output/CLM5_1560968877/ && ./bld/cesm.exe'
 ERROR: (cime_cpl_init) :: namelist read returns an end of file or end of record condition
#0  0x7F7D1D292BF0
#1  0x9844C0 in __shr_abort_mod_MOD_shr_abort_backtrace
#2  0x9846DB in __shr_abort_mod_MOD_shr_abort_abort
#3  0x41BAB2 in __cime_comp_mod_MOD_cime_pre_init1
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1001.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

So how can we mirror a case.submit run, but without using mpirun?
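
One possible explanation for the namelist error above (an assumption, not something confirmed in this thread): cesm.exe reads its namelists (drv_in, lnd_in, etc.) from the current working directory, and case.submit launches it from the case run directory (/ctsm_output/CLM5_1560968877/run here) rather than from the case root. A minimal sketch of invoking the binary by hand under that assumption, reusing the container and paths from the example above:

# Sketch: launch cesm.exe from the run directory so it can find its namelists.
# Paths are taken from the example above; adjust to your own case.
docker run -t -i --hostname=modex --user clmuser \
  -v ~/Data/cesm_input_data:/data -v ~/scratch:/ctsm_output \
  serbinsh/ctsm_containers:ctsm-release-clm5.0.15 \
  /bin/sh -c 'cd /ctsm_output/CLM5_1560968877/run && /ctsm_output/CLM5_1560968877/bld/cesm.exe'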

@serbinsh
Owner Author

Related to: #4

@thiagoveloso

Please check my update at ESCOMP/CTSM#614 (comment)

@serbinsh
Owner Author

serbinsh commented Jun 19, 2019

@thiagoveloso OK, that is very, very helpful! So it is technically feasible; I am just doing something wrong.

For example:

130-199-9-235:release-clm5.0.15_serial sserbin$ docker run -t -i --hostname=modex --user clmuser -v ~/Data/cesm_input_data:/data -v ~/scratch:/ctsm_output serbinsh/ctsm_containers:ctsm-release-clm5.0.15 /bin/sh -c 'cd /ctsm_output/CLM5_1560968877/ && ./preview_run'
CASE INFO:
  nodes: 1
  total tasks: 1
  tasks per node: 1
  thread count: 1

BATCH INFO:
  FOR JOB: case.run
    ENV:
      Setting Environment HDF5_HOME=/usr/local/hdf5
      Setting Environment NETCDF_PATH=/usr/local/netcdf
      Setting Environment OMP_NUM_THREADS=1
    SUBMIT CMD:
      None

MPIRUN:
  mpirun -np 1 -npernode 4 /ctsm_output/CLM5_1560968877/bld/cesm.exe  >> cesm.log.$LID 2>&1
130-199-9-235:release-clm5.0.15_serial sserbin$

So what I need to figure out is how you set up your machine files such that preview_run gives you

MPIRUN:
    /no_backup/GroupData/CLM/scratch/PTCLM5BGC/bld/cesm.exe  >> cesm.log.$LID 2>&1

instead of something like

MPIRUN:
mpirun -np 1 -npernode 4 /ctsm_output/CLM5_1560968877/bld/cesm.exe  >> cesm.log.$LID 2>&1

Would you be willing to share your XML files, specifically config_machines, config_batch, and config_compilers? Or just the XML blocks pertaining to your machine?
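
In the meantime, here is my guess at the relevant piece (a sketch only, not your actual setup): an <mpirun> entry whose <executable> is empty for the case's MPILIB, so that CIME launches cesm.exe without an mpirun wrapper.

# Sketch (assumption): in config_machines.xml (or the case's env_mach_specific.xml),
# give the selected mpilib an <mpirun> entry with an empty <executable>, e.g.
#
#   <mpirun mpilib="openmpi">
#     <executable></executable>
#   </mpirun>
#
# then check the run command CIME will actually use:
cd /ctsm_output/CLM5_1560968877 && ./preview_run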

@thiagoveloso

@serbinsh I just sent you an e-mail showing how I set up CLM5 on my local HPC.

@thiagoveloso

@serbinsh, I have since lost access to the email address at the institution I was working for when I sent you the messages above.

However, I currently need to build PTCLM5 on another machine. Would you be able to share here the contents of the email I sent you back in the day?

Thanks in advance!
