help with FATES-Hydro SP mode run in Amazon basin? / Hydro SP mode restart problem? #1175

Closed
jennykowalcz opened this issue Mar 22, 2024 · 9 comments

Comments

@jennykowalcz

jennykowalcz commented Mar 22, 2024

Edited to add: this appears to be a restart issue. Is this a known issue?


I'm trying to do a pair of SP mode runs with Hydro on/off over the Amazon basin. I haven't run in SP mode before, so it's entirely possible that there's something wrong with my setup. Upon restart after a successful 10-year run, Hydro crashes with large negative leaf and stem water content:

18:  Could not find a stable solution for hydro 1D solve
 18:  
 18:  error code:            1
 18:  error diag:   0.000000000000000E+000  0.000000000000000E+000
 18:   0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
 18:  lat:  -0.471204188481675      longitidue:   302.500000000000     
 18:  is recruitment:  F
 18:  layer:            1
 18:  wb_step_err =  -3.900982086510368E-004
 18:  q_top_eff*dt_step =   9.462095125488155E-006
 18:  w_tot_beg =   -36531.0304772177     
 18:  w_tot_end =   -36531.0308767780     
 18:  leaf water:   -5532.68297630264       kg/plant
 18:  stem_water:   -31000.9848186136       kg/plant
 18:  troot_water:  -0.121097530349677     
 18:  aroot_water:  -1.484245688044263E-003
 18:  LWP:   -1708641.39556254     
 18:  dbh:    61.1492471203896     
 18:  pft:            1
 18:  z nodes:    32.5554586118159        16.2527293059080     
 18:  -0.222826201934367      -2.254589358796573E-002 -2.254589358796573E-002
 18:  psi_z:   0.319043494295329       0.159276747144759      -2.183696720749140E-003
 18:  -2.209497542935424E-004  0.000000000000000E+000
 18:  vol,    theta,   H,  Psi,     kmax-
 18:  flux:            9.462095125488155E-006
 18:  l:  0.105397205791370       -52.4936399856215       -1441265.20855828     
 18:   -1441265.52760178     
 18:                          0.121139556003566     
 18:  s:  0.589193527671826       -52.6159629436405       -1340287.62511622     
 18:   -1340287.78439297     
 18:                          0.594967528140614     
 18:  t:  2.301545426864054E-006  -52.6157463312280       -1340282.29165635     
 18:                          0.251617751505223     
 18:  a:  3.663185927743921E-007  -2.47336174472767       -38997.5954054888     
 18:                     in:  0.163491354918982     
 18:                    out:  1.246982693212379E-006
 18:  r1:   6.22113574328013       4.435397942069152E-004 -1.338559906910513E+022
 18:                          0.000000000000000E+000
 18:  r2:  0.000000000000000E+000  0.000000000000000E+000  4.940656458412465E-324
 18:                          0.000000000000000E+000
 18:  r3:  -1441265.52760178       0.000000000000000E+000  6.299676333464739E-315
 18:                                             NaN
 18:  r4:  -1340287.78439297       0.000000000000000E+000  3.952525166729972E-322
 18:                          0.000000000000000E+000
 18:  r5:  -1340282.28947266       -52.4592053472706       0.000000000000000E+000
 18:  kmax_aroot_radial_out:   1.246985969321212E-006
 18:  surf area of root:   1.246985969321212E-002
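
For anyone scanning a dump like the one above, here is a minimal Python sketch that pulls out the per-plant water pools and flags the ones reported below zero. It is not part of FATES; the function name `negative_water_pools` is hypothetical, and the regular expression assumes the log layout shown in this post (a rank prefix such as `18:` followed by the pool label and a Fortran-formatted number).

```python
import re

# Matches lines such as " 18:  leaf water:   -5532.68297630264       kg/plant".
# The leading number is the rank prefix printed in front of every log line.
_WATER_RE = re.compile(
    r"^\s*(?P<rank>\d+):\s+"
    r"(?P<name>leaf water|stem_water|troot_water|aroot_water):\s+"
    r"(?P<value>[-+]?\d*\.?\d+(?:[Ee][-+]?\d+)?)"
)


def negative_water_pools(log_text):
    """Return {(rank, pool_name): value} for every water pool reported below zero."""
    found = {}
    for line in log_text.splitlines():
        match = _WATER_RE.match(line)
        if match is None:
            continue
        value = float(match.group("value"))
        if value < 0.0:
            found[(int(match.group("rank")), match.group("name"))] = value
    return found


# A few lines copied from the dump above, plus one non-water line for contrast.
sample = "\n".join([
    " 18:  leaf water:   -5532.68297630264       kg/plant",
    " 18:  stem_water:   -31000.9848186136       kg/plant",
    " 18:  troot_water:  -0.121097530349677",
    " 18:  aroot_water:  -1.484245688044263E-003",
    " 18:  dbh:    61.1492471203896",
])

for (rank, name), value in sorted(negative_water_pools(sample).items()):
    print(f"rank {rank}: {name} = {value} (negative)")
```

All four pools in this dump come back negative, which is consistent with the plant state being unphysical before the solver starts rather than the solver drifting during the step.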

I am running with a single PFT, the broadleaf evergreen tropical tree (PFT 1, extracted from fates_params_api25.5.0_12pft_c230710.nc), and I have fates_hlm_pft_map set to all 1's. LAI looks rather low in some regions, but not where the crash occurred (red circle):
[Figure: FATES plots]

The soil looks wet enough and the water balance looks okay, I think, at the month before the crash, so I am perplexed:
[Figure: FATES plots (1)]

Is anything obviously wrong with my SP mode setup and/or has anyone encountered something like this with hydro SP mode?

@jennykowalcz
Author

I should have added: the above is with https://github.com/NGEET/fates/releases/tag/sci.1.68.2_api.30.0.0 and #1156.

I just gave a quick try with https://github.com/NGEET/fates/releases/tag/sci.1.70.0_api.32.0.0_tools.1.1.0 (and PFT 1 extracted from fates_params_api.32.0.0_12pft_c231215.nc) and it fails upon restart in SP mode with both hydro on and hydro off ☹️ The error with hydro on is:


  0: corrupted size vs. prev_size
128: free(): invalid pointer
128: forrtl: error (76): Abort trap signal
128: Image              PC                Routine            Line        Source             
128: libpthread-2.31.s  0000148D72BEF910  Unknown               Unknown  Unknown
128: libc-2.31.so       0000148D72651D2B  gsignal               Unknown  Unknown
128: libc-2.31.so       0000148D726533E5  abort                 Unknown  Unknown
128: libc-2.31.so       0000148D72697C27  Unknown               Unknown  Unknown
128: libc-2.31.so       0000148D7269FCCA  Unknown               Unknown  Unknown
128: libc-2.31.so       0000148D726A1774  Unknown               Unknown  Unknown
128: libpnetcdf_intel.  0000148D7772D6FC  ncmpio_free_NC_va     Unknown  Unknown
128: libpnetcdf_intel.  0000148D7772D85C  ncmpio_free_NC_va     Unknown  Unknown
128: libpnetcdf_intel.  0000148D77729921  ncmpio_free_NC        Unknown  Unknown
128: libpnetcdf_intel.  0000148D77729CAC  ncmpio_close          Unknown  Unknown
128: libpnetcdf_intel.  0000148D7765E3C2  ncmpi_close           Unknown  Unknown
128: e3sm.exe           000000000175DD14  Unknown               Unknown  Unknown
128: e3sm.exe           00000000016B8473  piolib_mod_mp_clo        1108  piolib_mod.F90
128: e3sm.exe           00000000006AC20E  restfilemod_mp_re        1180  restFileMod.F90
128: e3sm.exe           00000000006A9825  restfilemod_mp_re         438  restFileMod.F90
128: e3sm.exe           000000000056DAA5  elm_driver_mp_elm        1529  elm_driver.F90
128: e3sm.exe           0000000000551710  lnd_comp_mct_mp_l         617  lnd_comp_mct.F90
128: e3sm.exe           000000000045698E  component_mod_mp_         757  component_mod.F90
128: e3sm.exe           000000000043637B  cime_comp_mod_mp_        2915  cime_comp_mod.F90
128: e3sm.exe           0000000000456622  MAIN__                    153  cime_driver.F90
128: e3sm.exe           0000000000433AED  Unknown               Unknown  Unknown
128: libc-2.31.so       0000148D7263C24D  __libc_start_main     Unknown  Unknown
128: e3sm.exe           0000000000433A1A  Unknown               Unknown  Unknown
srun: error: nid006022: task 128: Aborted
srun: Terminating StepId=23294988.0
  0: slurmstepd: error: *** STEP 23294988.0 ON nid005300 CANCELLED AT 2024-03-22T17:19:21 ***

And with hydro off:

163:  A FATES iotype was created that was not registerred
163:  in CLM.:
163:  [unprintable bytes]
163:  ENDRUN:
163:  ERROR in elmfates_interfaceMod.F90 at line 1665                                
163:                                                                                 
163:                                                                                 
163:                                                                                 
163:                                                                                 
163:                                                                                 
163:                                        
163:  ERROR: Unknown error submitted to shr_abort_abort.
163: Image              PC                Routine            Line        Source             
163: e3sm.exe           0000000001428C4D  shr_abort_mod_mp_         114  shr_abort_mod.F90
163: e3sm.exe           000000000055DC47  abortutils_mp_end          43  abortutils.F90
163: e3sm.exe           00000000005E8739  elmfatesinterface        1665  elmfates_interfaceMod.F90
163: e3sm.exe           00000000006A97BD  restfilemod_mp_re         414  restFileMod.F90
163: e3sm.exe           000000000056DAA5  elm_driver_mp_elm        1529  elm_driver.F90
163: e3sm.exe           0000000000551710  lnd_comp_mct_mp_l         617  lnd_comp_mct.F90
163: e3sm.exe           000000000045698E  component_mod_mp_         757  component_mod.F90
163: e3sm.exe           000000000043637B  cime_comp_mod_mp_        2915  cime_comp_mod.F90
163: e3sm.exe           0000000000456622  MAIN__                    153  cime_driver.F90
163: e3sm.exe           0000000000433AED  Unknown               Unknown  Unknown
163: libc-2.31.so       0000149CAE83C24D  __libc_start_main     Unknown  Unknown
163: e3sm.exe           0000000000433A1A  Unknown               Unknown  Unknown
163: MPICH ERROR [Rank 163] [job id 23295536.0] [Fri Mar 22 10:34:23 2024] [nid004966] - Abort(1001) (rank 163 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 163
163: 
163: aborting job:
163: application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 163
srun: error: nid004966: task 163: Exited with exit code 255
srun: Terminating StepId=23295536.0
  0: slurmstepd: error: *** STEP 23295536.0 ON nid004758 CANCELLED AT 2024-03-22T17:34:32 ***

@rgknox
Contributor

rgknox commented Mar 25, 2024

Some of the issues here seem to be associated with incompatible e3sm/fates branches. Here are my recommendations:

This branch of E3SM is compatible with API 33:
https://github.com/rgknox/E3SM/tree/lnd/fates-twostream

Make sure to run:

git submodule update --init --recursive

after you check it out. When that is done, manually check out the latest fates API 33 tag:
https://github.com/NGEET/fates/releases/tag/sci.1.72.1_api.33.0.0

And go from there.
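
To double-check which API a given FATES tag carries before pairing it with a host-model branch, the tag names quoted in this thread can be parsed mechanically. The following is a minimal Python sketch, not an official tool: `parse_fates_tag` and `api_major_matches` are hypothetical helpers, the pattern is inferred only from the three tags mentioned here, and comparing the API major number alone is an assumption about how compatibility is judged.

```python
import re
from typing import NamedTuple, Optional, Tuple

Version = Tuple[int, int, int]

# Pattern inferred from tags in this thread, e.g.
#   sci.1.72.1_api.33.0.0
#   sci.1.70.0_api.32.0.0_tools.1.1.0
_TAG_RE = re.compile(
    r"^sci\.(\d+)\.(\d+)\.(\d+)"
    r"_api\.(\d+)\.(\d+)\.(\d+)"
    r"(?:_tools\.(\d+)\.(\d+)\.(\d+))?$"
)


class FatesTag(NamedTuple):
    sci: Version
    api: Version
    tools: Optional[Version]


def parse_fates_tag(tag: str) -> FatesTag:
    """Split a FATES release tag into its sci, api, and optional tools versions."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized FATES tag: {tag!r}")
    groups = match.groups()
    sci = tuple(int(g) for g in groups[0:3])
    api = tuple(int(g) for g in groups[3:6])
    tools = tuple(int(g) for g in groups[6:9]) if groups[6] is not None else None
    return FatesTag(sci, api, tools)


def api_major_matches(tag: str, expected_api_major: int) -> bool:
    """True if the tag's API major number equals the one the host branch expects."""
    return parse_fates_tag(tag).api[0] == expected_api_major


for tag in ("sci.1.68.2_api.30.0.0", "sci.1.72.1_api.33.0.0"):
    status = "matches" if api_major_matches(tag, 33) else "does not match"
    print(f"{tag}: {status} API 33")
```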

I'm trying to generate errors to troubleshoot this with you, @jennykowalcz. I have an unstructured grid setup in a similar region with dry and wet sites. I'm also trying hydro and SP.

@jennykowalcz
Author

Thanks @rgknox! I'll give it a try with API 33.

@jennykowalcz
Author

Hmm, upon restart from the SP hydro simulation with API 33, I get an error (below) similar to the one I got with API 30.


 64:  Could not find a stable solution for hydro 1D solve
 64:  
 64:  error code:            1
 64:  error diag:   0.000000000000000E+000  0.000000000000000E+000
 64:   0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
 64:  lat:    2.35602094240838      longitidue:   297.500000000000     
 64:  is recruitment:  F
 64:  layer:            1
 64:  wb_step_err =   1.431012141802750E-003
 64:  q_top_eff*dt_step =   7.578653276834656E-006
 64:  w_tot_beg =   -24326.9117333547     
 64:  w_tot_end =   -24326.9103099212     
 64:  leaf water:   -3861.09336995793       kg/plant
 64:  stem_water:   -20468.7824211296       kg/plant
 64:  troot_water:  -6.938551238789790E-002
 64:  aroot_water:  -2.643979522735622E-003
 64:  LWP:   -1076719.05688996     
 64:  dbh:    65.7814845895811     
 64:  pft:            1
 64:  z nodes:    34.1153873740694        17.0326936870347     
 64:  -0.222826201934367      -2.254589358796573E-002 -2.254589358796573E-002
 64:  psi_z:   0.334330796264112       0.166920398129150      -2.183696837164462E-003
 64:  -2.209497579315212E-004  0.000000000000000E+000
 64:  vol,    theta,   H,  Psi,     kmax-
 64:  flux:            7.578653276834656E-006
 64:  l:  0.124492473363023       -31.0146731417158       -853441.282108291     
 64:   -853441.616439087     
 64:                          0.131527367910411     
 64:  s:  0.678955521931992       -30.1474570276488       -770317.281594762     
 64:   -770317.448515160     
 64:                          0.648231414230674     
 64:  t:  2.301545426864054E-006  -30.1473573269585       -770314.921539325     
 64:                          0.273354315070265     
 64:  a:  3.663185927743921E-007  -1.76807802041734       -28403.5256965200     
 64:                     in:  0.167801571713568     
 64:                    out:  1.246982937515623E-006
 64:  r1:   6.84062848959576       4.435397901812466E-004 -1.380757471599267E+022
 64:                          0.000000000000000E+000
 64:  r2:  0.000000000000000E+000  0.000000000000000E+000  3.952525166729972E-322
 64:                          0.000000000000000E+000
 64:  r3:  -853441.616439087       0.000000000000000E+000  0.000000000000000E+000
 64:                                             NaN
 64:  r4:  -770317.448515160       0.000000000000000E+000  2.920560865014090E-317
 64:                          0.000000000000000E+000
 64:  r5:  -770314.919355628       -30.9885955273452       1.124331281805253E-310
 64:  kmax_aroot_radial_out:   1.246985969321212E-006
 64:  surf area of root:   1.246985969321212E-002
 64:  aroot_frac_plant:   1.517045462491681E-002   19.8463980983696     
 64:    1308.22698390151     
 64:  kmax_upper_shell:    88.6950499175435     
 64:  kmax_lower_shell:    8.29024808654960     
 64:  
 64:  tree lai:    4.20707664262663       m2/m2 crown
 64:  area and area to volume ratios
 64:  
 64:  a:  3.663185927743921E-007
 64:                          1.246985969321212E-002
 64:  r1:   6.84062848959576     
 64:                           41.3041608687354     
 64:  r2:  0.000000000000000E+000
 64:                        
 64:  r3:  -853441.616439087     
 64:                        
 64:  r4:  -770317.448515160     
 64:                        
 64:  r5:  -770314.919355628     
 64:  inner shell kmaxs:    88.6950499175435        62.7917623098249     
 64:    70.3543488352819        64.6104140473028        51.0021530880871     
 64:    34.9953643832409        34.2186386666329        33.0585874969261     
 64:    19.8353005916627     
 64:  ENDRUN:
 64:  ERROR in FatesPlantHydraulicsMod.F90 at line 3456  

I'll try the non-hydro case when Perlmutter comes back up.

@jennykowalcz
Author

@rgknox am I right in thinking that this should be a pretty low-priority bug since SP mode is computationally cheap and doesn't need to run for a lot of model years, so being able to restart runs is not very important?

@rgknox
Contributor

rgknox commented Mar 28, 2024

It does seem like a useful way of testing hydro, which is slow and would benefit from having accurate restarting.

@jennykowalcz
Author

Update: @rgknox has determined that hydro is not compatible with SP mode at present. He and @glemieux are adding a check that will stop the model if one tries to run with both hydro and SP mode on.

@glemieux
Contributor

glemieux commented Apr 10, 2024

> Update: @rgknox has determined that hydro is not compatible with SP mode at present. He and @glemieux are adding a check that will stop the model if one tries to run with both hydro and SP mode on.

For reference, this will come into CTSM via the API 35 update (see ESCOMP/CTSM#2436), which could be integrated as soon as the end of this week. The E3SM API 35 update will come at a later date.
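
Until that check is available in the code being run, the same guard can be applied in a pre-run script. The sketch below is a Python illustration only: the real check lives in the model source and is not reproduced here, `check_hydro_sp` and `IncompatibleOptionsError` are hypothetical names, and the option keys `use_fates_sp` and `use_fates_planthydro` are assumed for illustration.

```python
class IncompatibleOptionsError(ValueError):
    """Raised when two mutually exclusive run options are both enabled."""


def check_hydro_sp(options):
    """Stop early if both SP mode and plant hydraulics are switched on.

    `options` is a plain dict of run settings. The key names are assumptions
    made for this sketch; a missing key is treated as "off".
    """
    use_sp = bool(options.get("use_fates_sp", False))
    use_hydro = bool(options.get("use_fates_planthydro", False))
    if use_sp and use_hydro:
        raise IncompatibleOptionsError(
            "hydro is not compatible with SP mode at present; "
            "turn off one of use_fates_sp / use_fates_planthydro"
        )


# Hydro off with SP on passes silently; both on is rejected.
check_hydro_sp({"use_fates_sp": True, "use_fates_planthydro": False})
try:
    check_hydro_sp({"use_fates_sp": True, "use_fates_planthydro": True})
except IncompatibleOptionsError as err:
    print(f"refusing to run: {err}")
```

Failing before the run starts is cheaper than discovering the problem at restart after a multi-year simulation, which is how it surfaced in this issue.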

@glemieux
Contributor

glemieux commented Jul 18, 2024

See #1227 for updates on making hydro and SP mode work together.

@github-project-automation github-project-automation bot moved this from ❕Todo to ✔ Done in FATES issue board Jul 18, 2024