Bug fix restart reproducibility of OBC BGC data#1121
Conversation
The time reference for CS%dt_obc_seg_time was initialized from Time (the start of the current run segment) rather than Time_init (the overall simulation start time). After a restart, Time > Time_init, so the first scheduled update could happen at a different phase than it would in a continuous run, breaking restart reproducibility. The fix anchors dt_obc_seg_time to Time_init so that the update schedule is invariant to where in the simulation a restart occurs. The old behavior is preserved by OBC_BGC_TIME_REF_BUG.
When DT_OBC_SEG_UPDATE_OBGC exceeds DT_TRACER_ADVECT, BGC OBC external data (%t) read at initialization cannot be recovered after a restart because it is not saved in restart files. This means the first tracer advection step after a restart may use incorrect boundary data. Issue a FATAL error when this condition is detected on a restart run, and a WARNING on a new run to alert the user that any future restart will fail. Both messages direct users to set DT_OBC_SEG_UPDATE_OBGC to DT_TRACER_ADVECT or 0 until %t is added to restart files.
We should run some timing tests, but I think deprecating and later removing the DT_OBC_SEG_UPDATE_OBGC parameter could be a good idea. This setting exists because the boundary condition data update was really expensive in early versions of the MOM6+COBALT models. However, I tracked this down to a bug in FMS and it was fixed a long time ago in NOAA-GFDL/FMS#811. It's likely that we can have the BGC tracer boundary data updated at the same time as everything else without slowing down much now. |
Thanks @andrew-c-ross for providing the context. In terms of timing, there will be a following PR, addressing the issue when the tracer data should be updated. Currently, by default, all tracer data (thermodynamics + BGC) is read and update every dynamic time step, which is high inefficient and unnecessary. For NWA12, this is twice more file read, remapping and data copy than needed. After all of this, we will have less frequent T/S and BGC tracer updates. For NWA12, which I think is using DT_OBC_SEG_UPDATE_OBGC = DT_TRACER_ADVECT, there will probably be a modest gain of performance. |
DT_OBC_SEG_UPDATE_OBGCcontrols how often BGC tracer data is read from OBC segment files. As it is now, there are at least two problems with it that could lead to restart reproducibility issues and this PR partially addresses them.OBC_BGC_TIME_REF_BUG)The time reference for
CS%dt_obc_seg_timewas initialized fromTime(the start of the current run segment) rather thanTime_init(the overall simulation start time). After a restart, Time > Time_init, so the first scheduled update could happen at a different phase than it would in a continuous run, breaking restart reproducibility.The fix b09481e anchors
dt_obc_seg_timetoTime_initso the update schedule is phase-invariant across restarts. The old behavior is preserved byOBC_BGC_TIME_REF_BUG, which defaults toTrue.DT_OBC_SEG_UPDATE_OBGC > DT_TRACER_ADVECTBGC OBC external data (
%t) is not saved in restart files. WhenDT_OBC_SEG_UPDATE_OBGC > DT_TRACER_ADVECT, the model cannot reconstruct the boundary state at restart, as it is agnostic of the thickness used for remapping during the last update. So the first tracer advection step after any restart may use incorrect data regardless of the time-reference fix above.To protect users from this silent error:
Both messages direct users to set
DT_OBC_SEG_UPDATE_OBGC = DT_TRACER_ADVECTor0until%tis added to restart files.Comments:
DT_OBC_SEG_UPDATE_OBGC, which is probably very rare. By default,DT_OBC_SEG_UPDATE_OBGCis baroclinicDTand in practice, it is usually (and should be) an integer multiple ofDT_TRACER_ADVECT.DT_OBC_SEG_UPDATE_OBGCisDT_TRACER_ADVECTuntil external data is saved in restart files. Doing so would require the model to carry two additional 4D arrays (the %t-counterpart ofOBC%tres_[xy]) across restarts, which is a substantial memory cost. Given that the parameter is currently broken for all but a narrow range of values, it may be worth a broader discussion about whether to deprecate it entirely.