Skip to content

Bug fix restart reproducibility of OBC BGC data#1121

Open
herrwang0 wants to merge 2 commits into
NOAA-GFDL:dev/gfdlfrom
herrwang0:fix-obc-bgc-time-ref
Open

Bug fix restart reproducibility of OBC BGC data#1121
herrwang0 wants to merge 2 commits into
NOAA-GFDL:dev/gfdlfrom
herrwang0:fix-obc-bgc-time-ref

Conversation

@herrwang0
Copy link
Copy Markdown

DT_OBC_SEG_UPDATE_OBGC controls how often BGC tracer data is read from OBC segment files. As it is now, there are at least two problems with it that could lead to restart reproducibility issues and this PR partially addresses them.

  1. Time reference bug (OBC_BGC_TIME_REF_BUG)

The time reference for CS%dt_obc_seg_time was initialized from Time (the start of the current run segment) rather than Time_init (the overall simulation start time). After a restart, Time > Time_init, so the first scheduled update could happen at a different phase than it would in a continuous run, breaking restart reproducibility.

The fix b09481e anchors dt_obc_seg_time to Time_init so the update schedule is phase-invariant across restarts. The old behavior is preserved by OBC_BGC_TIME_REF_BUG, which defaults to True.

  1. Deprecate the use of DT_OBC_SEG_UPDATE_OBGC > DT_TRACER_ADVECT

BGC OBC external data (%t) is not saved in restart files. When DT_OBC_SEG_UPDATE_OBGC > DT_TRACER_ADVECT, the model cannot reconstruct the boundary state at restart, as it is agnostic of the thickness used for remapping during the last update. So the first tracer advection step after any restart may use incorrect data regardless of the time-reference fix above.

To protect users from this silent error:

  • A FATAL is issued when this condition is detected on a restart run.
  • A WARNING is issued on a new run to alert the user that any future restart from it will fail.

Both messages direct users to set DT_OBC_SEG_UPDATE_OBGC = DT_TRACER_ADVECT or 0 until %t is added to restart files.

Comments:

  1. The time reference bug only affects the cases when the difference between restart time and initial time is not an integer multiple of DT_OBC_SEG_UPDATE_OBGC, which is probably very rare. By default, DT_OBC_SEG_UPDATE_OBGC is baroclinic DT and in practice, it is usually (and should be) an integer multiple of DT_TRACER_ADVECT.
  2. There is a third, perhaps more substantial issue for all OBC tracers, that their updates happen at the beginning every dynamics time step rather than tracer time step. Not only is this unnecessary, but it also uses a pre-updated thickness for remapping external data. This issue will be addressed in a subsequent PR.
  3. Therefore, the only valid value for DT_OBC_SEG_UPDATE_OBGC is DT_TRACER_ADVECT until external data is saved in restart files. Doing so would require the model to carry two additional 4D arrays (the %t-counterpart of OBC%tres_[xy]) across restarts, which is a substantial memory cost. Given that the parameter is currently broken for all but a narrow range of values, it may be worth a broader discussion about whether to deprecate it entirely.

herrwang0 added 2 commits May 21, 2026 21:48
The time reference for CS%dt_obc_seg_time was initialized from Time
(the start of the current run segment) rather than Time_init (the
overall simulation start time).  After a restart, Time > Time_init,
so the first scheduled update could happen at a different phase than it
would in a continuous run, breaking restart reproducibility.

The fix anchors dt_obc_seg_time to Time_init so that the update
schedule is invariant to where in the simulation a restart occurs.

The old behavior is preserved by OBC_BGC_TIME_REF_BUG.
When DT_OBC_SEG_UPDATE_OBGC exceeds DT_TRACER_ADVECT, BGC OBC external
data (%t) read at initialization cannot be recovered after a restart
because it is not saved in restart files.  This means the first tracer
advection step after a restart may use incorrect boundary data.

Issue a FATAL error when this condition is detected on a restart run,
and a WARNING on a new run to alert the user that any future restart
will fail.  Both messages direct users to set DT_OBC_SEG_UPDATE_OBGC
to DT_TRACER_ADVECT or 0 until %t is added to restart files.
@herrwang0 herrwang0 added bug Something isn't working Parameter change Input parameter changes (addition, removal, or description) labels May 22, 2026
@andrew-c-ross
Copy link
Copy Markdown

Given that the parameter is currently broken for all but a narrow range of values, it may be worth a broader discussion about whether to deprecate it entirely.

We should run some timing tests, but I think deprecating and later removing the DT_OBC_SEG_UPDATE_OBGC parameter could be a good idea. This setting exists because the boundary condition data update was really expensive in early versions of the MOM6+COBALT models. However, I tracked this down to a bug in FMS and it was fixed a long time ago in NOAA-GFDL/FMS#811. It's likely that we can have the BGC tracer boundary data updated at the same time as everything else without slowing down much now.

@herrwang0
Copy link
Copy Markdown
Author

Given that the parameter is currently broken for all but a narrow range of values, it may be worth a broader discussion about whether to deprecate it entirely.

We should run some timing tests, but I think deprecating and later removing the DT_OBC_SEG_UPDATE_OBGC parameter could be a good idea. This setting exists because the boundary condition data update was really expensive in early versions of the MOM6+COBALT models. However, I tracked this down to a bug in FMS and it was fixed a long time ago in NOAA-GFDL/FMS#811. It's likely that we can have the BGC tracer boundary data updated at the same time as everything else without slowing down much now.

Thanks @andrew-c-ross for providing the context. In terms of timing, there will be a following PR, addressing the issue when the tracer data should be updated. Currently, by default, all tracer data (thermodynamics + BGC) is read and update every dynamic time step, which is high inefficient and unnecessary. For NWA12, this is twice more file read, remapping and data copy than needed.

After all of this, we will have less frequent T/S and BGC tracer updates. For NWA12, which I think is using DT_OBC_SEG_UPDATE_OBGC = DT_TRACER_ADVECT, there will probably be a modest gain of performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working Parameter change Input parameter changes (addition, removal, or description)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants