OBC Restarts Failing Fix #437

Draft
manishvenu wants to merge 2 commits into NCAR:dev/ncar from CROCODILE-CESM:obc_restart_fix_from_dev_ncar

@manishvenu manishvenu commented Jun 3, 2026

Collaborator

Details:

When OBCs are on, the restart tests fail (tests here: ESCOMP/MOM_interface#315). When OBCs are off (the number of OBC segments is zero), they pass.

Reproduce:

  1. I'm using a hodgepodge of branches and checkouts required to run regional MOM6 inside the CESM, which is mostly captured here: https://github.com/CROCODILE-CESM/CESM/tree/full_regional_cesm

  2. Run a simple ERS test:

qcmd -A NCGD0011 -- /glade/work/manishrv/installs/cesm3_maddd_new/cime/scripts/create_test --test-root /glade/derecho/scratch/manishrv/tests/regional/reg_ers --generate /glade/derecho/scratch/manishrv/baselines/regional ERS_Ld9.USER_RES.CR_JRA.derecho_intel.mom-regional-base -o --no-run

Root Cause:

With Claude's help and a lot of iteration, the issue seems to be that across an exact restart, the OBC-exterior h halo (the ghost ring just outside open-boundary segments) is not reconstructed bit-for-bit. The interior thickness restarts perfectly, but the exterior ghost cells come back different: the restart file stores only the computational domain, and the existing thickness-reservoir restart path (h_res_x/h_res_y, gated by use_h_res) does not restore the value the friction solve needs.

That stale ghost h is consumed by vertvisc's DIRECT_STRESS body force (MOM_vert_friction.F90, h_a = 0.5*(h(i,j,k)+h(i+1,j,k)), ~L698 u / L918 v), which spreads wind stress over the top HMIX_STRESS of fluid. It reads the ghost cell with no OBC projection, which perturbs the surface velocity and breaks restart. (vertvisc_coef reads the ghost cell too but already projects it away with a zero-gradient condition, so its coefficients stay bit-perfect; the .z. remapped diagnostics also read it.)
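To make the mechanism concrete, here is a minimal Python sketch of the u-point average quoted above. It is not MOM6 code: the function name and all thickness values are hypothetical, chosen only to show that a ghost value that differs between the continuous and restarted runs changes h_a, while a zero-gradient ghost (a copy of the interior value) gives the same h_a in both runs.

```python
def h_at_u_point(h_here, h_next):
    """Arithmetic-mean thickness at a u-point: 0.5*(h(i) + h(i+1))."""
    return 0.5 * (h_here + h_next)


# Hypothetical thicknesses in metres; not taken from the failing case.
h_interior = 2.0           # last interior cell, restarts bit-for-bit
h_ghost_continuous = 2.5   # ghost cell as seen by the continuous run
h_ghost_restarted = 2.53   # same ghost cell after restart (cm-scale offset)

# Reading the raw ghost cell: the two runs see different h_a.
h_a_continuous = h_at_u_point(h_interior, h_ghost_continuous)
h_a_restarted = h_at_u_point(h_interior, h_ghost_restarted)

# Zero-gradient ghost: the ghost is a copy of the interior value,
# so h_a depends only on data that is stored in the restart file.
h_a_zero_gradient = h_at_u_point(h_interior, h_interior)

print("raw ghost read matches across restart:", h_a_continuous == h_a_restarted)
print("zero-gradient h_a equals interior h:", h_a_zero_gradient == h_interior)
```

Because the stress is spread over a depth derived from h_a, any difference in h_a at a boundary-adjacent point shows up directly in the surface velocity there.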

The fix:
There are three options; the first is a workaround.

  1. Set DIRECT_STRESS = False. DIRECT_STRESS is the only consumer that uses the halo ghost ring without shielding. The restart tests still fail because the z-remapped diagnostics (the ".z." files) seem to use the ghost ring as well, but the model itself restarts perfectly.
  2. This PR! We set the halo thickness to exactly the first thickness inside the computational domain. This is not without precedent: vertvisc_coef has a similar protection, but only for that routine.
  3. A fix in the direct-stress code itself: Applying wind stresses at OBC boundaries ACCESS-NRI/MOM6#54. As with 1), you still get the minor z-file differences.

Across an exact restart, the OBC-exterior h halo (the ghost ring just
outside open-boundary segments) is not reconstructed bit-for-bit. The
interior thickness restarts perfectly, but the exterior ghost cells come
back slightly different (~cm-scale, not roundoff).

vertvisc_coef projects thickness outward across OBCs using a zero-gradient
condition and reads this exterior h/dz; a stale/inconsistent exterior value
perturbs the boundary-adjacent coupling coefficient, so vertvisc returns a
last-bit-different velocity that grows over the run and breaks the CIME ERS
exact-restart test (COMPARE_base_rest). The same configuration with
OBC_NUMBER_OF_SEGMENTS=0 restarts bit-for-bit.

Rebuild the exterior h with a zero-gradient copy from the first interior
column/row at the top of step_MOM_dyn_split_RK2, right after
update_OBC_ramp. Two passes (E/W exterior columns first, then N/S exterior
rows) so diagonal corner cells where two segments meet trace back to the
bit-perfect interior h rather than a stale exterior cell.
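The two-pass ordering described above can be sketched in plain Python. This is an illustration under stated assumptions, not the Fortran in this commit: it assumes a one-cell halo, open boundaries on all four sides, and a grid stored as a list of rows indexed h[j][i]; the function name and the STALE sentinel are hypothetical.

```python
def two_pass_fill(h, nx, ny):
    """Zero-gradient fill of a one-cell halo around an nx-by-ny interior."""
    # Pass 1: E/W exterior columns, interior rows only.
    for j in range(1, ny + 1):
        h[j][0] = h[j][1]            # west ghost <- first interior column
        h[j][nx + 1] = h[j][nx]      # east ghost <- last interior column
    # Pass 2: N/S exterior rows, including the halo columns, so each
    # corner copies an E/W ghost that pass 1 already rebuilt.
    for i in range(0, nx + 2):
        h[0][i] = h[1][i]            # south ghost <- first interior row
        h[ny + 1][i] = h[ny][i]      # north ghost <- last interior row
    return h


STALE = -1.0                         # sentinel for an unreconstructed ghost
nx, ny = 3, 2
grid = [[STALE] * (nx + 2) for _ in range(ny + 2)]
for j in range(1, ny + 1):
    for i in range(1, nx + 1):
        grid[j][i] = 10.0 * j + i    # distinct interior thicknesses

two_pass_fill(grid, nx, ny)
print(grid[0][0] == grid[1][1])      # SW corner traces back to the interior
```

Running the N/S pass second is what lets the corner cells inherit interior values; reversing the order would copy a halo value that had not yet been rebuilt.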

Verified on a regional CESM G-compset case (ERS_Ld9, 4 Flather/Orlanski/
nudged OBC segments): COMPARE_base_rest now passes bit-for-bit, and the
physical/active diagnostic fields are identical across the continuous and
restarted legs.
…ient fill

Follow-up bisection on the ERS_Ld9 regional case showed the original
two-pass, corner-tracing reconstruction was more than required:

- vertvisc_coef reads only same-row (u-point) / same-column (v-point)
  exterior neighbours, never the diagonal corner, so corner cells can be
  left stale and the case still restarts bit-for-bit. Verified with a
  frozen-copy variant that deliberately staled the corners (PASS).
- The leak is independent of HARMONIC_VISC (both the harmonic and
  arithmetic averaging branches diverge), so it is in code shared by both
  paths, not the branch-specific averaging.
- Both E/W and N/S boundaries contribute; reconstructing only E/W halves
  the divergence but does not remove it.

Replace the two sequential passes with a single in-place loop over all
segments (copy the first interior thickness into the exterior ghost
cell). ERS_Ld9 with 4 Flather/Orlanski/nudged OBC segments passes
COMPARE_base_rest bit-for-bit.
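
As a rough illustration of the single-loop form, the Python sketch below copies the first interior thickness into the exterior ghost cell for each segment and leaves the corners untouched. The segment tuple layout (direction, interior index, start, stop), the function name, and the STALE sentinel are all hypothetical and do not mirror MOM6's OBC segment type.

```python
def fill_exterior_h(h, segments):
    """Copy the first interior thickness into each segment's ghost cells."""
    for direction, index, start, stop in segments:
        for k in range(start, stop + 1):
            if direction == "E":
                h[k][index + 1] = h[k][index]
            elif direction == "W":
                h[k][index - 1] = h[k][index]
            elif direction == "N":
                h[index + 1][k] = h[index][k]
            elif direction == "S":
                h[index - 1][k] = h[index][k]
            else:
                raise ValueError(f"unknown segment direction: {direction}")
    return h


STALE = -1.0                              # sentinel for an untouched ghost
n = 3                                     # 3x3 interior with a one-cell halo
grid = [[STALE] * (n + 2) for _ in range(n + 2)]
for j in range(1, n + 1):
    for i in range(1, n + 1):
        grid[j][i] = 10.0 * j + i         # distinct interior thicknesses

# Four segments, one per side, each spanning interior rows/columns 1..n.
segments = [("W", 1, 1, n), ("E", n, 1, n), ("S", 1, 1, n), ("N", n, 1, n)]
fill_exterior_h(grid, segments)

print(grid[2][0] == grid[2][1])           # west ghost matches the interior
print(grid[0][0] == STALE)                # corner is deliberately left stale
```

Since every copy reads only interior cells, the result does not depend on the order in which segments are visited, which is why a single pass suffices once the corners are known to be unused.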
@manishvenu manishvenu requested a review from alperaltuntas June 3, 2026 21:54
@manishvenu

Collaborator Author

Following up: this issue might be closely related to ACCESS-NRI#54!!
