OBC Restarts Failing Fix #437
Draft
manishvenu wants to merge 2 commits into
Across an exact restart, the OBC-exterior h halo (the ghost ring just outside open-boundary segments) is not reconstructed bit-for-bit. The interior thickness restarts perfectly, but the exterior ghost cells come back slightly different (~cm-scale, not roundoff). vertvisc_coef projects thickness outward across OBCs using a zero-gradient condition and reads this exterior h/dz; a stale/inconsistent exterior value perturbs the boundary-adjacent coupling coefficient, so vertvisc returns a last-bit-different velocity that grows over the run and breaks the CIME ERS exact-restart test (COMPARE_base_rest). The same configuration with OBC_NUMBER_OF_SEGMENTS=0 restarts bit-for-bit.

Rebuild the exterior h with a zero-gradient copy from the first interior column/row at the top of step_MOM_dyn_split_RK2, right after update_OBC_ramp. Two passes (E/W exterior columns first, then N/S exterior rows) so diagonal corner cells where two segments meet trace back to the bit-perfect interior h rather than a stale exterior cell.

Verified on a regional CESM G-compset case (ERS_Ld9, 4 Flather/Orlanski/nudged OBC segments): COMPARE_base_rest now passes bit-for-bit, and the physical/active diagnostic fields are identical across the continuous and restarted legs.
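The two-pass fill described in this commit message can be sketched on a toy 2-D array. This is an illustrative Python sketch, not the MOM6 Fortran: the list-of-lists layout, the one-cell ghost ring, the choice to treat all four sides as open boundaries, and the name `fill_exterior_two_pass` are assumptions made for the example.

```python
def fill_exterior_two_pass(h):
    """Zero-gradient fill of a one-cell ghost ring around a 2-D grid.

    h[j][i] is indexed by row j and column i; rows/columns 1..n-2 are
    interior. Hypothetical toy layout: every side is an open boundary.
    """
    nj, ni = len(h), len(h[0])
    # Pass 1: E/W exterior columns copy the first interior column
    # (interior rows only).
    for j in range(1, nj - 1):
        h[j][0] = h[j][1]
        h[j][ni - 1] = h[j][ni - 2]
    # Pass 2: N/S exterior rows copy the first interior row over ALL
    # columns, so each corner inherits a value rebuilt in pass 1.
    for i in range(ni):
        h[0][i] = h[1][i]
        h[nj - 1][i] = h[nj - 2][i]
    return h


# Interior values stand in for the bit-perfect restart; the ghost ring
# starts out holding a stale placeholder (-1.0).
h = [[-1.0] * 4 for _ in range(4)]
h[1][1], h[1][2] = 10.0, 11.0
h[2][1], h[2][2] = 12.0, 13.0
fill_exterior_two_pass(h)
```

Because the second pass runs over every column, including the two exterior ones, each corner ends up holding the value of its diagonally adjacent interior cell and no stale value survives.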
…ient fill

Follow-up bisection on the ERS_Ld9 regional case showed the original two-pass, corner-tracing reconstruction was more than required:

- vertvisc_coef reads only same-row (u-point) / same-column (v-point) exterior neighbours, never the diagonal corner, so corner cells can be left stale and the case still restarts bit-for-bit. Verified with a frozen-copy variant that deliberately staled the corners (PASS).
- The leak is independent of HARMONIC_VISC (both the harmonic and arithmetic averaging branches diverge), so it is in code shared by both paths, not the branch-specific averaging.
- Both E/W and N/S boundaries contribute; reconstructing only E/W halves the divergence but does not remove it.

Replace the two sequential passes with a single in-place loop over all segments (copy the first interior thickness into the exterior ghost cell). ERS_Ld9 with 4 Flather/Orlanski/nudged OBC segments passes COMPARE_base_rest bit-for-bit.
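The corner-independence claim above can be illustrated with a toy stencil. This is a hedged Python sketch, not the actual vertvisc_coef code: the two-cell average, the 4x4 grid, and the helper names are assumptions. It only shows that a u-point quantity built from same-row neighbours cannot see the diagonal corner ghost cells.

```python
def u_point_thickness(h, j, i):
    """Average thickness at the u-point between cells (j, i) and (j, i+1)."""
    return 0.5 * (h[j][i] + h[j][i + 1])


def east_boundary_u_thickness(h):
    """u-point thickness along the eastern boundary, interior rows only."""
    ni = len(h[0])
    return [u_point_thickness(h, j, ni - 2) for j in range(1, len(h) - 1)]


def make_grid(corner_value):
    # 4x4 toy grid: 2x2 interior surrounded by a one-cell ghost ring.
    h = [[0.0] * 4 for _ in range(4)]
    h[1][1], h[1][2] = 10.0, 11.0
    h[2][1], h[2][2] = 12.0, 13.0
    # Same-row ghost cells on the east side are zero-gradient copies.
    h[1][3], h[2][3] = h[1][2], h[2][2]
    # Diagonal corner ghost cells are set to an arbitrary value.
    h[0][3] = h[3][3] = corner_value
    return h


fresh = east_boundary_u_thickness(make_grid(corner_value=11.0))
stale = east_boundary_u_thickness(make_grid(corner_value=-999.0))
```

Here `fresh` and `stale` are identical lists, which mirrors the frozen-copy experiment: staling the corners does not change anything a same-row stencil computes.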
Following up, this issue might be closely related: ACCESS-NRI#54
Details:
When OBCs are on, the restart tests fail (tests here: ESCOMP/MOM_interface#315). When OBCs are off (OBC_NUMBER_OF_SEGMENTS is zero), they pass.
Reproduce:
I'm using a hodgepodge of branches and checkouts required to run regional MOM6 inside CESM, most of which is captured here: https://github.com/CROCODILE-CESM/CESM/tree/full_regional_cesm
Run a simple ERS test:
Root Cause:
With Claude's help and a lot of iteration, the issue seems to be that across an exact restart, the OBC-exterior h halo (the ghost ring just outside open-boundary segments) is not reconstructed bit-for-bit. The interior thickness restarts perfectly, but the exterior ghost cells come back different: the restart file stores only the computational domain, and the existing thickness-reservoir restart path (h_res_x/h_res_y, gated by use_h_res) does not restore the value the friction solve needs.
That stale ghost h is consumed by vertvisc's DIRECT_STRESS body force (MOM_vert_friction.F90, h_a = 0.5*(h(i,j,k)+h(i+1,j,k)), around L698 for u and L918 for v), which spreads wind stress over the top HMIX_STRESS of fluid and reads the ghost cell with no OBC projection. This perturbs the surface velocity and breaks the restart. (vertvisc_coef also reads the ghost cell but already projects it away with a zero-gradient condition, so its coefficients stay bit-perfect; the .z. remapped diagnostics read it as well.)
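To make the mechanism concrete, here is a minimal Python sketch of the quoted average, with and without a zero-gradient projection. The function names and the thickness values are invented for illustration (the 0.01 offset stands in for a cm-scale ghost-cell difference); only the expression h_a = 0.5*(h(i,j,k)+h(i+1,j,k)) comes from the description above.

```python
def h_a_raw(h_interior, h_ghost):
    # Mirrors the quoted expression h_a = 0.5*(h(i,j,k) + h(i+1,j,k)),
    # reading the exterior ghost cell exactly as stored.
    return 0.5 * (h_interior + h_ghost)


def h_a_projected(h_interior, h_ghost):
    # Zero-gradient projection: the stored ghost value is ignored and the
    # interior thickness is reused in its place.
    return 0.5 * (h_interior + h_interior)


# Made-up numbers: the interior restarts exactly, the ghost cell does not.
h_interior = 2.0
ghost_continuous = 2.0
ghost_restarted = 2.0 + 0.01

raw_differs = (
    h_a_raw(h_interior, ghost_continuous)
    != h_a_raw(h_interior, ghost_restarted)
)
projected_matches = (
    h_a_projected(h_interior, ghost_continuous)
    == h_a_projected(h_interior, ghost_restarted)
)
```

In this toy setup the raw average differs between the two legs while the projected one does not, which is the distinction drawn between the DIRECT_STRESS path and vertvisc_coef.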
The fix:
There are three options, the first of which is a workaround.