Skip to content

Fix bug in restarts when using Runge-Kutta time integration#172

Merged
trhille merged 1 commit into
MALI-Dev:developfrom
trhille:mali/fix_rk_restarts
Apr 7, 2026
Merged

Fix bug in restarts when using Runge-Kutta time integration#172
trhille merged 1 commit into
MALI-Dev:developfrom
trhille:mali/fix_rk_restarts

Conversation

@trhille
Copy link
Copy Markdown

@trhille trhille commented Apr 6, 2026

Fix restarts when using Runge-Kutta time integration. Recent changes introduced in MALI-Dev PR #166 broke bit-for-bit restarts when using RK integration. These changes restore BFB restarts with RK.

@trhille
Copy link
Copy Markdown
Author

trhille commented Apr 6, 2026

Testing

Using fct_integration suite with RK2: MPAS-Dev/compass#650 at commit a49657f.
When MALI is checked out at 3909877 (merge commit before #166), all restart tests pass:

Test Runtimes:
00:42 PASS landice_dome_2000m_fo_fct_decomposition_test
00:23 PASS landice_dome_2000m_fo_fct_restart_test
00:14 FAIL landice_dome_variable_resolution_fo_fct_decomposition_test
00:14 PASS landice_dome_variable_resolution_fo_fct_restart_test
01:13 PASS landice_greenland_fo_fct_decomposition_test
01:14 PASS landice_greenland_fo_fct_restart_test
00:30 FAIL landice_thwaites_fct_fo_decomposition_test
00:36 PASS landice_thwaites_fct_fo_restart_test
01:18 PASS landice_humboldt_mesh-3km_restart_test_velo-fo_advec-fct_calving-von_mises_stress_damage-threshold_faceMelting
Total runtime 06:49

(The failed decomposition tests are actually fine, we just need to adjust norm tolerances slightly).

When checked out at f5e9f8b (merge commit for #166), many but not all restart tests fail:

Test Runtimes:
00:38 PASS landice_dome_2000m_fo_fct_decomposition_test
00:34 FAIL landice_dome_2000m_fo_fct_restart_test
00:16 PASS landice_dome_variable_resolution_fo_fct_decomposition_test
00:17 FAIL landice_dome_variable_resolution_fo_fct_restart_test
01:18 PASS landice_greenland_fo_fct_decomposition_test
01:10 FAIL landice_greenland_fo_fct_restart_test
00:27 FAIL landice_thwaites_fct_fo_decomposition_test
00:39 PASS landice_thwaites_fct_fo_restart_test
01:12 PASS landice_humboldt_mesh-3km_restart_test_velo-fo_advec-fct_calving-von_mises_stress_damage-threshold_faceMelting
Total runtime 07:03

When using these changes, all restart tests pass:

Test Runtimes:
00:42 PASS landice_dome_2000m_fo_fct_decomposition_test
00:31 PASS landice_dome_2000m_fo_fct_restart_test
00:16 PASS landice_dome_variable_resolution_fo_fct_decomposition_test
00:15 PASS landice_dome_variable_resolution_fo_fct_restart_test
01:17 PASS landice_greenland_fo_fct_decomposition_test
01:06 PASS landice_greenland_fo_fct_restart_test
00:26 FAIL landice_thwaites_fct_fo_decomposition_test
00:37 PASS landice_thwaites_fct_fo_restart_test
01:12 PASS landice_humboldt_mesh-3km_restart_test_velo-fo_advec-fct_calving-von_mises_stress_damage-threshold_faceMelting
Total runtime 06:53

Copy link
Copy Markdown

@matthewhoffman matthewhoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trhille , thanks for hunting this down. I agree this is quite confusing to think through, but I think I follow it. I believe it relates to this code, which was added to prevent the Albany mesh from being generated unnecessarily:
https://github.com/MALI-Dev/E3SM/blob/develop/components/mpas-albany-landice/src/mode_forward/mpas_li_velocity_external.F#L541-L552
Maybe it's worth adding a comment where the new logic is added pointing to this block of code as being relevant.

If you want to look into this further, you can try temporarily changing the default in Registry for config_always_compute_fem_grid to .true. and the rerunning the test suite without this branch. I think it should then work.

@matthewhoffman matthewhoffman added the bug Something isn't working label Apr 6, 2026
@mperego
Copy link
Copy Markdown

mperego commented Apr 6, 2026

@matthewhoffman, that makes sense. Basically the fix is ensuring that anyDynamicVertexMaskChanged and dirichletMaskChanged are false. I guess that, in a restart run, vertexMaskOld would be set to zero in the first step, so without the fix, anyDynamicVertexMaskChanged would be true during the RK stages, and Albany would recompute the mesh.
However, even if the mesh is recomputed, it should be exactly the same mesh as before, because the masks in the RK algorithms are set to the previous/initial value. So, I'm worried that there still are some inconsistencies, either in the MPAS/RK code (e.g. not all the masks are properly reset) or in the MPAS-Albany interface.
Also, this fix is a bit ugly, and it might create unintended consequences. Maybe the check on whether to recompute the Albany mesh should be done upstream, in li_velocity_solve, so that we can avoid recomputing the mesh when updateMask is false, and then we pass a flag to li_velocity_external_solve indicating whether the mesh should be recomputed or not.

@trhille
Copy link
Copy Markdown
Author

trhille commented Apr 6, 2026

@matthewhoffman @mperego, what is confusing is that config_always_compute_fem_grid defaults to .true., and I have not changed this for any of my testing. vertexMaskOld seems to only be used to calculate anyDynamicVertexMaskChanged, which is used in this block for determining whether to calculate the fem grid: https://github.com/MALI-Dev/E3SM/blob/develop/components/mpas-albany-landice/src/mode_forward/mpas_li_velocity_external.F#L541-L552

But the value of anyDynamicVertexMaskChanged is irrelevant when config_always_compute_fem_grid is .true., as it is here.

I ran a test with this line commented out to test whether the relevant change is in fact the reset of dirichletVelocityMaskOld: https://github.com/MALI-Dev/E3SM/pull/172/changes#diff-485e1dd5a3e13599de546e61aa267f12e9aff2a23dca30ae09ce02ddba22cbc5R558. The restart tests fail when I do that. So it really is the resetting of vertexMaskOld that is having an impact here. I just can't understand why.

…MaskPrev

Remove inadvertent misalignment between the final mask update before
the RK loop and setting the baseline values for the masks.
@trhille trhille force-pushed the mali/fix_rk_restarts branch from 1ed7857 to 759b5e2 Compare April 6, 2026 19:42
@trhille
Copy link
Copy Markdown
Author

trhille commented Apr 6, 2026

@mperego @matthewhoffman, Okay I figured out the root cause and the fix is much simpler than what co-pilot was doing. I'm force-pushing because that old commit is no longer needed.

Restarts are now BFB with this change:

Test Runtimes:
00:27 PASS landice_dome_2000m_fo_fct_decomposition_test
00:22 PASS landice_dome_2000m_fo_fct_restart_test
00:13 FAIL landice_dome_variable_resolution_fo_fct_decomposition_test
00:13 PASS landice_dome_variable_resolution_fo_fct_restart_test
01:22 PASS landice_greenland_fo_fct_decomposition_test
01:07 PASS landice_greenland_fo_fct_restart_test
00:24 FAIL landice_thwaites_fct_fo_decomposition_test
00:36 PASS landice_thwaites_fct_fo_restart_test
00:41 PASS landice_humboldt_mesh-3km_restart_test_velo-fo_advec-fct_calving-von_mises_stress_damage-threshold_faceMelting
Total runtime 05:41
FAIL: 2 tests failed, see above.

@trhille trhille requested a review from matthewhoffman April 6, 2026 19:48
@mperego
Copy link
Copy Markdown

mperego commented Apr 6, 2026

@trhille This looks more reasonable. I don't understand the logic and why the CoPilot fis was working, but I can see that delaying updating the masks could be a issue on restart.

@matthewhoffman
Copy link
Copy Markdown

matthewhoffman commented Apr 6, 2026

@trhille , is this the line that is affected by moving the mask call earlier?

image

I don't see anything else that would be affected.

@trhille
Copy link
Copy Markdown
Author

trhille commented Apr 6, 2026

@matthewhoffman edgeMask and vertexMask are also updated by li_calculate_mask. Based on the results of the previous fix suggested by co-pilot, the issue was probably that vertexMask is not generally initialized after a restart by this point in the code. (This perhaps is depending on which physics are active; if face-melting is on then I think we will already have an updated vertexMask by this point, which could explain why not all restart tests were failing in fct_integration.) So previously we were setting vertexMaskPrev based on garbage values in vertexMask, repeatedly overwriting vertexMask with the bad values in vertexMaskPrev, and then passing those to the fem grid calculation.

@trhille
Copy link
Copy Markdown
Author

trhille commented Apr 6, 2026

With these changes, all full_integration tests pass validation and baseline comparison.

Copy link
Copy Markdown

@matthewhoffman matthewhoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trhille , thanks for your diligence in getting to the bottom of this one. This solution is simple and makes sense.

@trhille trhille merged commit fab1754 into MALI-Dev:develop Apr 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants