Replace RestartMember with InPlaceRestartMember for in-place SGH ensemble restarts #1
Draft
Copilot wants to merge 3 commits into
… restarts Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/0e493fcc-c31f-4557-9865-eca1c04394c4 Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Replace restart member step with in-place restart" to "Replace RestartMember with InPlaceRestartMember for in-place SGH ensemble restarts" on Mar 26, 2026
cefe523 to 291d679
Copilot AI added a commit that referenced this pull request on Apr 1, 2026
Bug #1: config.get('restart_ensemble', {}) crashes because MpasConfigParser.get() expects (section, option) positional args, not a dict fallback.
Fixed: config['restart_ensemble'] returns a SectionProxy with proper .get()/.getint()/.getfloat()/.getboolean() methods.

Bug #2: _should_restart_run() looked for per-run analysis_results.json files that are never written. AnalysisStep writes analysis_summary.json to its own work dir containing an individual_results dict for all runs.
Fixed: add analysis_summary_file config option; configure() loads the file and passes per-run dicts to _should_restart_run() via a new run_results param. RestartScheduler.create_config_file() now includes analysis_summary_file in generated configs.

Bug #3: restart_attempt_N/ tracking dirs were never created by InPlaceRestartMember.setup(), so max_consecutive_restarts was effectively disabled and all attempt counters read 0.
Fixed: setup() now creates restart_attempt_N/ dirs using a single os.listdir() call to find the highest existing attempt number.

Bug #5: restart_scheduler.py docstring Examples section referenced a non-existent module path.
Fixed to the correct path.

Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/ca2d29bf-1246-415c-bf2c-9de7521fa55f
Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>
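The Bug #3 fix above mentions finding the highest existing attempt number with a single os.listdir() call. The following is a minimal sketch of that idea, not the code from the commit. The function names, the choice to start numbering at 1, and the assumption that the restart_attempt_N/ directories sit directly inside the directory passed in are all made up for illustration.

```python
import os
import re

# Assumed naming pattern for the tracking directories (restart_attempt_N).
_ATTEMPT_RE = re.compile(r'^restart_attempt_(\d+)$')


def next_restart_attempt(run_dir):
    """Return the attempt number to use for the next restart in run_dir."""
    highest = 0
    # One os.listdir() call; everything else is string matching on names.
    for name in os.listdir(run_dir):
        match = _ATTEMPT_RE.match(name)
        if match and os.path.isdir(os.path.join(run_dir, name)):
            highest = max(highest, int(match.group(1)))
    return highest + 1


def create_attempt_dir(run_dir):
    """Create the next restart_attempt_N/ directory and return N."""
    attempt = next_restart_attempt(run_dir)
    os.mkdir(os.path.join(run_dir, f'restart_attempt_{attempt}'))
    return attempt
```

With a counter like this, a caller enforcing max_consecutive_restarts could compare `next_restart_attempt(run_dir) - 1` against the limit before creating another directory.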
This was referenced Apr 1, 2026
Restarts for incomplete SGH ensemble runs now happen in the original run directory instead of copying files into `restart_attempt_N/` subdirectories.

Changes

`restart_member.py`: Full replacement
- `RestartMember` (file copies, subdirectory creation, `run_model` call) is replaced by `InPlaceRestartMember` with:
  - `__init__()` overrides `self.work_dir` to point at `spinup_work_dir/runXXX`; `EnsembleManager` then calls `sbatch job_script.sh` from the correct location automatically
  - `setup()` edits `namelist.landice` in-place via `_set_restart_in_namelist()`, flipping `config_do_restart` to `.true.`
  - No `run()` method; job submission is delegated entirely to `EnsembleManager`
  - `_set_restart_in_namelist(namelist_path)` helper (find-and-replace on the `config_do_restart` line; appends if absent)

`test_case.py`: Minimal update
- `RestartMember` → `InPlaceRestartMember`
- `_should_restart_run` logic, config reading, and skipped-run reporting unchanged

`__init__.py`: No changes needed

Checklist
- `api.rst`) has any new or modified class, method and/or functions listed
- `E3SM-Project` submodule has been updated with relevant E3SM changes
- `MALI-Dev` submodule has been updated with relevant MALI changes
- `Testing` in this PR) any testing that was used to verify the changes

Original prompt
Background
The current `sgh_restart_ensemble` workflow (in `compass/landice/tests/ensemble_generator/sgh_restart_ensemble/`) restarts incomplete runs by copying files into new subdirectories (`restart_attempt_N/`) inside a separate restart work directory. The desired behavior is simpler: restart each flagged run in its original run directory by:
1. Setting `config_do_restart = .true.` in the existing `namelist.landice` in-place
2. Submitting `job_script.sh` from that same original directory via `sbatch`

No files should be copied. No new subdirectories should be created.
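To make the second step concrete, here is a small sketch of submitting an existing job script from inside its own run directory. According to this PR the real submission is done by `EnsembleManager`, so this is only a stand-in for what that amounts to. The function name `submit_in_place` and its `runner` parameter are hypothetical; `runner` exists so the call can be exercised without Slurm installed.

```python
import os
import subprocess


def submit_in_place(run_dir, script='job_script.sh', runner=subprocess.run):
    """Submit an existing job script from inside its own run directory.

    `runner` is a hypothetical injection point (defaults to subprocess.run).
    """
    if not os.path.isfile(os.path.join(run_dir, script)):
        raise FileNotFoundError(f'{script} not found in {run_dir}')
    # cwd=run_dir makes sbatch resolve the script, and any relative paths
    # used inside it, against the original run directory. Nothing is copied.
    return runner(['sbatch', script], cwd=run_dir, check=True)
```

Passing `cwd` rather than calling `os.chdir()` keeps the caller's working directory untouched, which matters when many runs are submitted in a loop.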
Files to change
All files are on the `sgh_ensemble_generator_template` branch under: `compass/landice/tests/ensemble_generator/sgh_restart_ensemble/`

1. `restart_member.py`: Replace entirely
The current `RestartMember` step copies files into `restart_attempt_N/` subdirectories. Replace the entire file with a new `InPlaceRestartMember` step that:
- Takes `run_num` and `spinup_work_dir` as before
- In `setup()`, locates the original run directory at `os.path.join(spinup_work_dir, f'run{run_num:03}')`
- Edits `namelist.landice` in-place in that original directory, changing `config_do_restart` to `.true.` (the line currently reads `config_do_restart = .false.` or may already be `.true.`)
- Has no `run()` method that calls `run_model`; job submission is handled by `EnsembleManager` (via `sbatch job_script.sh`)

The class should be named `InPlaceRestartMember` (keep the import in `test_case.py` updated accordingly).
Here is the helper function that should be used to edit `namelist.landice` in-place (find-and-replace the `config_do_restart` line):

2. `test_case.py`: Update to use in-place restart
The `RestartEnsemble.configure()` method currently creates `RestartMember` steps (which set up new subdirectories). Change it to create `InPlaceRestartMember` steps instead.
Key changes:
- `from .restart_member import RestartMember` → `from .restart_member import InPlaceRestartMember`
- In `configure()`, when a run is flagged for restart, call `InPlaceRestartMember(...)` in place of `RestartMember(...)`.

Everything else in `test_case.py` (the `_should_restart_run` logic, the config reading, the skipped runs reporting) stays the same.

3. `__init__.py`: Update exports
Change the import from:
to:
(No change needed here unless the class name is referenced; just make sure exports still work.)
Important context
`EnsembleManager.run()` (in `ensemble_manager.py`, not in this package) already handles iterating over all steps and calling `sbatch job_script.sh` in each step's `work_dir`. For in-place restarts, the `InPlaceRestartMember` step's `work_dir` should point to the original run directory (i.e., `spinup_work_dir/runXXX`), so that `EnsembleManager` submits the job from the right place.
To make this work, override `work_dir` in `InPlaceRestartMember.__init__()` after calling `super().__init__()`:
This is the key mechanism that makes resubmission happen in-place.
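The override can be sketched as follows. The `Step` class here is a hypothetical stand-in for the framework's base class, included only so the example runs on its own; its constructor signature, its default `work_dir`, and the step name `runXXX` are assumptions, not the real compass API. The sketch shows only the assignment order and assumes nothing reassigns `work_dir` afterwards.

```python
import os


class Step:
    """Hypothetical stand-in for the framework's step base class."""

    def __init__(self, test_case, name):
        self.test_case = test_case
        self.name = name
        # Placeholder default; the real framework chooses its own location.
        self.work_dir = os.path.join('restart_work_dir', name)


class InPlaceRestartMember(Step):
    """Step whose work_dir is the original run directory."""

    def __init__(self, test_case, run_num, spinup_work_dir):
        self.run_num = run_num
        super().__init__(test_case, name=f'run{run_num:03}')
        # Assign after super().__init__() so the base-class default
        # does not overwrite the in-place location.
        self.work_dir = os.path.join(spinup_work_dir, f'run{run_num:03}')
```

For example, `InPlaceRestartMember(None, 7, '/scratch/spinup').work_dir` ends in `run007` under `/scratch/spinup`, so anything that submits from `step.work_dir` would act on the original run.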
The `sgh_ensemble_analysis` step (in `analysis_step.py`) writes `analysis_summary.json` to its `work_dir`. The `restart_needed_runs` list in that JSON contains run numbers (integers).
`RestartEnsemble.configure()` reads `spinup_work_dir` from config and scans `spinup_work_dir/run*` directories; this logic is correct and should be preserved.
The `restart_scheduler.py` file does not need any changes.

The following is the prior conversation context from the user's chat exploration (may be truncated):
User: Within this branch: @alexolinhager/compass/files/compass/landice/tests/ens...
This pull request was created from Copilot chat.