Replace RestartMember with InPlaceRestartMember for in-place SGH ensemble restarts #1

Draft
Copilot wants to merge 3 commits into sgh_ensemble_generator_template from copilot/replace-restart-member-to-inplace

Copilot AI commented Mar 26, 2026

Restarts for incomplete SGH ensemble runs now happen in the original run directory instead of copying files into restart_attempt_N/ subdirectories.

Changes

restart_member.py — Full replacement

  • Removes RestartMember (file copies, subdirectory creation, run_model call)
  • Adds InPlaceRestartMember with:
    • __init__() overrides self.work_dir to point at spinup_work_dir/runXXX; EnsembleManager then calls sbatch job_script.sh from the correct location automatically
    • setup() edits namelist.landice in-place via _set_restart_in_namelist(), flipping config_do_restart to .true.
    • No run() method — job submission delegated entirely to EnsembleManager
  • Adds _set_restart_in_namelist(namelist_path) helper (find-and-replace on config_do_restart line; appends if absent)

test_case.py — Minimal update

  • Import and instantiation updated: RestartMember → InPlaceRestartMember
  • All _should_restart_run logic, config reading, and skipped-run reporting unchanged

__init__.py — No changes needed

# Before: copies files into restart_attempt_N/, calls run_model()
self.add_step(RestartMember(test_case=self, run_num=run_num, spinup_work_dir=spinup_work_dir))

# After: edits namelist in-place, work_dir points to original run dir
self.add_step(InPlaceRestartMember(test_case=self, run_num=run_num, spinup_work_dir=spinup_work_dir))

Checklist

  • User's Guide has been updated
  • Developer's Guide has been updated
  • API documentation in the Developer's Guide (api.rst) has any new or modified class, method and/or functions listed
  • Documentation has been built locally and changes look as expected
  • The E3SM-Project submodule has been updated with relevant E3SM changes
  • The MALI-Dev submodule has been updated with relevant MALI changes
  • Document (in a comment titled Testing in this PR) any testing that was used to verify the changes
  • New tests have been added to a test suite
Original prompt

Background

The current sgh_restart_ensemble workflow (in compass/landice/tests/ensemble_generator/sgh_restart_ensemble/) restarts incomplete runs by copying files into new subdirectories (restart_attempt_N/) inside a separate restart work directory. The desired behavior is simpler: restart each flagged run in its original run directory by:

  1. Setting config_do_restart = .true. in the existing namelist.landice in-place
  2. Resubmitting job_script.sh from that same original directory via sbatch

No files should be copied. No new subdirectories should be created.
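The resubmission half of that behavior can be sketched as follows. This is only an illustration under stated assumptions: the function names build_resubmit_command and resubmit are hypothetical, and in the actual workflow the sbatch call is made by EnsembleManager rather than by a helper like this. The dry_run flag is also an assumption, added so the sketch can be exercised without a Slurm installation.

```python
import os
import subprocess


def build_resubmit_command(spinup_work_dir, run_num):
    """Return (command, cwd) for resubmitting a run from its original directory.

    Hypothetical helper: the runNNN naming follows the f'run{run_num:03}'
    pattern used elsewhere in this PR.
    """
    run_dir = os.path.join(spinup_work_dir, f'run{run_num:03}')
    return ['sbatch', 'job_script.sh'], run_dir


def resubmit(spinup_work_dir, run_num, dry_run=True):
    """Resubmit job_script.sh in place; with dry_run=True nothing is submitted."""
    command, run_dir = build_resubmit_command(spinup_work_dir, run_num)
    if dry_run:
        # Only report what would be submitted and from where.
        return command, run_dir
    # cwd makes sbatch start in the original run directory, so no files
    # are copied and no new subdirectory is needed.
    subprocess.run(command, cwd=run_dir, check=True)
    return command, run_dir
```

Calling resubmit('/tmp/spinup', 7) in dry-run mode returns the sbatch command together with the run007 directory it would be launched from.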


Files to change

All files are on the sgh_ensemble_generator_template branch under:
compass/landice/tests/ensemble_generator/sgh_restart_ensemble/

1. restart_member.py — Replace entirely

The current RestartMember step copies files into restart_attempt_N/ subdirectories. Replace the entire file with a new InPlaceRestartMember step that:

  • Takes run_num and spinup_work_dir as before
  • In setup():
    • Resolves the original run directory: os.path.join(spinup_work_dir, f'run{run_num:03}')
    • Edits namelist.landice in-place in that original directory, changing config_do_restart to .true. (the line currently reads config_do_restart = .false. or may already be .true.)
    • Does NOT copy any files, does NOT create any new subdirectories
  • Has no run() method that calls run_model — job submission is handled by EnsembleManager (via sbatch job_script.sh)

The class should be named InPlaceRestartMember (keep the import in test_case.py updated accordingly).

Here is the helper function that should be used to edit namelist.landice in-place (find-and-replace the config_do_restart line):

def _set_restart_in_namelist(namelist_path):
    """Set config_do_restart = .true. in-place in namelist.landice."""
    with open(namelist_path, 'r') as f:
        lines = f.readlines()

    updated = False
    new_lines = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith('config_do_restart'):
            # Replace whatever value it has with .true.
            new_lines.append(
                line[:line.index('config_do_restart')]
                + 'config_do_restart = .true.\n'
            )
            updated = True
        else:
            new_lines.append(line)

    if not updated:
        # Append if not present
        new_lines.append('\nconfig_do_restart = .true.\n')

    with open(namelist_path, 'w') as f:
        f.writelines(new_lines)
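
A quick way to check what the helper does to a file is to run it against a temporary namelist. The sketch below repeats the helper in condensed form so that it runs on its own; the group name &example_group and the demo() function are hypothetical and only serve the illustration. It also calls the helper twice to show that a second pass leaves the file unchanged.

```python
import os
import tempfile


def _set_restart_in_namelist(namelist_path):
    """Set config_do_restart = .true. in-place (condensed copy of the helper)."""
    with open(namelist_path, 'r') as f:
        lines = f.readlines()

    updated = False
    new_lines = []
    for line in lines:
        if line.strip().startswith('config_do_restart'):
            # Keep the leading indentation, replace the rest of the line.
            indent = line[:line.index('config_do_restart')]
            new_lines.append(indent + 'config_do_restart = .true.\n')
            updated = True
        else:
            new_lines.append(line)

    if not updated:
        new_lines.append('\nconfig_do_restart = .true.\n')

    with open(namelist_path, 'w') as f:
        f.writelines(new_lines)


def demo():
    """Write a tiny namelist, apply the helper twice, return the file text."""
    original = (
        '&example_group\n'  # hypothetical group name, for illustration only
        '    config_do_restart = .false.\n'
        '/\n'
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'namelist.landice')
        with open(path, 'w') as f:
            f.write(original)
        _set_restart_in_namelist(path)
        _set_restart_in_namelist(path)  # second call should change nothing
        with open(path) as f:
            return f.read()
```

Note that the replacement rewrites the whole line after the indentation, so anything that followed the value on that line (for example a trailing comment) is dropped.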

2. test_case.py — Update to use in-place restart

The RestartEnsemble.configure() method currently creates RestartMember steps (which set up new subdirectories). Change it to create InPlaceRestartMember steps instead.

Key changes:

  • Change the import: from .restart_member import RestartMember → from .restart_member import InPlaceRestartMember
  • In configure(), when a run is flagged for restart, call:
    self.add_step(InPlaceRestartMember(
        test_case=self,
        run_num=run_num,
        spinup_work_dir=spinup_work_dir
    ))
    instead of RestartMember(...).

Everything else in test_case.py — the _should_restart_run logic, the config reading, the skipped runs reporting — stays the same.

3. __init__.py — Update exports

Change the import from:

from .restart_scheduler import RestartScheduler
from .test_case import RestartEnsemble

to:

from .restart_scheduler import RestartScheduler
from .test_case import RestartEnsemble

(No change needed here unless the class name is referenced — just make sure exports still work.)


Important context

  • The EnsembleManager.run() (in ensemble_manager.py, not in this package) already handles iterating over all steps and calling sbatch job_script.sh in each step's work_dir. For in-place restarts, the InPlaceRestartMember step's work_dir should point to the original run directory (i.e., spinup_work_dir/runXXX), so that EnsembleManager submits the job from the right place.

    To make this work, override work_dir in InPlaceRestartMember.__init__() after calling super().__init__():

    # Override work_dir to point to the original run directory
    self.work_dir = os.path.join(spinup_work_dir, f'run{run_num:03}')

    This is the key mechanism that makes resubmission happen in-place.

  • The sgh_ensemble_analysis step (in analysis_step.py) writes analysis_summary.json to its work_dir. The restart_needed_runs list in that JSON contains run numbers (integers). RestartEnsemble.configure() reads spinup_work_dir from config and scans spinup_work_dir/run* directories — this logic is correct and should be preserved.

  • The restart_scheduler.py file does not need any changes.
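
The work_dir override described in the first bullet can be sketched as below. The Step class here is a hypothetical stand-in so the example runs without compass installed; the real base class has a different constructor and more attributes, and whether the framework later reassigns work_dir is not something this sketch can confirm.

```python
import os


class Step:
    """Hypothetical stand-in for the compass Step base class."""

    def __init__(self, test_case, name):
        self.test_case = test_case
        self.name = name
        # Placeholder: the real framework decides the default work_dir.
        self.work_dir = None


class InPlaceRestartMember(Step):
    """Restart step whose work_dir is the original run directory."""

    def __init__(self, test_case, run_num, spinup_work_dir):
        self.run_num = run_num
        super().__init__(test_case=test_case, name=f'run{run_num:03}')
        # Override work_dir to point to the original run directory, so that
        # whatever submits jobs from step.work_dir does so in place.
        self.work_dir = os.path.join(spinup_work_dir, f'run{run_num:03}')


step = InPlaceRestartMember(test_case=None, run_num=5,
                            spinup_work_dir='/scratch/spinup')
```

With these inputs, step.work_dir resolves to the run005 directory under /scratch/spinup.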

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Within this branch: @alexolinhager/compass/files/compass/landice/tests/ens...

This pull request was created from Copilot chat.


… restarts

Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/0e493fcc-c31f-4557-9865-eca1c04394c4

Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Replace restart member step with in-place restart" to "Replace RestartMember with InPlaceRestartMember for in-place SGH ensemble restarts" Mar 26, 2026
Copilot AI requested a review from alexolinhager March 26, 2026 22:28
alexolinhager force-pushed the sgh_ensemble_generator_template branch from cefe523 to 291d679 March 31, 2026 20:11
Copilot AI added a commit that referenced this pull request Apr 1, 2026
Bug #1: config.get('restart_ensemble', {}) crashes because MpasConfigParser.get()
expects (section, option) positional args, not a dict fallback.
Fixed: config['restart_ensemble'] returns a SectionProxy with proper
.get()/.getint()/.getfloat()/.getboolean() methods.
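
The difference between the two access patterns can be illustrated with the standard-library configparser, used here only as a stand-in for MpasConfigParser (the exact exception raised by MpasConfigParser may differ). The option values are made up for the example.

```python
import configparser

config = configparser.ConfigParser()
config.read_string("""
[restart_ensemble]
max_consecutive_restarts = 3
spinup_work_dir = /path/to/spinup
""")

# Passing a dict as the second positional argument treats it as the option
# name, not as a fallback, so the lookup fails.
try:
    config.get('restart_ensemble', {})
    dict_fallback_failed = False
except (AttributeError, TypeError):
    dict_fallback_failed = True

# Indexing by section name returns a SectionProxy with typed getters.
section = config['restart_ensemble']
max_restarts = section.getint('max_consecutive_restarts')
spinup_work_dir = section.get('spinup_work_dir')
# A missing option can still be given a default through the proxy.
missing = section.get('not_set', fallback=None)
```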

Bug #2: _should_restart_run() looked for per-run analysis_results.json files
that are never written.  AnalysisStep writes analysis_summary.json to its
own work dir containing an individual_results dict for all runs.
Fixed: add analysis_summary_file config option; configure() loads the file
and passes per-run dicts to _should_restart_run() via a new run_results param.
RestartScheduler.create_config_file() now includes analysis_summary_file in
generated configs.
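
A minimal sketch of the lookup this fix describes, assuming a particular layout for the summary file. The key name individual_results comes from the commit message; the per-run key format ('run001') and the function names are assumptions, since the JSON schema is not shown in this PR.

```python
import json


def load_individual_results(analysis_summary_file):
    """Load the individual_results dict from analysis_summary.json."""
    with open(analysis_summary_file) as f:
        summary = json.load(f)
    # Fall back to an empty dict if the analysis step wrote no results.
    return summary.get('individual_results', {})


def results_for_run(individual_results, run_num):
    """Return the per-run dict, or None if the run is absent.

    ASSUMPTION: runs are keyed as 'runNNN'; the real key format may differ.
    """
    return individual_results.get(f'run{run_num:03}')
```

The per-run dict returned by results_for_run() is what would be handed to _should_restart_run() through the new run_results parameter.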

Bug #3: restart_attempt_N/ tracking dirs were never created by
InPlaceRestartMember.setup(), so max_consecutive_restarts was effectively
disabled and all attempt counters read 0.
Fixed: setup() now creates restart_attempt_N/ dirs using a single os.listdir()
call to find the highest existing attempt number.
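
One way to find the highest existing attempt number with a single os.listdir() call is sketched below. The function names are hypothetical, and both the location of the tracking directories (directly inside the run directory) and the numbering starting at 1 are assumptions.

```python
import os
import re

# Matches tracking directories such as restart_attempt_3.
_ATTEMPT_RE = re.compile(r'^restart_attempt_(\d+)$')


def next_restart_attempt(run_dir):
    """Return the next attempt number based on one directory listing."""
    highest = 0
    for entry in os.listdir(run_dir):
        match = _ATTEMPT_RE.match(entry)
        if match:
            highest = max(highest, int(match.group(1)))
    return highest + 1


def record_restart_attempt(run_dir):
    """Create restart_attempt_N/ for the next N and return N."""
    attempt = next_restart_attempt(run_dir)
    os.mkdir(os.path.join(run_dir, f'restart_attempt_{attempt}'))
    return attempt
```

Taking the maximum rather than counting entries keeps the counter correct even if an earlier tracking directory was deleted.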

Bug #5: restart_scheduler.py docstring Examples section referenced a
non-existent module path. Fixed to the correct path.

Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/ca2d29bf-1246-415c-bf2c-9de7521fa55f

Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>