Simplify sgh_restart_ensemble: require analysis_summary_file, remove _should_restart_run and glob scanning #7

Draft
Copilot wants to merge 2 commits into sgh_ensemble_generator_template from copilot/simplify-sgh-restart-ensemble-test-case



Copilot AI commented Mar 31, 2026

sgh_restart_ensemble/test_case.py had a broken configparser API usage (config.get('restart_ensemble', {}) crashes with AttributeError), relied on per-run analysis_results.json files that don't exist in the current workflow, and used glob to discover run directories instead of reading the canonical list from the ensemble analysis output.

Changes

  • analysis_summary_file is now required in [restart_ensemble]: ValueError is raised if it is missing from the config or the file is not found on disk
  • Run list sourced from restart_needed_runs in analysis_summary_file JSON — eliminates glob directory scanning entirely
  • _should_restart_run() deleted — no per-run analysis_results.json logic remains anywhere
  • min_simulation_years_before_restart removed — no longer needed
  • Configparser API fixed — replaced broken section = config.get('restart_ensemble', {}) pattern with proper config.get(section, option) / config.getint(...) / config.getboolean(...) calls with try/except fallbacks for optional options
  • import compass.namelist moved to top-level imports
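The configparser fix described above can be reproduced with Python's standard-library `configparser`. This is a minimal sketch, not the code from the PR: the config object that compass passes to `configure()` may be a wrapper with slightly different behavior, and the section contents here are placeholders. It shows why passing a dict as the second positional argument fails (that argument is the option name, not a default) and how the `get(section, option)` form with an explicit fallback works.

```python
import configparser

config = configparser.ConfigParser()
config.read_string("""
[restart_ensemble]
spinup_work_dir = /path/to/spinup/ensemble
""")

# Broken pattern: the second positional argument is treated as the option
# name, so configparser tries to lowercase the dict and fails.
try:
    config.get('restart_ensemble', {})
    broken_error = None
except AttributeError as err:
    broken_error = err

# Fixed pattern: name the section and the option explicitly.
spinup_work_dir = config.get('restart_ensemble', 'spinup_work_dir')

# Optional option: fall back to a default only when the option is absent.
try:
    max_consecutive_restarts = config.getint(
        'restart_ensemble', 'max_consecutive_restarts')
except configparser.NoOptionError:
    max_consecutive_restarts = 3

print(type(broken_error).__name__, spinup_work_dir, max_consecutive_restarts)
```

Catching `NoOptionError` rather than a bare `Exception` is a design choice in this sketch: it keeps the default for a missing option while still surfacing a malformed value.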

Required .cfg config

[restart_ensemble]
spinup_work_dir = /path/to/spinup/ensemble
# now required
analysis_summary_file = /path/to/analysis_summary.json
# optional, default 3
max_consecutive_restarts = 3
# optional, default True
auto_restart_incomplete = True
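One caveat when annotating a file like this: Python's standard-library `configparser` keeps trailing `# ...` text as part of the value unless `inline_comment_prefixes` is set. Whether the config object used by compass strips inline comments is not confirmed here, so the sketch below only demonstrates the standard-library behavior with a hypothetical annotated config string.

```python
import configparser

# Hypothetical config text with comments on the same line as the values.
CFG_TEXT = """
[restart_ensemble]
analysis_summary_file = /path/to/analysis_summary.json  # now required
max_consecutive_restarts = 3        # optional, default 3
"""

# Default parser: the comment text stays in the value.
default_parser = configparser.ConfigParser()
default_parser.read_string(CFG_TEXT)
raw_value = default_parser.get('restart_ensemble', 'max_consecutive_restarts')

try:
    default_parser.getint('restart_ensemble', 'max_consecutive_restarts')
    int_error = None
except ValueError as err:
    # int() cannot parse a value that still contains the comment text.
    int_error = err

# Parser that strips inline '#' comments preceded by whitespace.
stripping_parser = configparser.ConfigParser(inline_comment_prefixes=('#',))
stripping_parser.read_string(CFG_TEXT)
max_restarts = stripping_parser.getint(
    'restart_ensemble', 'max_consecutive_restarts')
summary_path = stripping_parser.get(
    'restart_ensemble', 'analysis_summary_file')
```

With a parser that behaves like the default one, a broad `except Exception` around `getint` would silently fall back to the default value, and the path check would fail because the comment becomes part of the file name. Putting each comment on its own line avoids both problems regardless of parser settings.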

Checklist

  • User's Guide has been updated
  • Developer's Guide has been updated
  • API documentation in the Developer's Guide (api.rst) has any new or modified class, method and/or functions listed
  • Documentation has been built locally and changes look as expected
  • The E3SM-Project submodule has been updated with relevant E3SM changes
  • The MALI-Dev submodule has been updated with relevant MALI changes
  • Document (in a comment titled Testing in this PR) any testing that was used to verify the changes
  • New tests have been added to a test suite
Original prompt

Goal

Simplify sgh_restart_ensemble/test_case.py so that:

  1. analysis_summary_file is a required config option in [restart_ensemble] — raise ValueError if it is missing or the file doesn't exist.
  2. The run list comes directly from restart_needed_runs in that JSON — no directory scanning with glob.
  3. _should_restart_run() is completely removed. There is no per-run analysis_results.json check — ever.
  4. For each run number in restart_needed_runs, configure() only checks:
    • The original run directory (spinup_work_dir/runXXX) exists
    • restart_timestamp exists (i.e. the run actually started)
    • The run has NOT already reached stop time (compare restart_timestamp contents to config_stop_time in namelist.landice)
    • The run has NOT already exceeded max_consecutive_restarts (count restart_attempt_* subdirs — keep this for safety but it's secondary)
    • auto_restart_incomplete is True
  5. Fix the configparser API bug: replace config.get('restart_ensemble', {}) with proper config.get(section, option) / config.getint(...) / config.getfloat(...) / config.getboolean(...) calls with try/except fallbacks for optional options.
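The per-run checks in item 4 can be sketched as a small pure function, which also makes them easy to test without compass. This is an illustration under stated assumptions, not the PR's implementation: `load_restart_needed_runs` and `restart_skip_reason` are hypothetical helper names, the stop time is passed in as a string instead of being read from `namelist.landice`, and the timestamp values are made up for the demo.

```python
import json
import os
import tempfile


def load_restart_needed_runs(summary_path):
    """Return the restart_needed_runs list from the summary JSON."""
    if not os.path.exists(summary_path):
        raise ValueError(f"analysis_summary_file not found: {summary_path}")
    with open(summary_path, 'r') as f:
        summary = json.load(f)
    return summary.get('restart_needed_runs', [])


def restart_skip_reason(run_dir, stop_time, max_consecutive_restarts=3,
                        auto_restart=True):
    """Return None if the run may be restarted, else the reason to skip."""
    if not os.path.isdir(run_dir):
        return "Run directory not found"
    timestamp_file = os.path.join(run_dir, 'restart_timestamp')
    if not os.path.exists(timestamp_file):
        return "No restart_timestamp (run may have failed)"
    with open(timestamp_file, 'r') as f:
        current_time = f.read().strip()
    if current_time == stop_time:
        return "Already completed"
    attempts = [d for d in os.listdir(run_dir)
                if d.startswith('restart_attempt_')]
    if len(attempts) >= max_consecutive_restarts:
        return (f"Max restart attempts reached "
                f"({len(attempts)}/{max_consecutive_restarts})")
    if not auto_restart:
        return "Auto-restart disabled"
    return None


# Demo on a throwaway directory tree (all values are illustrative).
with tempfile.TemporaryDirectory() as work_dir:
    summary_path = os.path.join(work_dir, 'analysis_summary.json')
    with open(summary_path, 'w') as f:
        json.dump({'restart_needed_runs': [1, 2, 3]}, f)
    # run001 stopped partway, run002 reached stop time, run003 is missing.
    for name, stamp in [('run001', '0005-01-01_00:00:00'),
                        ('run002', '0010-01-01_00:00:00')]:
        os.makedirs(os.path.join(work_dir, name))
        with open(os.path.join(work_dir, name, 'restart_timestamp'), 'w') as f:
            f.write(stamp + '\n')
    results = {}
    for run_num in load_restart_needed_runs(summary_path):
        run_dir = os.path.join(work_dir, f'run{run_num:03}')
        results[run_num] = restart_skip_reason(
            run_dir, stop_time='0010-01-01_00:00:00')
```

Returning a reason string instead of appending to a list keeps the checks independent of how `configure()` records skipped runs.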

Resulting configure() logic

def configure(self):
    config = self.config

    # Required: spinup_work_dir
    try:
        spinup_work_dir = config.get('restart_ensemble', 'spinup_work_dir')
    except Exception:
        raise ValueError(
            "restart_ensemble config must specify spinup_work_dir\n"
            "[restart_ensemble]\n"
            "spinup_work_dir = /path/to/spinup/ensemble"
        )
    if not os.path.exists(spinup_work_dir):
        raise ValueError(f"spinup_work_dir not found: {spinup_work_dir}")

    # Required: analysis_summary_file
    try:
        analysis_summary_file = config.get('restart_ensemble',
                                           'analysis_summary_file')
    except Exception:
        raise ValueError(
            "restart_ensemble config must specify analysis_summary_file\n"
            "[restart_ensemble]\n"
            "analysis_summary_file = /path/to/analysis_summary.json"
        )
    if not os.path.exists(analysis_summary_file):
        raise ValueError(
            f"analysis_summary_file not found: {analysis_summary_file}")

    # Load restart candidates from summary
    with open(analysis_summary_file, 'r') as f:
        summary = json.load(f)
    restart_needed_runs = summary.get('restart_needed_runs', [])
    print(f"Found {len(restart_needed_runs)} restart candidates in "
          f"{analysis_summary_file}")

    # Optional config
    try:
        max_consecutive_restarts = config.getint(
            'restart_ensemble', 'max_consecutive_restarts')
    except Exception:
        max_consecutive_restarts = 3

    try:
        auto_restart = config.getboolean(
            'restart_ensemble', 'auto_restart_incomplete')
    except Exception:
        auto_restart = True

    restart_runs = []
    skipped_runs = []

    for run_num in restart_needed_runs:
        run_name = f'run{run_num:03}'
        run_dir = os.path.join(spinup_work_dir, run_name)

        if not os.path.exists(run_dir):
            skipped_runs.append((run_num, "Run directory not found"))
            continue

        # Check restart_timestamp exists
        restart_timestamp_file = os.path.join(run_dir, 'restart_timestamp')
        if not os.path.exists(restart_timestamp_file):
            skipped_runs.append(
                (run_num, "No restart_timestamp (run may have failed)"))
            continue

        # Check not already completed
        try:
            with open(restart_timestamp_file, 'r') as f:
                current_time = f.read().strip()
            import compass.namelist
            namelist = compass.namelist.ingest(
                os.path.join(run_dir, 'namelist.landice'))
            stop_time = (namelist['time_management']['config_stop_time']
                         .strip().strip("'"))
            if current_time == stop_time:
                skipped_runs.append((run_num, "Already completed"))
                continue
        except Exception as e:
            skipped_runs.append(
                (run_num, f"Error reading completion status: {e}"))
            continue

        # Check max restart attempts
        restart_dirs = [d for d in os.listdir(run_dir)
                        if d.startswith('restart_attempt_')]
        if len(restart_dirs) >= max_consecutive_restarts:
            skipped_runs.append(
                (run_num,
                 f"Max restart attempts reached "
                 f"({len(restart_dirs)}/{max_consecutive_restarts})"))
            continue

        if not auto_restart:
            skipped_runs.append((run_num, "Auto-restart disabled"))
            continue

        restart_runs.append(run_num)
        print(f"Scheduling restart for {run_name}")
        self.add_step(InPlaceRestartMember(
            test_case=self,
           ...

</details>



*This pull request was created from Copilot chat.*
…emove _should_restart_run and glob

Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/fb97e0bd-5318-4dc9-883f-17d2a4bb05ea

Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Simplify analysis configuration and run checks in test_case.py" to "Simplify sgh_restart_ensemble: require analysis_summary_file, remove _should_restart_run and glob scanning" on Mar 31, 2026
Copilot AI requested a review from alexolinhager March 31, 2026 23:35

2 participants