
Fix InPlaceRestartMember.setup() work_dir isolation #13

Draft
Copilot wants to merge 2 commits into sgh_ensemble_generator_template from copilot/fix-work-dir-pickle-issue

Copilot AI commented Apr 1, 2026

InPlaceRestartMember.setup() must never reassign self.work_dir. If it did, compass/setup.py would write step.pickle into the spinup run directory, and that pickle would carry config_filename = sgh_restart_ensemble.cfg. A compass run invoked from that directory would then crash with FileNotFoundError: Config file does not exist: .../spinup_ensemble/run002/sgh_restart_ensemble.cfg.

Changes

  • restart_member.py, setup() docstring: Rewritten to document explicitly that the method operates only on self.spinup_run_dir (set in __init__) and never touches self.work_dir, which compass owns as the step directory. A verbose inline comment made redundant by the updated docstring was removed.
  • Verified (no change needed):
    • __init__ has no self.work_dir override
    • ensemble_manager.py uses getattr(runStep, 'spinup_run_dir', runStep.work_dir) for os.chdir()

Checklist

  • User's Guide has been updated
  • Developer's Guide has been updated
  • API documentation in the Developer's Guide (api.rst) has any new or modified class, method and/or functions listed
  • Documentation has been built locally and changes look as expected
  • The E3SM-Project submodule has been updated with relevant E3SM changes
  • The MALI-Dev submodule has been updated with relevant MALI changes
  • Document (in a comment titled Testing in this PR) any testing that was used to verify the changes
  • New tests have been added to a test suite
Original prompt

Problem

compass/setup.py sets step.work_dir to the correct compass step directory before calling step.setup(), then writes step.pickle to step.work_dir after setup() returns. However, InPlaceRestartMember.setup() reassigns self.work_dir to the spinup run directory at the very start:

# restart_member.py lines 95-96 — THIS IS THE BUG
self.work_dir = os.path.join(
    self.spinup_work_dir, f'run{self.run_num:03}')

So by the time setup.py writes step.pickle, step.work_dir points to the spinup run dir, and the pickle is written there — containing an InPlaceRestartMember with config_filename = sgh_restart_ensemble.cfg. When the spinup job_script.sh runs compass run from that directory, it loads this pickle and crashes:

FileNotFoundError: Config file does not exist:
  /pscratch/.../spinup_ensemble/run002/sgh_restart_ensemble.cfg
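
The ordering problem described above can be reproduced in isolation. The following is a minimal sketch, not compass code: Step, BuggyMember, FixedMember and setup_step are hypothetical stand-ins, and the sketch pickles only the step's attributes rather than the step object itself. It shows that whatever setup() leaves in work_dir decides where step.pickle lands.

```python
import os
import pickle
import tempfile


class Step:
    """Hypothetical stand-in for a compass step (not the real class)."""

    def __init__(self, spinup_run_dir):
        self.work_dir = None
        self.spinup_run_dir = spinup_run_dir
        self.config_filename = 'sgh_restart_ensemble.cfg'

    def setup(self):
        pass


class BuggyMember(Step):
    def setup(self):
        # The bug: setup() redirects work_dir to the spinup run directory.
        self.work_dir = self.spinup_run_dir


class FixedMember(Step):
    def setup(self):
        # The fix: work only with spinup_run_dir and leave work_dir alone.
        run_dir = self.spinup_run_dir
        assert os.path.exists(run_dir)


def setup_step(step, step_dir):
    """Mimic the described ordering: assign work_dir, setup(), then pickle."""
    step.work_dir = step_dir
    step.setup()
    pickle_path = os.path.join(step.work_dir, 'step.pickle')
    with open(pickle_path, 'wb') as handle:
        # Assumption: attributes only, to keep the sketch self-contained.
        pickle.dump(vars(step), handle)
    return pickle_path


with tempfile.TemporaryDirectory() as base:
    step_dir = os.path.join(base, 'restart_step')
    spinup_dir = os.path.join(base, 'spinup_ensemble', 'run002')
    os.makedirs(step_dir)
    os.makedirs(spinup_dir)

    buggy_path = setup_step(BuggyMember(spinup_dir), step_dir)
    fixed_path = setup_step(FixedMember(spinup_dir), step_dir)

    # True when the pickle ended up in the spinup run directory (the bug).
    buggy_in_spinup = os.path.dirname(buggy_path) == spinup_dir
    # True when the pickle stayed in the step directory (the fix).
    fixed_in_step = os.path.dirname(fixed_path) == step_dir
```

With the buggy variant, the pickle sits next to the spinup job script, which matches the failure mode reported here.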

Root cause

PR #12 removed the self.work_dir override from __init__ but missed the identical override at the top of setup(). That override in setup() is what setup.py sees when it writes step.pickle after setup() returns.

Fix

In compass/landice/tests/ensemble_generator/sgh_restart_ensemble/restart_member.py:

  1. Delete lines 95–96 (the self.work_dir = os.path.join(...) assignment at the start of setup()).

  2. Replace all remaining uses of self.work_dir and run_dir in setup() with self.spinup_run_dir, which PR #12 (Fix InPlaceRestartMember overwriting spinup job_script.sh) added in __init__ and which already holds the correct path to the original spinup run directory.

The resulting setup() should look like:

def setup(self):
    """
    Prepare the original run directory for an in-place restart.

    Uses ``self.spinup_run_dir`` (set in ``__init__``) to operate on the
    original spinup run directory without touching ``self.work_dir``, which
    compass manages as the normal step directory.

    This method:

    1. Verifies the spinup run directory and namelist.landice exist.
    2. Sets config_do_restart = .true. in namelist.landice.
    3. Creates a restart_attempt_N/ tracking directory.
    """
    run_dir = self.spinup_run_dir

    if not os.path.exists(run_dir):
        raise RuntimeError(
            f"Original run directory not found: {run_dir}")

    namelist_path = os.path.join(run_dir, 'namelist.landice')

    if not os.path.exists(namelist_path):
        raise RuntimeError(
            f"namelist.landice not found in {run_dir}")

    print(f'Setting config_do_restart = .true. in {namelist_path}')
    _set_restart_in_namelist(namelist_path)

    existing_nums = [
        int(d[len('restart_attempt_'):])
        for d in os.listdir(run_dir)
        if d.startswith('restart_attempt_') and
        d[len('restart_attempt_'):].isdigit()
    ]
    attempt_num = max(existing_nums, default=0) + 1
    attempt_dir = os.path.join(run_dir, f'restart_attempt_{attempt_num}')
    os.makedirs(attempt_dir, exist_ok=True)
    print(f'Tracking restart attempt {attempt_num} in {attempt_dir}')
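
The attempt-numbering logic at the end of setup() can be checked on its own. This is a small sketch that lifts the same list comprehension into a hypothetical helper, next_attempt_num (a name introduced here for illustration), and runs it against a temporary directory.

```python
import os
import tempfile

PREFIX = 'restart_attempt_'


def next_attempt_num(run_dir):
    """Return one more than the highest existing restart_attempt_N."""
    existing = [
        int(name[len(PREFIX):])
        for name in os.listdir(run_dir)
        if name.startswith(PREFIX) and name[len(PREFIX):].isdigit()
    ]
    # default=0 makes the first attempt number 1 in an empty directory.
    return max(existing, default=0) + 1


with tempfile.TemporaryDirectory() as run_dir:
    first = next_attempt_num(run_dir)

    os.makedirs(os.path.join(run_dir, 'restart_attempt_1'))
    os.makedirs(os.path.join(run_dir, 'restart_attempt_4'))
    # A non-numeric suffix is skipped by the isdigit() filter.
    os.makedirs(os.path.join(run_dir, 'restart_attempt_old'))
    after_gap = next_attempt_num(run_dir)
```

Because the number comes from max() rather than a count, gaps in the sequence are not reused: with attempts 1 and 4 present, the next attempt is 5.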

Also verify that __init__ does NOT contain any self.work_dir override (it should not after PR #12 — just confirm and leave it as-is if correct).
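
One way to make this kind of check repeatable is a static scan of the source. The sketch below uses the standard-library ast module to report whether a named method assigns to a given attribute of self. The helper assigns_self_attr and the SAMPLE class are illustrative assumptions, not part of compass or of restart_member.py.

```python
import ast
import textwrap


def assigns_self_attr(source, func_name, attr):
    """Return True if func_name contains an assignment to self.<attr>."""
    tree = ast.parse(textwrap.dedent(source))
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            for inner in ast.walk(node):
                # Assignment targets are Attribute nodes with a Store context.
                if (isinstance(inner, ast.Attribute)
                        and isinstance(inner.ctx, ast.Store)
                        and inner.attr == attr
                        and isinstance(inner.value, ast.Name)
                        and inner.value.id == 'self'):
                    return True
    return False


# Hypothetical source that mirrors the pre-fix state described above.
SAMPLE = """
class Member:
    def __init__(self, spinup_work_dir, run_num):
        self.spinup_run_dir = f'{spinup_work_dir}/run{run_num:03}'

    def setup(self):
        self.work_dir = self.spinup_run_dir
"""

init_overrides = assigns_self_attr(SAMPLE, '__init__', 'work_dir')
setup_overrides = assigns_self_attr(SAMPLE, 'setup', 'work_dir')
```

Reading work_dir (a Load context) does not trigger the check, so only reassignment is flagged.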

Also verify that ensemble_manager.py uses getattr(runStep, 'spinup_run_dir', runStep.work_dir) for the os.chdir() call (added by PR #12 — just confirm and leave as-is if correct).
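
The getattr fallback can be illustrated without touching the real ensemble_manager.py. In this sketch, resolve_run_dir and the two SimpleNamespace objects are hypothetical; only the getattr expression itself comes from the text above.

```python
from types import SimpleNamespace


def resolve_run_dir(run_step):
    """Prefer spinup_run_dir when the step defines it, else work_dir."""
    # The default is evaluated eagerly, so work_dir must always exist.
    return getattr(run_step, 'spinup_run_dir', run_step.work_dir)


# A restart member carries both attributes; an ordinary step has one.
restart_step = SimpleNamespace(
    work_dir='/tmp/steps/run002',
    spinup_run_dir='/tmp/spinup_ensemble/run002')
regular_step = SimpleNamespace(work_dir='/tmp/steps/run003')

restart_dir = resolve_run_dir(restart_step)
regular_dir = resolve_run_dir(regular_step)
```

The pattern lets steps without spinup_run_dir keep their existing behaviour, while restart members are run from the original spinup directory.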

Current file state

  • compass/landice/tests/ensemble_generator/sgh_restart_ensemble/restart_member.py BlobSha: f13e64c3a82e5f3cc122eed7b8175ab0a556abfb

Constraints

  • PEP 8 formatting
  • No fallback= in any configparser calls
  • No timeout= in any subprocess calls

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: can you relink the agent session? It's blank

Assistant: The latest PR is #12:

data:
- url: "https://github.com/alexolinhager/compass/pull/12"
  repository: "alexolinhager/compass"
  state: "open"
  draft: true
  title: "[WIP] Fix job script overwrite in setup process"
  number: 12
  created_at: "2026-04-01"
  merged_at: ""
  labels: []
  author: "Copilot"

It's still in WIP/draft — the agent is actively working on it. The checklist shows the two tasks (fix restart_member.py and fix ensemble_manager.py) are pending. You can track progress directly at #12.

User: These fixes are still not getting to the root of the problem:
Traceback (most recent call last):
  File "/global/cfs/cdirs/fanssie/users/ahager/mambaforge/envs/dev_compass_1.9.0-alpha.2/bin/compass", line 7, in <module>
    sys.exit(main())
             ~~~~^^
  File "/global/cfs/cdirs/fanssie/users/ahager/MALIcode/compass/compass/main.py", line 63, in main
    commands[args.command]()
    ~~~~~~~~~~~~~~~~~~~~~~^^
  File "/global/cfs/cdirs/fanssie/users/ahager/MALIcode/compass/compass/run/serial.py", line 206, in main
    run_single_step(args.step_is_subprocess)
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^...

This pull request was created from Copilot chat.

Copilot AI changed the title from "[WIP] Fix work directory assignment in InPlaceRestartMember" to "Fix InPlaceRestartMember.setup() work_dir isolation" on Apr 1, 2026
Copilot AI requested a review from alexolinhager April 1, 2026 23:46