Skip to content

Fix: InPlaceRestartMember no longer overwrites original job_script.sh#11

Merged
alexolinhager merged 2 commits into
sgh_ensemble_generator_templatefrom
copilot/fix-job-script-overwrite
Apr 1, 2026
Merged

Fix: InPlaceRestartMember no longer overwrites original job_script.sh#11
alexolinhager merged 2 commits into
sgh_ensemble_generator_templatefrom
copilot/fix-job-script-overwrite

Conversation

Copy link
Copy Markdown

Copilot AI commented Apr 1, 2026

InPlaceRestartMember.setup() was overwriting the spinup run directory's job_script.sh with a new script invoking compass run. At runtime on the compute node, compass run looked for sgh_restart_ensemble.cfg relative to the spinup run dir — a file that only lives in the compass work directory — causing a FileNotFoundError.

The original job_script.sh already has everything needed to restart: correct MPI config, MALI executable path, and will automatically pick up the restart via config_do_restart = .true. and restart_timestamp. compass run is only needed at the test-case level to drive EnsembleManager.

Changes to restart_member.py

Removed from setup():

  • write_job_script(...) — was silently clobbering the working original script
  • self.add_model_as_input() — unnecessary; trying to symlink the executable into the spinup dir is potentially destructive
  • ntasks / min_tasks / config.set('job', ...) / machine config block (only needed to feed write_job_script)
  • symlink(load_compass_env.sh) — original run dir already has this if required

Removed unused imports: configparser, compass.io.symlink, compass.job.write_job_script

setup() now does exactly:

  1. Verify run dir and namelist.landice exist
  2. Set config_do_restart = .true. in namelist.landice
  3. Create restart_attempt_N/ tracking directory

ensemble_manager.py: no changes required — it already cds to runStep.work_dir (the original spinup run dir) and calls sbatch job_script.sh.

Checklist

  • User's Guide has been updated
  • Developer's Guide has been updated
  • API documentation in the Developer's Guide (api.rst) has any new or modified class, method and/or functions listed
  • Documentation has been built locally and changes look as expected
  • The E3SM-Project submodule has been updated with relevant E3SM changes
  • The MALI-Dev submodule has been updated with relevant MALI changes
  • Document (in a comment titled Testing in this PR) any testing that was used to verify the changes
  • New tests have been added to a test suite
Original prompt

Problem

When compass run is invoked from the sgh_restart_ensemble test case work directory, EnsembleManager.run() calls sbatch job_script.sh from each restart step's work_dir. Because InPlaceRestartMember overrides work_dir to point at the original spinup run directory (e.g. .../spinup_ensemble/run002), the sbatch call fires correctly from that directory.

However, restart_member.py's setup() calls write_job_script(...), which overwrites the original job_script.sh in the spinup run directory with a new script that contains a compass run command. When that script runs on a compute node, compass run looks for step.pickle and then tries to open step.config_filename, which is sgh_restart_ensemble.cfg — a file that lives in the compass work directory, not the spinup run directory. The result:

FileNotFoundError: Config file does not exist:
  /pscratch/.../spinup_ensemble/run002/sgh_restart_ensemble.cfg

What we actually want

The restart runs should execute entirely within the original spinup run directory using the original job_script.sh that was created when the spinup ensemble was set up. That script already:

  • Knows the correct MPI launch command, node count, and MALI executable path
  • Runs MALI directly (not via compass run)
  • Will automatically pick up the restart because namelist.landice now has config_do_restart = .true. and a restart_timestamp file already exists

No compass run is needed at runtime. compass run is only needed at the test-case level (to run EnsembleManager, which submits the original job_script.sh scripts).

Required changes

1. compass/landice/tests/ensemble_generator/sgh_restart_ensemble/restart_member.py

In InPlaceRestartMember.setup():

  • Remove the call to write_job_script(...) — do NOT overwrite the original job_script.sh.
  • Remove the call to self.add_model_as_input() — registering the model executable as an input causes compass to try to symlink/copy the executable into the spinup run dir, which is unnecessary and potentially destructive.
  • Remove the ntasks / min_tasks / self.config.set('job', ...) / machine / write_job_script block entirely.
  • Remove the symlink(script_filename, ...) for load_compass_env.sh — the original run dir already has this if it needs it.
  • Keep the work_dir override at the top of setup() (the fix from the previous bug).
  • Keep the _set_restart_in_namelist() call.
  • Keep the restart_attempt_N/ directory tracking logic.
  • The setup() method should do exactly three things:
    1. Override self.work_dir to the original spinup run dir.
    2. Verify the run dir and namelist.landice exist.
    3. Set config_do_restart = .true. in namelist.landice.
    4. Create the restart_attempt_N/ tracking directory.

Also remove the unused imports that are no longer needed after removing write_job_script, add_model_as_input, and symlink: remove import configparser, from compass.io import symlink, from compass.job import write_job_script.

2. compass/landice/tests/ensemble_generator/ensemble_manager.py

The existing EnsembleManager.run() already does the right thing for restart steps — it cds to runStep.work_dir and calls sbatch job_script.sh. Since work_dir for restart steps is the original spinup run dir, and the original job_script.sh is preserved (by fix #1), this will work correctly without any changes needed.

However, verify that the logic is correct:

  • For restart steps, runStep.work_dir must be the original spinup run dir (e.g. .../spinup_ensemble/run002), not a subdirectory of the compass restart work dir.
  • The err_files check (glob.glob('log.landice.*.err')) and the restart_timestamp completion check should still work correctly from the original run dir.
  • No changes should be needed here if fix Replace RestartMember with InPlaceRestartMember for in-place SGH ensemble restarts #1 is applied correctly.

Constraints

  • PEP 8 formatting throughout
  • No fallback= keyword argument in any configparser / config.get*() calls
  • No timeout= argument in any subprocess calls
  • Do not copy or move any files from the spinup run directory
  • Do not create any new directories under the compass work directory for restart steps (the step's work dir IS the original run dir)

Branch

Base branch: sgh_ensemble_generator_template

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: https://github.com/alexolinhager/compass/tree/sgh_ensemble_generator_template

building off of the above branch, I want to debug the sgh_ensemble_restart test case. I want the test case to look through the summary_analysis.json file (by pointing to it in the sgh_restart_ensemble/ensemble_generator.cfg file) created by the sgh_ensemble_analysis test case and identify simulations that are not yet at steady state and satisfy the max_consecutive_restarts and min_simulation_years_before_restart requirements. I then want to cha...

This pull request was created from Copilot chat.

…artMember

- Remove write_job_script() call that overwrote the original job_script.sh
- Remove add_model_as_input() call (unnecessary and potentially destructive)
- Remove ntasks/min_tasks/config.set/machine/write_job_script config block
- Remove symlink for load_compass_env.sh
- Remove unused imports: configparser, compass.io.symlink, compass.job.write_job_script
- Update docstrings to accurately reflect the simplified setup() behavior

Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/0a459658-101e-4f84-8914-c88d0b8e7385

Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix job script overwrite in spinup run directory Fix: InPlaceRestartMember no longer overwrites original job_script.sh Apr 1, 2026
Copilot AI requested a review from alexolinhager April 1, 2026 22:15
@alexolinhager alexolinhager marked this pull request as ready for review April 1, 2026 22:35
@alexolinhager alexolinhager merged commit 5f570ac into sgh_ensemble_generator_template Apr 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants