Fix: InPlaceRestartMember no longer overwrites original job_script.sh #11
Merged
alexolinhager merged 2 commits into […] on Apr 1, 2026
…artMember

- Remove write_job_script() call that overwrote the original job_script.sh
- Remove add_model_as_input() call (unnecessary and potentially destructive)
- Remove ntasks/min_tasks/config.set/machine/write_job_script config block
- Remove symlink for load_compass_env.sh
- Remove unused imports: configparser, compass.io.symlink, compass.job.write_job_script
- Update docstrings to accurately reflect the simplified setup() behavior

Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/0a459658-101e-4f84-8914-c88d0b8e7385
Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix job script overwrite in spinup run directory" to "Fix: InPlaceRestartMember no longer overwrites original job_script.sh" on Apr 1, 2026
`InPlaceRestartMember.setup()` was overwriting the spinup run directory's `job_script.sh` with a new script invoking `compass run`. At runtime on the compute node, `compass run` looked for `sgh_restart_ensemble.cfg` relative to the spinup run dir (a file that only lives in the compass work directory), causing a `FileNotFoundError`.

The original `job_script.sh` already has everything needed to restart: the correct MPI config and MALI executable path, and it automatically picks up the restart via `config_do_restart = .true.` and `restart_timestamp`. `compass run` is only needed at the test-case level to drive `EnsembleManager`.

Changes to `restart_member.py`

Removed from `setup()`:

- `write_job_script(...)`: was silently clobbering the working original script
- `self.add_model_as_input()`: unnecessary; trying to symlink the executable into the spinup dir is potentially destructive
- `ntasks` / `min_tasks` / `config.set('job', ...)` / `machine` config block (only needed to feed `write_job_script`)
- `symlink(load_compass_env.sh)`: the original run dir already has this if required

Removed unused imports: `configparser`, `compass.io.symlink`, `compass.job.write_job_script`

`setup()` now does exactly:

- Verify that […] `namelist.landice` exist
- Set `config_do_restart = .true.` in `namelist.landice`
- Create the `restart_attempt_N/` tracking directory

`ensemble_manager.py`: no changes required. It already `cd`s to `runStep.work_dir` (the original spinup run dir) and calls `sbatch job_script.sh`.

Checklist

- […] `api.rst`) has any new or modified class, method and/or functions listed
- […] `E3SM-Project` submodule has been updated with relevant E3SM changes
- […] `MALI-Dev` submodule has been updated with relevant MALI changes
- […] `Testing` in this PR) any testing that was used to verify the changes

Original prompt
Problem

When `compass run` is invoked from the `sgh_restart_ensemble` test case work directory, `EnsembleManager.run()` calls `sbatch job_script.sh` from each restart step's `work_dir`. Because `InPlaceRestartMember` overrides `work_dir` to point at the original spinup run directory (e.g. `.../spinup_ensemble/run002`), the `sbatch` call fires correctly from that directory.

However, `restart_member.py`'s `setup()` calls `write_job_script(...)`, which overwrites the original `job_script.sh` in the spinup run directory with a new script that contains a `compass run` command. When that script runs on a compute node, `compass run` looks for `step.pickle` and then tries to open `step.config_filename`, which is `sgh_restart_ensemble.cfg`, a file that lives in the compass work directory, not the spinup run directory. The result: […]

What we actually want
The restart runs should execute entirely within the original spinup run directory using the original `job_script.sh` that was created when the spinup ensemble was set up. That script already:

- […] `compass run`)
- […] `namelist.landice` now has `config_do_restart = .true.` and a `restart_timestamp` file already exists

No `compass run` is needed at runtime. `compass run` is only needed at the test-case level (to run `EnsembleManager`, which submits the original `job_script.sh` scripts).

Required changes
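A minimal sketch of the behaviour this section asks for, written with the standard library only rather than the compass API. The function name `prepare_in_place_restart` and the `restart_attempt_N` numbering scheme are assumptions made for illustration; the actual `InPlaceRestartMember.setup()` may differ. The point is that `namelist.landice` is edited and a tracking directory is created, while `job_script.sh` is never written.

```python
import re
from pathlib import Path


def prepare_in_place_restart(run_dir):
    """Hypothetical stand-in for the simplified setup(); not the compass API."""
    run_dir = Path(run_dir)
    namelist = run_dir / "namelist.landice"
    if not namelist.is_file():
        raise FileNotFoundError(f"missing {namelist}")

    # Flip config_do_restart to .true. and leave every other line untouched.
    pattern = re.compile(
        r"^([ \t]*config_do_restart[ \t]*=[ \t]*)\S+", re.MULTILINE
    )
    new_text, count = pattern.subn(r"\g<1>.true.", namelist.read_text())
    if count == 0:
        raise ValueError("config_do_restart not found in namelist.landice")
    namelist.write_text(new_text)

    # Create the next free restart_attempt_N/ tracking directory
    # (the numbering scheme here is an assumption).
    attempt = 1
    while (run_dir / f"restart_attempt_{attempt}").exists():
        attempt += 1
    tracking_dir = run_dir / f"restart_attempt_{attempt}"
    tracking_dir.mkdir()

    # Deliberately no write to job_script.sh: the original script is reused.
    return tracking_dir
```

Calling the function twice on the same run directory would create `restart_attempt_1/` and then `restart_attempt_2/`, and `job_script.sh` would be left unchanged both times.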
1. `compass/landice/tests/ensemble_generator/sgh_restart_ensemble/restart_member.py`

In `InPlaceRestartMember.setup()`:

- Remove `write_job_script(...)`: do NOT overwrite the original `job_script.sh`.
- Remove `self.add_model_as_input()`: registering the model executable as an input causes compass to try to symlink/copy the executable into the spinup run dir, which is unnecessary and potentially destructive.
- Remove the `ntasks` / `min_tasks` / `self.config.set('job', ...)` / `machine` / `write_job_script` block entirely.
- Remove `symlink(script_filename, ...)` for `load_compass_env.sh`: the original run dir already has this if it needs it.
- Keep the `work_dir` override at the top of `setup()` (the fix from the previous bug).
- Keep the `_set_restart_in_namelist()` call.
- Keep the `restart_attempt_N/` directory tracking logic.

The `setup()` method should do exactly the following:

- Point `self.work_dir` to the original spinup run dir.
- Verify that […] `namelist.landice` exist.
- Set `config_do_restart = .true.` in `namelist.landice`.
- Create the `restart_attempt_N/` tracking directory.

Also remove the unused imports that are no longer needed after removing `write_job_script`, `add_model_as_input`, and `symlink`: remove `import configparser`, `from compass.io import symlink`, `from compass.job import write_job_script`.

2. `compass/landice/tests/ensemble_generator/ensemble_manager.py`

The existing `EnsembleManager.run()` already does the right thing for restart steps: it `cd`s to `runStep.work_dir` and calls `sbatch job_script.sh`. Since `work_dir` for restart steps is the original spinup run dir, and the original `job_script.sh` is preserved (by fix #1), this will work correctly without any changes.

However, verify that the logic is correct:

- `runStep.work_dir` must be the original spinup run dir (e.g. `.../spinup_ensemble/run002`), not a subdirectory of the compass restart work dir.
- The `err_files` check (`glob.glob('log.landice.*.err')`) and the `restart_timestamp` completion check should still work correctly from the original run dir.

Constraints
- […] `fallback=` keyword argument in any `configparser` / `config.get*()` calls
- […] `timeout=` argument in any `subprocess` calls

Branch

Base branch: `sgh_ensemble_generator_template`

The following is the prior conversation context from the user's chat exploration (may be truncated):
User: https://github.com/alexolinhager/compass/tree/sgh_ensemble_generator_template
building off of the above branch, I want to debug the sgh_ensemble_restart test case. I want the test case to look through the summary_analysis.json file (by pointing to it in the sgh_restart_ensemble/ensemble_generator.cfg file) created by the sgh_ensemble_analysis test case and identify simulations that are not yet at steady state and satisfy the max_consecutive_restarts and min_simulation_years_before_restart requirements. I then want to cha...
This pull request was created from Copilot chat.