Fool-proofing automated slurm workflow #39
Comments
Could we have each job submit the next one upon completion? This would also let us decide how much time/memory to give the sampler job based on the ref job. Honestly, the current behavior isn't too bad. Instead of failing silently and leaving the user to investigate why the jobs finished but the result files are not there, the user simply checks the queue, sees where it failed, and can fix and re-submit.
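The "each job submits the next one" idea could look roughly like the sketch below. It is not the repository's implementation: `run_then_submit`, the `SUBMIT` variable, and the script names in the usage note are hypothetical, and the only real CLI assumed is `sbatch` taking a batch script path.

```shell
#!/usr/bin/env bash
# Sketch: a stage submits its successor only after it succeeds.
# SUBMIT defaults to sbatch; override it (e.g. SUBMIT=echo) for a dry run.
SUBMIT="${SUBMIT:-sbatch}"

# run_then_submit NEXT_SCRIPT CMD [ARGS...]
# Runs CMD. On success, submits NEXT_SCRIPT. On failure, prints a message to
# stderr and returns CMD's exit code without submitting anything.
run_then_submit() {
    local next_script="$1"
    shift
    "$@" || {
        local status=$?
        echo "stage failed with exit code ${status}; not submitting ${next_script}" >&2
        return "${status}"
    }
    "${SUBMIT}" "${next_script}"
}
```

Inside a batch script this might be called as `run_then_submit sampler.sbatch julia --project=. slurm/make_ref.jl path/to/config.toml` (the `.sbatch` file name is made up). Since the submission happens at the end of the ref job, that is also the point where `--time` or `--mem` values for the sampler job could be computed and passed to `sbatch`.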
Using this issue as a catch-all for potential future pipeline improvements... Right now the pipeline only works out-of-the-box if submitting to a SLURM cluster. Currently to run locally one can use the commands below, then delete the
julia --project=. slurm/make_ref.jl path/to/config.toml
julia --project=. exp/sampler.jl path/to/config.toml 1 100
julia --project=. slurm/merge.jl path/to/config.toml
julia --project=. slurm/cleanup.jl path/to/config.toml
I think it would require only minor edits to provide a similar experience to the SLURM pipeline when running locally, but it's not high priority.
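One possible shape for a local driver is sketched below. It simply chains the four commands quoted above and stops at the first failure. The function name `run_local` and the `JULIA` override are hypothetical, and the sampler arguments `1 100` are copied verbatim from the comment rather than interpreted.

```shell
#!/usr/bin/env bash
# Sketch of a local (non-SLURM) driver that runs the four steps in order.
# JULIA defaults to the julia binary; set JULIA=echo to print the commands.
JULIA="${JULIA:-julia}"

# run_local CONFIG
# Runs each stage in sequence; the && chain stops at the first failing stage
# and returns its exit code.
run_local() {
    local config="$1"
    "${JULIA}" --project=. slurm/make_ref.jl "${config}" &&
    "${JULIA}" --project=. exp/sampler.jl "${config}" 1 100 &&
    "${JULIA}" --project=. slurm/merge.jl "${config}" &&
    "${JULIA}" --project=. slurm/cleanup.jl "${config}"
}
```

Usage would be `run_local path/to/config.toml`. Because a failing stage aborts the chain, later stages never run on missing inputs, which mirrors what the job dependencies do on the cluster.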
I'm OK with that, we are the primary users of this workflow.
I'm also OK with that for now. To me, the main limitation of the current setup is that we are not able to unit-test it fully. But the slurm pipeline works great and I have nothing bad to say about it :)
Agreed. Our energy is better spent elsewhere (e.g. building documentation or supporting additional OPF formulations).
I ran into an issue when using the new slurm automated workflow: my sysimage job failed, which prevented the rest of the jobs from ever running.
I don't know how, but having a failsafe would be nice.
I did not see anything in slurm that allows something like "if job B depends on job A and job A failed, then fail job B" rather than "if job B depends on job A and job A failed, then job B will wait forever" (the latter is slurm's current behavior).
Some possibilities:
--dependency=afterany and track job progress differently?
TBH, this would be more of a convenience, not the highest priority.
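For the failsafe itself, one option that may be worth checking is `sbatch`'s `--kill-on-invalid-dep=yes`, which asks Slurm to terminate a job whose dependency can never be satisfied instead of leaving it pending. The sketch below chains two jobs that way. `submit_chain`, the `SBATCH` override, and the script names in the usage note are hypothetical, and the exact behavior may depend on the Slurm version and cluster configuration.

```shell
#!/usr/bin/env bash
# Sketch: chain two jobs so the dependent is cancelled, not left pending,
# when its dependency fails. SBATCH can be overridden for a dry run.
SBATCH="${SBATCH:-sbatch}"

# submit_chain FIRST_SCRIPT SECOND_SCRIPT
# Submits FIRST_SCRIPT, then SECOND_SCRIPT with an afterok dependency on it.
# Prints the job ID of the second job.
submit_chain() {
    local first="$1" second="$2" first_id
    # --parsable makes sbatch print the job ID, optionally followed by ";cluster".
    first_id="$("${SBATCH}" --parsable "${first}")" || return 1
    # Strip the optional ";cluster" suffix.
    first_id="${first_id%%;*}"
    "${SBATCH}" --parsable \
        --dependency="afterok:${first_id}" \
        --kill-on-invalid-dep=yes \
        "${second}"
}
```

A call such as `submit_chain sysimage.sbatch ref.sbatch` (file names made up) would submit both jobs up front. Replacing `afterok` with `afterany` gives the other variant mentioned above, where the second job always starts and has to check for its own inputs.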