
Fool-proofing automated slurm workflow #39

Open
mtanneau opened this issue Feb 22, 2024 · 3 comments

@mtanneau
Contributor

I ran into an issue when using the new automated slurm workflow: my sysimage job failed, which prevented the rest of the jobs from ever running.
I don't know how to do it yet, but having a failsafe would be nice.

I did not see anything in slurm that allows something like "if job B depends on job A and job A fails, then fail job B"; slurm's current behavior is instead "if job B depends on job A and job A fails, then job B waits forever".

Some possibilities:

  • some slurm option that would implement the behavior above? (a rough sketch of one candidate follows this list)
  • use --dependency=afterany and track job progress differently?
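
For reference, a minimal sketch of the first option, assuming the cluster's slurm version supports sbatch's --kill-on-invalid-dep option (which cancels a job whose dependency can never be satisfied, e.g. because the parent failed). The script names here are placeholders, not actual files in this repo:

sysimage=$(sbatch --parsable sysimage.sbatch)   # --parsable prints only the job id
# afterok: run only if the sysimage job succeeds.
# kill-on-invalid-dep=yes: if the sysimage job fails, cancel this job
# instead of leaving it pending forever.
sbatch --dependency=afterok:$sysimage --kill-on-invalid-dep=yes sampler.sbatch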

TBH, this would be more of a convenience, not the highest priority.

@klamike
Collaborator

klamike commented Feb 22, 2024

We could have each job submit the next one upon completion (rough sketch below). This would also let us decide how much time/memory to give the sampler job based on the ref job.
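
A sketch of what chained submission could look like, written as the ref job's sbatch script; sampler.sbatch is a placeholder name for the next stage's script:

#!/bin/bash
#SBATCH --job-name=ref

# The ref job's actual work; bail out if it fails so the chain stops here.
julia --project=. slurm/make_ref.jl path/to/config.toml || exit 1

# Only reached on success: submit the next stage. Its time/memory limits
# could be computed here from the ref job's results.
sbatch sampler.sbatch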

Honestly, the current behavior isn't too bad. Instead of failing silently and leaving the user to investigate why the jobs finished but the results files are missing, the user simply checks the queue, sees where the pipeline failed, and can fix & re-submit.

@klamike
Collaborator

klamike commented Apr 23, 2024

Using this issue as a catch-all for potential future pipeline improvements...

Right now the pipeline only works out-of-the-box if submitting to a SLURM cluster.

Currently, to run locally, one can use the commands below and then delete the $export_dir/res_h5 folder. One problem with this is that it doesn't (automatically) parallelize; see the sketch after the commands.

julia --project=. slurm/make_ref.jl path/to/config.toml
julia --project=. exp/sampler.jl path/to/config.toml 1 100
julia --project=. slurm/merge.jl path/to/config.toml
julia --project=. slurm/cleanup.jl path/to/config.toml
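
For what it's worth, a minimal sketch of parallelizing the sampler step locally with plain background shell jobs, assuming the two trailing arguments of exp/sampler.jl are the first and last sample indices (as the 1 100 above suggests):

julia --project=. slurm/make_ref.jl path/to/config.toml
# Split the 100 samples into 4 chunks of 25 and run them concurrently.
for i in 0 1 2 3; do
    julia --project=. exp/sampler.jl path/to/config.toml $((25*i + 1)) $((25*i + 25)) &
done
wait  # block until all sampler processes finish
julia --project=. slurm/merge.jl path/to/config.toml
julia --project=. slurm/cleanup.jl path/to/config.toml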

I think it would require only minor edits to provide a similar experience to the SLURM pipeline when running locally, but it's not high priority.

@mtanneau
Contributor Author

Right now the pipeline only works out-of-the-box if submitting to a SLURM cluster.

I'm OK with that, we are the primary users of this workflow.

to run locally [...] one problem is that it doesn't (automatically) parallelize.

I'm also OK with that for now. To me, the main limitation of the current setup is that we are not able to unit-test it fully. But the slurm pipeline works great and I have nothing bad to say about it :)

it's not high priority

Agreed. Our energy is better spent elsewhere (e.g. building documentation or supporting additional OPF formulations).
I'll also point out that one goal is that we generate the datasets so that people can readily use them. Hence my focus on improving the experience of downstream data users, rather than our own experience generating the data.
