
Fool-proofing automated slurm workflow #39

Open
mtanneau opened this issue Feb 22, 2024 · 3 comments

@mtanneau
Contributor

I ran into an issue when using the new automated slurm workflow: my sysimage job failed, which prevented the rest of the jobs from ever running.
I don't know how to do it yet, but having a failsafe would be nice.

I did not see anything in slurm that allows something like "if job B depends on job A and job A fails, then fail job B"; slurm's current behavior is instead "if job B depends on job A and job A fails, then job B waits forever".

Some possibilities:

  • some slurm option that would implement the behavior above? (a rough sketch of one candidate follows this list)
  • use --dependency=afterany and track job progress differently?
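
For reference, a minimal sketch of the first option, assuming the cluster's slurm version supports sbatch's --kill-on-invalid-dep option (which cancels a job whose dependency can never be satisfied, e.g. because the parent failed). The script names here are placeholders, not actual files in this repo:

sysimage=$(sbatch --parsable sysimage.sbatch)   # --parsable prints only the job id
# afterok: run only if the sysimage job succeeds.
# kill-on-invalid-dep=yes: if the sysimage job fails, cancel this job
# instead of leaving it pending forever.
sbatch --dependency=afterok:$sysimage --kill-on-invalid-dep=yes sampler.sbatch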

TBH, this would be more of a convenience, not the highest priority.

@klamike
Collaborator

klamike commented Feb 22, 2024

We could have each job submit the next one upon completion (rough sketch below). This would also let us decide how much time/memory to give the sampler job based on the ref job.
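
A sketch of what chained submission could look like, written as the ref job's sbatch script; sampler.sbatch is a placeholder name for the next stage's script:

#!/bin/bash
#SBATCH --job-name=ref

# The ref job's actual work; bail out if it fails so the chain stops here.
julia --project=. slurm/make_ref.jl path/to/config.toml || exit 1

# Only reached on success: submit the next stage. Its time/memory limits
# could be computed here from the ref job's results.
sbatch sampler.sbatch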

Honestly, the current behavior isn't too bad. Instead of failing silently and leaving the user to investigate why the jobs finished but the results files are missing, the user simply checks the queue, sees where the pipeline failed, and can fix & re-submit.

@klamike
Collaborator

klamike commented Apr 23, 2024

Using this issue as a catch-all for potential future pipeline improvements...

Right now the pipeline only works out-of-the-box if submitting to a SLURM cluster.

Currently, to run locally, one can use the commands below and then delete the $export_dir/res_h5 folder. One problem with this is that it doesn't (automatically) parallelize; see the sketch after the commands.

julia --project=. slurm/make_ref.jl path/to/config.toml
julia --project=. exp/sampler.jl path/to/config.toml 1 100
julia --project=. slurm/merge.jl path/to/config.toml
julia --project=. slurm/cleanup.jl path/to/config.toml
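
For what it's worth, a minimal sketch of parallelizing the sampler step locally with plain background shell jobs, assuming the two trailing arguments of exp/sampler.jl are the first and last sample indices (as the 1 100 above suggests):

julia --project=. slurm/make_ref.jl path/to/config.toml
# Split the 100 samples into 4 chunks of 25 and run them concurrently.
for i in 0 1 2 3; do
    julia --project=. exp/sampler.jl path/to/config.toml $((25*i + 1)) $((25*i + 25)) &
done
wait  # block until all sampler processes finish
julia --project=. slurm/merge.jl path/to/config.toml
julia --project=. slurm/cleanup.jl path/to/config.toml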

I think it would require only minor edits to provide a similar experience to the SLURM pipeline when running locally, but it's not high priority.

@mtanneau
Contributor Author

Right now the pipeline only works out-of-the-box if submitting to a SLURM cluster.

I'm OK with that, we are the primary users of this workflow.

to run locally [...] one problem is that it doesn't (automatically) parallelize.

I'm also OK with that for now. To me, the main limitation of the current setup is that we are not able to unit-test it fully. But the slurm pipeline works great and I have nothing bad to say about it :)

it's not high priority

Agreed. Our energy is better spent elsewhere (e.g. building documentation or supporting additional OPF formulations).
I'll also point out that one goal is that we generate the datasets so that people can readily use them. Hence my focus on improving the experience of downstream data users, rather than our own experience generating the data.
