Open
Description
With an image-based deploy the current workflow for adding a node looks like:
1. Boot a new compute node. It attempts to join the cluster, slurmctld reports that it has no NodeName entry for it, and slurmd dies.
2. Run the role on the ENTIRE cluster, so that:
   - a new slurm.conf is generated that includes the new node
   - slurmctld and ALL slurmd daemons (including the new, failed one) are restarted in the correct order
Item 2 is really noisy, as every compute node runs all of the Ansible tasks. It would be good if we could run just the steps these cases actually need.
I think the cases covered are:
- Adding nodes with an appropriate image
- Deleting nodes
We could probably do this using just the configure tag, but this needs testing and documenting.
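As a starting point for that testing, here is a minimal sketch of what a tag-limited run might look like. It assumes the relevant tasks really are tagged configure; the playbook and inventory paths are placeholders, and the helper name is made up for illustration. Whether this tag alone regenerates slurm.conf and restarts the daemons in the right order is exactly what still needs checking.

```python
import shlex


def configure_command(playbook, inventory, tag="configure"):
    """Build an ansible-playbook argv that runs only tasks carrying `tag`.

    `playbook` and `inventory` are placeholders; substitute the real paths.
    """
    # --tags restricts the run to tasks with the given tag, so the rest of
    # the role is skipped on every host.
    return ["ansible-playbook", "-i", inventory, playbook, "--tags", tag]


if __name__ == "__main__":
    # Hypothetical paths, printed as a shell-safe command line for review.
    print(shlex.join(configure_command("site.yml", "inventory/hosts")))
```

Printing the command rather than running it keeps the sketch safe to try; it could be passed to subprocess.run once the behaviour of the tag has been confirmed.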