Open
Description
If this fails, journalctl or systemctl status will often have useful information. For example, if you specify two partitions that share nodes (which is legal in Slurm, but isn't handled by our current templating), then:
- slurmctld appears to start successfully according to the Ansible output, but has actually failed
- slurmd reports:
Unable to start service slurmd: Job for slurmd.service failed because a timeout was exceeded. See "systemctl status slurmd.service" and "journalctl -xe" for details.
but the control node shows the real cause:
$ sudo journalctl -xe
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-0 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-1 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-2 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-3 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-express-0 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-express-1 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: fatal: Duplicated NodeHostName nrel-hpc-0 in config file
and
$ sudo systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2021-02-24 09:21:38 UTC; 5min ago
Process: 26178 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 26180 (code=exited, status=1/FAILURE)
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: layouts: no layout to initialize
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-0 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-1 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-2 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-3 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-express-0 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-express-1 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: fatal: Duplicated NodeHostName nrel-hpc-0 in config file
Feb 24 09:21:38 nrel-control.novalocal systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Feb 24 09:21:38 nrel-control.novalocal systemd[1]: slurmctld.service: Failed with result 'exit-code'.
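
The shared-node condition behind these "Duplicated NodeHostName" errors could in principle be detected before the config is templated. Below is a minimal sketch of such a check, not the project's actual code: the function name `find_shared_nodes`, the dict-of-lists input shape, and the example partition layout are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def find_shared_nodes(partitions):
    """Return {node: [partition names]} for nodes listed in more than one partition.

    `partitions` is assumed to map a partition name to its list of node
    hostnames; this shape is hypothetical, not the real variable layout.
    """
    seen = defaultdict(list)
    for partition, nodes in partitions.items():
        for node in nodes:
            seen[node].append(partition)
    # Keep only nodes that were listed under two or more partitions.
    return {node: parts for node, parts in seen.items() if len(parts) > 1}


# Hypothetical layout reusing hostnames from the log output above.
partitions = {
    "hpc": ["nrel-hpc-0", "nrel-hpc-1"],
    "express": ["nrel-express-0", "nrel-hpc-0"],
}

shared = find_shared_nodes(partitions)
for node, parts in sorted(shared.items()):
    print(f"{node} appears in partitions: {', '.join(parts)}")
```

Failing early on a non-empty result would surface the problem at configuration time rather than as a slurmd timeout.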