Provide error messages on failure to start slurm daemons #95

Open
@sjpb

If starting the slurm daemons fails, then journalctl or systemctl status may well have useful info which we should surface in the error message. For example, if you specify two partitions which share nodes (which is legal in slurm, but isn't handled by our current templating) then:

  • slurmctld appears to start successfully as far as ansible is concerned, but actually fails
  • slurmd shows: Unable to start service slurmd: Job for slurmd.service failed because a timeout was exceeded. See "systemctl status slurmd.service" and "journalctl -xe" for details.

However, the control node shows the actual cause:

$ sudo journalctl -xe
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-0 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-1 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-2 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-3 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-express-0 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-express-1 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: fatal: Duplicated NodeHostName nrel-hpc-0 in config file

and

$ sudo systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2021-02-24 09:21:38 UTC; 5min ago
  Process: 26178 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 26180 (code=exited, status=1/FAILURE)

Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: layouts: no layout to initialize
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-0 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-1 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-2 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-3 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-express-0 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-express-1 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: fatal: Duplicated NodeHostName nrel-hpc-0 in config file
Feb 24 09:21:38 nrel-control.novalocal systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Feb 24 09:21:38 nrel-control.novalocal systemd[1]: slurmctld.service: Failed with result 'exit-code'.
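
One possible shape for this, as a minimal sketch only: check whether the unit is really active after the start attempt and, if not, print its status plus the error/fatal lines from its journal. The function names filter_slurm_errors and report_if_failed are hypothetical and not part of the role, and the 50-line journal window is an arbitrary assumption.

```shell
#!/bin/sh

# Keep only lines that slurmctld/slurmd log at "error:" or "fatal:" level.
# Reads journal text on stdin; "|| true" avoids a non-zero exit when
# nothing matches.
filter_slurm_errors() {
    grep -E '(slurmctld|slurmd)\[[0-9]+\]: (error|fatal): ' || true
}

# Hypothetical helper (not part of the role): if the given unit is not
# active, print its status and the relevant journal lines to stderr,
# then return non-zero so the caller can fail with a useful message.
report_if_failed() {
    unit="$1"
    if systemctl is-active --quiet "$unit"; then
        return 0
    fi
    echo "Unit $unit is not active:" >&2
    systemctl status --no-pager "$unit" >&2 || true
    # Assumption: the last 50 journal lines are enough to show the cause.
    journalctl --no-pager -u "$unit" -n 50 | filter_slurm_errors >&2
    return 1
}
```

Running report_if_failed slurmctld on the control node after the start attempt would be one way to use it. In the output above the ExecStart process exits with status 0 while the main PID then exits with status 1, so the check probably has to run after the start command has returned, possibly after a short delay, rather than relying on the start command's own exit status.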


Labels: enhancement