Bump to 24.11.4 #75

Open
wants to merge 209 commits into
base: 24.11.ug

Conversation

itkovian
Member

@itkovian itkovian commented Apr 9, 2025

No description provided.

naterini and others added 30 commits March 13, 2025 20:40
If stepd_connect() should jump to rwfail, then it will return a file
descriptor that was already close()ed.

Ticket: 22315
Changelog: slurmd - Avoid crash when slurmd has a communications
  failure with slurmstepd.
Cherry-picked: 3c944ee
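
The failure mode described in the commit message above (an error path that closes a descriptor and then hands the same number back to the caller) can be illustrated with a minimal sketch. This is not the actual stepd_connect() code, which is not shown on this page; connect_sketch() and its fail_handshake parameter are hypothetical names used only for illustration.

```c
#include <unistd.h>

/*
 * Hypothetical stand-in for a connect routine with an error label.
 * fail_handshake simulates a communications failure that happens after
 * the descriptor has already been opened.
 */
static int connect_sketch(int fd, int fail_handshake)
{
	if (fd < 0)
		return -1;

	if (fail_handshake)
		goto rwfail;

	return fd; /* success: the caller now owns the open descriptor */

rwfail:
	close(fd);
	/* Return -1, not fd: fd now names a closed descriptor. */
	return -1;
}
```

A caller that only checks for a negative return value can then never end up using (or closing again) a descriptor that was already closed on the error path.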
Cherry-pick !728 into slurm-24.11

See merge request SchedMD/dev/slurm!756
Cherry-pick !764 into slurm-24.11

See merge request SchedMD/dev/slurm!768
Entry had SLURM_COMMUNICATIONS_MISSING_SOCKET_ERROR when it should have
had SLURM_COMMUNICATIONS_INVALID_OUTGOING_FD.

Issue: 50321
Ticket: 22312
Cherry-picked: 9a6fa96
Cherry-pick !724 into slurm-24.11

See merge request SchedMD/dev/slurm!771
The allocated fields of the yaml_event_t in _yaml_to_data() were not being
freed.

Ticket: 22348
Changelog: Fix memory leak when parsing yaml input.
Cherry-picked: cd8303f
Continuation of commit d31cf03

Ticket: 21398
Cherry-picked: 2d3d07d
These plugins were removed previously.
Continuation of commit d31cf03

Ticket: 21398
Cherry-picked: 97218cb
Ticket: 21398
Cherry-picked: de20dc6
Cherry-pick !441 into slurm-24.11

See merge request SchedMD/dev/slurm!778
Cherry-pick !758 into slurm-24.11

See merge request SchedMD/dev/slurm!775
Cherry-pick !780 into slurm-24.11

See merge request SchedMD/dev/slurm!781
This is a regression from 2e60ebc. We should only validate and take
actions if part_desc->preempt_mode isn't set to NO_VAL16.

Changelog: Prevent slurmctld from showing error message about
 PreemptMode=GANG being a cluster-wide option for `scontrol update part`
 calls that don't attempt to modify partition PreemptMode.
Ticket: 22360
Cherry-picked: c8faf92
A partition without an explicit preempt_mode set is
NO_VAL16 which will test positive against PREEMPT_MODE_GANG. Only
preserve PREEMPT_MODE_GANG if the partition has an explicit preempt_mode
set.

See 509551c

Changelog: Fix setting GANG preemption on partition when updating
 PreemptMode with scontrol.
Ticket: 22360
Cherry-picked: e9a45ec
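
The two preempt_mode commits above hinge on a sentinel value that has the GANG bit set when treated as a bitmask. A minimal sketch of the problem and of the guard, assuming NO_VAL16 is 0xfffe and PREEMPT_MODE_GANG is 0x8000 (these values are an assumption about slurm.h, and the SKETCH_ names and helper functions are hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; check slurm.h before relying on them. */
#define SKETCH_NO_VAL16 ((uint16_t) 0xfffe)
#define SKETCH_PREEMPT_MODE_GANG ((uint16_t) 0x8000)

/* Naive test: treats the "not set" sentinel as a real bitmask. */
static bool naive_has_gang(uint16_t preempt_mode)
{
	return (preempt_mode & SKETCH_PREEMPT_MODE_GANG) != 0;
}

/* Guarded test: only inspect the bits when a mode was explicitly set. */
static bool checked_has_gang(uint16_t preempt_mode)
{
	if (preempt_mode == SKETCH_NO_VAL16)
		return false;
	return (preempt_mode & SKETCH_PREEMPT_MODE_GANG) != 0;
}
```

With these assumed values, naive_has_gang(SKETCH_NO_VAL16) is true because 0xfffe & 0x8000 is non-zero, while checked_has_gang() reports false for the unset sentinel and still detects an explicitly set GANG bit.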
Cherry-pick !765 into slurm-24.11

See merge request SchedMD/dev/slurm!785
Cherry-pick !789 into slurm-24.11

See merge request SchedMD/dev/slurm!801
If the slurmstepd.scope/slurmd cgroup was created while having CoreSpec
or MemSpec limits in the node, and then the spec limits were removed in
slurm.conf and the slurmd restarted, the slurmd cgroup would remain with
the old limits.

This commit unsets the cpu and memory limits of slurmstepd.scope/slurmd
cgroup at slurmd initialization.

Changelog: Fix CoreSpec and MemSpec limits not being removed from previously
 configured slurmd.
Ticket: 20943
Cherry-picked: 9be763a
Cherry-pick !738 into slurm-24.11

See merge request SchedMD/dev/slurm!804
Set mgr.shutdown_requested directly in atexit() callback instead of
calling conmgr_request_shutdown() which can deadlock.

Regression from 054247e.

Ticket: 22315
Changelog: Avoid race condition that could lead to a deadlock when slurmd,
  slurmstepd, slurmctld, slurmrestd or sackd have a fatal event.
Cherry-picked: 3991423
Changelog: Fix jobs using --ntasks-per-node and --mem pending
 forever when the requested memory divided by the number of CPUs
 surpasses the configured MaxMemPerCPU.
Ticket: 22163
Cherry-picked: 224213b
This removes an if-else block introduced in commit d413c8b. The else
block was always a no-op since detail_ptr->ntasks_per_node is never
expected to be NO_VAL16. If detail_ptr->ntasks_per_node is
non-zero it always sets detail_ptr->num_tasks. Therefore, we can remove
the if-else and assign the value directly.

Ticket: 22163
Cherry-picked: 9168dd0
If a user requests a node range, the task count is set as a minimum
when the job comes in. When the job is being scheduled we know how
many nodes we are going for, so we need to recalculate it then.

Ticket: 22163
Cherry-picked: 484d08b
Cherry-pick !686 into slurm-24.11

See merge request SchedMD/dev/slurm!806
Cherry-pick !729 into slurm-24.11

See merge request SchedMD/dev/slurm!809
Certain descriptions and subprojects were mixed up.

Cherry-picked: b062036
robertsbp and others added 30 commits April 22, 2025 20:50
Cherry-pick !762 into slurm-24.11

See merge request SchedMD/dev/slurm!1030
Ticket: 21891
Cherry-picked: 15a786f
Ticket: 21891
Cherry-picked: c05c957
Changelog: Permit configuring the number of retry attempts to destroy
 CXI service via the new destroy_retries SwitchParameter.
Ticket: 21891
Cherry-picked: 4c9158c
Cherry-pick !937 into slurm-24.11

See merge request SchedMD/dev/slurm!1032
Cherry-pick !992 into slurm-24.11

See merge request SchedMD/dev/slurm!1027
In slurmd we never set these limits, so we are not taking care of the
reset either.

Ticket: 20943
Changelog: Do not reset memory.high and memory.swap.max in slurmd startup or
 reconfigure as we are never really touching this in slurmd.
Cherry-picked: 701a809
Preparation for next commit.

Ticket: 20943
Cherry-picked: 277f47f
If slurmd was started manually, with CoreSpecLimits set, and then the
limits were completely removed from slurm.conf and slurmd reconfigured
e.g. with a scontrol reconfig, slurmd would fail to start complaining
about ENOSPC.

This happens because the cpuset.cpus and cpuset.mems limits cannot be
reset by writing an empty string to the interface if there is a process
in it (the slurmd itself).

The kernel seems to interpret the empty string as a request to remove all
cpus or mem nodes, even though cpuset.cpus/mems.effective shows all
the available cpus or mems when the interface is empty.

The solution is to explicitly specify all the cpus/mems available, so we
now read the cpuset cpus and mems effective from the parent cgroup and
apply them to the slurmd cgroup.

Regression caused by master commit 9be763a and, in 24.11, by commit 2504bdd.

Changelog: Fix reconfigure failure of slurmd when it has been started
 manually and the CoreSpecLimits have been removed from slurm.conf.
Ticket: 20943
Cherry-picked: 986d0b5
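
The approach described in the commit message above (read the parent's effective cpuset values and write them explicitly into the child, instead of writing an empty string) can be sketched roughly as follows. This is an illustration under stated assumptions, not the Slurm implementation: copy_effective_limit() is a hypothetical helper, and the directory and file names are whatever the caller passes in.

```c
#include <stdio.h>

/*
 * Read <parent_dir>/<src_name> and write its contents to
 * <child_dir>/<dst_name>. Returns 0 on success, -1 on error.
 * Hypothetical helper; nothing here is taken from the Slurm sources.
 */
static int copy_effective_limit(const char *parent_dir, const char *src_name,
				const char *child_dir, const char *dst_name)
{
	char path[4096], value[4096];
	FILE *fp;
	size_t len;

	snprintf(path, sizeof(path), "%s/%s", parent_dir, src_name);
	if (!(fp = fopen(path, "r")))
		return -1;
	len = fread(value, 1, sizeof(value) - 1, fp);
	fclose(fp);
	value[len] = '\0';

	/* Refuse to forward an empty value: that is the case to avoid. */
	if (len == 0 || value[0] == '\n')
		return -1;

	snprintf(path, sizeof(path), "%s/%s", child_dir, dst_name);
	if (!(fp = fopen(path, "w")))
		return -1;
	if (fwrite(value, 1, len, fp) != len) {
		fclose(fp);
		return -1;
	}
	return fclose(fp) == 0 ? 0 : -1;
}
```

On a real cgroup v2 hierarchy the call would presumably look like copy_effective_limit(parent, "cpuset.cpus.effective", child, "cpuset.cpus"), repeated for the mems files; the exact paths used by slurmd are not shown on this page.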
…stemd

When started with systemd, the CoreSpec limits were not reset, they were
only reset when starting manually.

Now every time slurmd is restarted it will inherit the limits of the parent
cgroup. If it is started manually it will be the ones from the slurmstepd
scope. If it is started with systemd it will normally be the ones from
/sys/fs/cgroup/system.slice.

Changelog: Set or reset CoreSpec limits when slurmd is reconfigured and it
 was started with systemd.
Ticket: 20943
Cherry-picked: 50426a5
Cherry-pick !995 into slurm-24.11

See merge request SchedMD/dev/slurm!1040
Fixes a regression added in 5108bde.

The slurmctld needs to unpack the step_ptr->switch_step from the save state
so that it can free a step's allocated VNI on REQUEST_STEP_COMPLETE when
the stepmgr is not enabled.

Ticket: 22602
Changelog: switch/hpe-slingshot - Make sure the slurmctld can free step
 VNIs after the controller restarts or reconfigures while the job is
 running.
Cherry-picked: 83708be
Cherry-pick !1011 into slurm-24.11

See merge request SchedMD/dev/slurm!1046
Cherry-pick !894 into slurm-24.11

See merge request SchedMD/dev/slurm!1045
Ticket: 21670
Cherry-picked: b13c8a0
Start with a verbatim copy, successive commits will simplify both.

Ticket: 21670
Cherry-picked: 1c9b55b
In case of a 2nd takeover by the backup slurmctld, the bit cache is already
initialized, but it's called with the same size. This shouldn't be an
issue and we can safely continue.

Ticket: 21670
Changelog: Fix backup slurmctld failure on 2nd takeover.
Cherry-picked: 3b62fec
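
The idea in the commit message above, tolerating a repeated initialization when the size is unchanged, can be sketched as below. The names are hypothetical, and rejecting a different size with -1 is an assumption of this sketch; the page does not say how Slurm handles that case.

```c
#include <stddef.h>

static size_t cache_size = 0; /* 0 means "not initialized" in this sketch */

/* Returns 0 on success, -1 if re-initialized with a different size. */
static int cache_init_sketch(size_t nbits)
{
	if (nbits == 0)
		return -1;

	if (cache_size == 0) {
		cache_size = nbits; /* first initialization */
		return 0;
	}

	/* Already initialized: the same size is harmless, so continue. */
	if (cache_size == nbits)
		return 0;

	return -1; /* assumption: a size change is treated as an error */
}
```

Calling cache_init_sketch() twice with the same size succeeds both times, which mirrors a second takeover reaching the same initialization path.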
If the query returns an empty result, the function _cluster_remove_wckeys()
returns early but does not free the query string or the mysql result.

Ticket: 20771
Cherry-picked: c4e4a62
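
An early return that skips cleanup, as described in the commit message above, is commonly avoided by routing every exit through a single cleanup label. The sketch below shows that general pattern only; it is not the _cluster_remove_wckeys() code, and the tracked_ helpers, live_allocs counter and remove_items_sketch() are hypothetical names. Plain heap strings stand in for the query string and the MySQL result.

```c
#include <stdlib.h>
#include <string.h>

static int live_allocs = 0; /* bookkeeping for this sketch only */

static char *tracked_dup(const char *s)
{
	char *p = malloc(strlen(s) + 1);
	if (p) {
		strcpy(p, s);
		live_allocs++;
	}
	return p;
}

static void tracked_free(char *p)
{
	if (p) {
		free(p);
		live_allocs--;
	}
}

/* result_rows stands in for the number of rows a query returned. */
static int remove_items_sketch(int result_rows)
{
	int rc = 0;
	char *query = tracked_dup("query text");
	char *result = tracked_dup("result handle");

	if (!query || !result) {
		rc = -1;
		goto cleanup;
	}

	if (result_rows == 0)
		goto cleanup; /* empty result: still release both */

	/* ... process rows here ... */
	rc = result_rows;

cleanup:
	tracked_free(result);
	tracked_free(query);
	return rc;
}
```

After either the empty-result path or the normal path, live_allocs is back to zero, which is the property the leak fix is after.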
vals is never xfree()-ed, leading to a definite leak.

Ticket: 20771
Cherry-picked: 26caa12
Cherry-pick !1069 into slurm-24.11

See merge request SchedMD/dev/slurm!1079
Cherry-pick !639 into slurm-24.11

See merge request SchedMD/dev/slurm!1076
This is only useful in a corner case used by the internal QA environment.

Cherry-picked: 4f2e6da
Cherry-pick !1092 into slurm-24.11

See merge request SchedMD/dev/slurm!1102