forked from SchedMD/slurm
Bump to 24.11.4 #75
Open
itkovian wants to merge 209 commits into hpcugent:24.11.ug from itkovian:24.11.4.ug
If stepd_connect() jumps to rwfail, it returns a file descriptor that has already been close()d. Ticket: 22315 Changelog: slurmd - Avoid crash when slurmd has a communications failure with slurmstepd. Cherry-picked: 3c944ee
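The failure mode described above can be sketched as follows. This is a minimal illustration, not the actual stepd_connect() code: `connect_sketch()`, `handshake_ok()` and the `simulate_failure` parameter are hypothetical names, and a pipe stands in for the real socket. The point is only that the error path invalidates the descriptor before returning it.

```c
#include <unistd.h>

/* Hypothetical stand-in for the read/write handshake done after connecting. */
static int handshake_ok(int fd, int simulate_failure)
{
	(void) fd;
	return !simulate_failure;
}

/* Returns an open descriptor on success, or -1 on failure. */
int connect_sketch(int simulate_failure)
{
	int fds[2];
	int fd;

	if (pipe(fds) < 0)
		return -1;
	close(fds[1]);		/* only the read end is used in this sketch */
	fd = fds[0];

	if (!handshake_ok(fd, simulate_failure))
		goto rwfail;
	return fd;

rwfail:
	close(fd);
	fd = -1;		/* never hand a closed descriptor back to the caller */
	return fd;
}
```

A caller can then test the result with `fd < 0` without risking a second close() on a descriptor number that may have been reused in the meantime.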
Cherry-pick !728 into slurm-24.11 See merge request SchedMD/dev/slurm!756
Ticket: 22370 Cherry-picked: 681a22a
Cherry-pick !764 into slurm-24.11 See merge request SchedMD/dev/slurm!768
Entry had SLURM_COMMUNICATIONS_MISSING_SOCKET_ERROR when it should have had SLURM_COMMUNICATIONS_INVALID_OUTGOING_FD. Issue: 50321 Ticket: 22312 Cherry-picked: 9a6fa96
Cherry-pick !724 into slurm-24.11 See merge request SchedMD/dev/slurm!771
The allocated fields of the yaml_event_t in _yaml_to_data() were not being freed. Ticket: 22348 Changelog: Fix memory leak when parsing yaml input. Cherry-picked: cd8303f
Ticket: 21398 Cherry-picked: 4e32ffd
Ticket: 21398 Cherry-picked: aae3e92
Ticket: 21398 Cherry-picked: de20dc6
Cherry-pick !441 into slurm-24.11 See merge request SchedMD/dev/slurm!778
Ticket: 22154 Cherry-picked: 5c7a059
Cherry-pick !758 into slurm-24.11 See merge request SchedMD/dev/slurm!775
Cherry-pick !780 into slurm-24.11 See merge request SchedMD/dev/slurm!781
This is a regression from 2e60ebc. We should only validate and take actions if part_desc->preempt_mode isn't set to NO_VAL16. Changelog: Prevent slurmctld from showing error message about PreemptMode=GANG being a cluster-wide option for `scontrol update part` calls that don't attempt to modify partition PreemptMode. Ticket: 22360 Cherry-picked: c8faf92
A partition without an explicit preempt_mode has the value NO_VAL16, which tests positive against PREEMPT_MODE_GANG. Only preserve PREEMPT_MODE_GANG if the partition has an explicit preempt_mode set. See 509551c Changelog: Fix setting GANG preemption on partition when updating PreemptMode with scontrol. Ticket: 22360 Cherry-picked: e9a45ec
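The reason an unset value tests positive against the GANG flag is plain bit arithmetic. The sketch below assumes that NO_VAL16 is 0xfffe and PREEMPT_MODE_GANG is 0x8000, as in slurm.h at the time of writing; check the headers in use. The helper names `naive_has_gang()` and `explicit_has_gang()` are hypothetical and are not Slurm functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror slurm.h; verify against the headers you build with. */
#define NO_VAL16 ((uint16_t) 0xfffe)
#define PREEMPT_MODE_GANG ((uint16_t) 0x8000)

/* Naive check: also true when the mode is unset (NO_VAL16). */
bool naive_has_gang(uint16_t preempt_mode)
{
	return (preempt_mode & PREEMPT_MODE_GANG) != 0;
}

/* Guarded check: only an explicitly set mode can carry the GANG flag. */
bool explicit_has_gang(uint16_t preempt_mode)
{
	return (preempt_mode != NO_VAL16) &&
	       ((preempt_mode & PREEMPT_MODE_GANG) != 0);
}
```

Under these assumed values, 0xfffe & 0x8000 is non-zero, so the naive check reports GANG for a partition that never set a preempt mode, while the guarded check does not.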
Cherry-pick !765 into slurm-24.11 See merge request SchedMD/dev/slurm!785
Ticket: 22390 Cherry-picked: aef437b
Cherry-pick !789 into slurm-24.11 See merge request SchedMD/dev/slurm!801
If the slurmstepd.scope/slurmd cgroup was created while having CoreSpec or MemSpec limits in the node, and then the spec limits were removed in slurm.conf and the slurmd restarted, the slurmd cgroup would remain with the old limits. This commit unsets the cpu and memory limits of slurmstepd.scope/slurmd cgroup at slurmd initialization. Changelog: Fix CoreSpec and MemSpec limits not being removed from previously configured slurmd. Ticket: 20943 Cherry-picked: 9be763a
Cherry-pick !738 into slurm-24.11 See merge request SchedMD/dev/slurm!804
Set mgr.shutdown_requested directly in atexit() callback instead of calling conmgr_request_shutdown() which can deadlock. Regression from 054247e. Ticket: 22315 Changelog: Avoid race condition that could lead to a deadlock when slurmd, slurmstepd, slurmctld, slurmrestd or sackd have a fatal event. Cherry-picked: 3991423
Changelog: Fix jobs using --ntasks-per-node and --mem pending forever when the requested memory divided by the number of CPUs surpasses the configured MaxMemPerCPU. Ticket: 22163 Cherry-picked: 224213b
This removes an if-else block introduced in commit d413c8b. The else block was always a no-op since detail_ptr->ntasks_per_node is never expected to be NO_VAL16. If detail_ptr->ntasks_per_node is non-zero it always sets detail_ptr->num_tasks. Therefore, we can remove the if-else and assign the value directly. Ticket: 22163 Cherry-picked: 9168dd0
If a user requests a node range, the task count is set as a minimum when the job comes in. When the job is being scheduled, we know how many nodes it will get, so the task count needs to be recalculated then. Ticket: 22163 Cherry-picked: 484d08b
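As a rough illustration of the recalculation described above (a sketch under stated assumptions, not the scheduler code): with a fixed tasks-per-node value, the task count derived from the minimum of a node range is only a lower bound, and it has to be recomputed once the actual node count is known. The function name `tasks_for_nodes()` is hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: task count implied by a tasks-per-node value and a
 * node count. A tasks_per_node of 0 means "not set" in this sketch.
 */
uint32_t tasks_for_nodes(uint16_t tasks_per_node, uint32_t node_count)
{
	if (tasks_per_node == 0)
		return 0;
	return (uint32_t) tasks_per_node * node_count;
}
```

For a request of 2 to 4 nodes with 8 tasks per node, the submission-time value is `tasks_for_nodes(8, 2)`, and if the job is later scheduled on 3 nodes the value becomes `tasks_for_nodes(8, 3)`.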
Cherry-pick !686 into slurm-24.11 See merge request SchedMD/dev/slurm!806
Cherry-pick !729 into slurm-24.11 See merge request SchedMD/dev/slurm!809
Certain descriptions and subprojects were mixed up. Cherry-picked: b062036
Cherry-pick !762 into slurm-24.11 See merge request SchedMD/dev/slurm!1030
Ticket: 21891 Cherry-picked: d535b07
Ticket: 21891 Cherry-picked: 15a786f
Ticket: 21891 Cherry-picked: c05c957
Changelog: Permit configuring the number of retry attempts to destroy CXI service via the new destroy_retries SwitchParameter. Ticket: 21891 Cherry-picked: 4c9158c
Cherry-pick !937 into slurm-24.11 See merge request SchedMD/dev/slurm!1032
Cherry-pick !992 into slurm-24.11 See merge request SchedMD/dev/slurm!1027
In slurmd we never set these limits, so we are not taking care of the reset either. Ticket: 20943 Changelog: Do not reset memory.high and memory.swap.max in slurmd startup or reconfigure as we are never really touching this in slurmd. Cherry-picked: 701a809
Ticket: 20943 Cherry-picked: b1b1947
Preparation for next commit. Ticket: 20943 Cherry-picked: 277f47f
If slurmd was started manually with CoreSpecLimits set, and the limits were then completely removed from slurm.conf and slurmd reconfigured (e.g. with scontrol reconfig), slurmd would fail to start, complaining about ENOSPC. This happens because the cpuset.cpus and cpuset.mems limits cannot be reset by writing an empty string to the interface if there is a process in the cgroup (the slurmd itself). The kernel seems to interpret the empty string as a request to remove all cpus or mem nodes, even though cpuset.cpus.effective and cpuset.mems.effective show all the available cpus or mems when the interface is empty. The solution is to explicitly specify all the cpus/mems available, so we now read the effective cpuset cpus and mems from the parent cgroup and apply them to the slurmd cgroup. Regression caused by master commit 9be763a and 24.11 commit 2504bdd. Changelog: Fix reconfigure failure of slurmd when it has been started manually and the CoreSpecLimits have been removed from slurm.conf. Ticket: 20943 Cherry-picked: 986d0b5
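The approach described above (copy the parent's effective value instead of writing an empty string) can be sketched like this. It is a simplified illustration using stdio on caller-supplied paths, not the cgroup plugin code; `copy_effective_limit()` is a hypothetical name. On a real system the source would be something like the parent cgroup's cpuset.cpus.effective file and the destination the slurmd cgroup's cpuset.cpus file.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the single-line value in src_path into dst_path.
 * Returns 0 on success, -1 on any error or if the source value is empty.
 */
int copy_effective_limit(const char *src_path, const char *dst_path)
{
	char buf[4096];
	FILE *src, *dst;
	int rc;

	src = fopen(src_path, "r");
	if (!src)
		return -1;
	if (!fgets(buf, sizeof(buf), src)) {
		fclose(src);
		return -1;
	}
	fclose(src);

	buf[strcspn(buf, "\n")] = '\0';	/* strip the trailing newline */
	if (buf[0] == '\0')
		return -1;		/* never write an empty value */

	dst = fopen(dst_path, "w");
	if (!dst)
		return -1;
	rc = (fputs(buf, dst) == EOF) ? -1 : 0;
	if (fclose(dst) != 0)
		rc = -1;
	return rc;
}
```

Refusing to write an empty value keeps the sketch from reproducing the original problem, where an empty string is taken as a request to remove every cpu or memory node.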
Ticket: 20943 Cherry-picked: 4174629
…stemd When started with systemd, the CoreSpec limits were not reset; they were only reset when starting manually. Now every time slurmd is restarted it inherits the limits of the parent cgroup. If it is started manually, these are the limits of the slurmstepd scope. If it is started with systemd, they are normally those of /sys/fs/cgroup/system.slice. Changelog: Set or reset CoreSpec limits when slurmd is reconfigured and it was started with systemd. Ticket: 20943 Cherry-picked: 50426a5
Cherry-pick !995 into slurm-24.11 See merge request SchedMD/dev/slurm!1040
Fixes a regression added in 5108bde. The slurmctld needs to unpack the step_ptr->switch_step from the save state so that it can free a step's allocated VNI on REQUEST_STEP_COMPLETE when the stepmgr is not enabled. Ticket: 22602 Changelog: switch/hpe-slingshot - Make sure the slurmctld can free step VNIs after the controller restarts or reconfigures while the job is running. Cherry-picked: 83708be
Cherry-pick !1011 into slurm-24.11 See merge request SchedMD/dev/slurm!1046
Cherry-pick !894 into slurm-24.11 See merge request SchedMD/dev/slurm!1045
Ticket: 21670 Cherry-picked: b13c8a0
Start with a verbatim copy, successive commits will simplify both. Ticket: 21670 Cherry-picked: 1c9b55b
Ticket: 21670 Cherry-picked: c3c5cf0
Ticket: 21670 Cherry-picked: 588e05b
On a 2nd takeover by the backup slurmctld, the bit cache is already initialized, but the initialization is called again with the same size. This shouldn't be an issue and we can safely continue. Ticket: 21670 Changelog: Fix backup slurmctld failure on 2nd takeover. Cherry-picked: 3b62fec
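The idea of tolerating a repeated initialization with an unchanged size can be sketched as below. This is an assumption-laden illustration, not the Slurm bitstring cache: `cache_init_sketch()` and the static `cache_size` are hypothetical, and the real code may handle a size mismatch differently.

```c
#include <stddef.h>

/* 0 means "not initialized yet" in this sketch. */
static size_t cache_size = 0;

/*
 * Returns 0 if the cache is (now) initialized for nbits, including the case
 * where it was already initialized with the same size. Returns -1 for a zero
 * size or for a re-initialization with a different size.
 */
int cache_init_sketch(size_t nbits)
{
	if (nbits == 0)
		return -1;
	if (cache_size == 0) {
		cache_size = nbits;
		return 0;
	}
	if (cache_size == nbits)
		return 0;	/* repeated init with the same size is harmless */
	return -1;
}
```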
If the query returns an empty result, the function _cluster_remove_wckeys() returns early but does not free the query string or the mysql result. Ticket: 20771 Cherry-picked: c4e4a62
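A common way to avoid this kind of leak is to route every exit, including the early one, through a single cleanup label. The sketch below is a generic illustration and not the _cluster_remove_wckeys() code: all names are hypothetical, and a small allocation counter stands in for the query string and result so the behaviour can be checked.

```c
#include <stdlib.h>
#include <string.h>

static int live_allocs = 0;	/* allocations not yet freed */

static char *tracked_strdup(const char *s)
{
	char *p = malloc(strlen(s) + 1);
	if (p) {
		strcpy(p, s);
		live_allocs++;
	}
	return p;
}

static void tracked_free(char *p)
{
	if (p) {
		free(p);
		live_allocs--;
	}
}

int outstanding_allocations(void)
{
	return live_allocs;
}

/* Returns the number of rows handled, or -1 on allocation failure. */
int remove_rows_sketch(int row_count)
{
	int rc = 0;
	char *query = tracked_strdup("placeholder query text");

	if (!query)
		return -1;
	if (row_count == 0)
		goto cleanup;	/* empty result: leave early, but still free */
	rc = row_count;		/* stand-in for the real per-row work */

cleanup:
	tracked_free(query);
	return rc;
}
```

After either path, `outstanding_allocations()` returns 0, which is the property the fix restores for the empty-result case.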
vals is never xfree()-ed, leading to a definite leak. Ticket: 20771 Cherry-picked: 26caa12
Cherry-pick !1069 into slurm-24.11 See merge request SchedMD/dev/slurm!1079
Cherry-pick !639 into slurm-24.11 See merge request SchedMD/dev/slurm!1076
This is only useful in a corner case used by the internal QA environment. Cherry-picked: 4f2e6da
Cherry-pick !1092 into slurm-24.11 See merge request SchedMD/dev/slurm!1102
No description provided.