
Update appliance for stackhpc.openhpc nodegroup/partition changes #666


Draft: sjpb wants to merge 11 commits into main from feat/nodegroups-v1

Conversation

@sjpb sjpb commented May 8, 2025

TODO: update the description below once the PR referenced below merges

Changes to cope with stackhpc/ansible-role-openhpc#183, which replaced openhpc_slurm_partitions with openhpc_nodegroups and openhpc_partitions:

  • common openhpc config
  • skeleton templating
  • caas infra and ansible
  • stackhpc environment openhpc overrides
  • rebuild config - now automatic
  • stackhpc.openhpc:validate.yml should get called from ansible/validate.yml (see the sketch below)
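
For the validation item above, a minimal sketch of how such a hook could look in ansible/validate.yml, assuming a play targeting an openhpc group (the play layout and task name are illustrative, not taken from this PR):

- hosts: openhpc
  gather_facts: false
  tasks:
    # Run only the role's validation tasks, not the full role
    - name: Validate stackhpc.openhpc configuration
      ansible.builtin.import_role:
        name: stackhpc.openhpc
        tasks_from: validate.yml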

Note that:

  • This removes the templating of partitions from OpenTofu; these are now all configured via Ansible, using a new cluster_compute_groups variable templated into the inventory/hosts.yml file.
  • A default openhpc_nodegroups is calculated in environments/common/inventory/group_vars/all/openhpc.yml. This creates a node group per key in the OpenTofu compute variable. This is usually what is required, but this variable will need overriding if custom node-level Slurm parameters/configuration are required (e.g. for GRES).
  • A default openhpc_partitions is calculated. This includes one partition per node group, plus a rebuild partition if the rebuild group is active. This should not generally be overridden. Instead, normal (non-rebuild) partitions should be modified using openhpc_user_partitions, which uses the same format (see the sketch after this list).
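
To make the split concrete, a minimal sketch follows. The hosts.yml fragment is purely illustrative (the exact structure written by the OpenTofu templating is assumed, not reproduced from this PR); the openhpc_user_partitions fragment uses the name/nodegroups format exercised in the testing comments below.

# Illustrative shape of the compute group data templated into inventory/hosts.yml
all:
  vars:
    cluster_compute_groups:
      - standard
      - extra

# environments/<env>/inventory/group_vars/all/openhpc.yml
# Shape the user-facing partitions without overriding the calculated defaults
openhpc_user_partitions:
  - name: hpc
    nodegroups:
      - standard
      - extra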

This PR replaces #665.


sjpb commented May 8, 2025

TODO: test this by adding extra nodes in the .stackhpc env
[edit:] done, see below.

@sjpb sjpb marked this pull request as ready for review May 9, 2025 13:32
@sjpb sjpb requested a review from a team as a code owner May 9, 2025 13:32
@sjpb sjpb changed the title from "PoC of automating partition/nodegroup config" to "Update appliance for stackhpc.openhpc nodegroup/partition changes" May 9, 2025

sjpb commented May 9, 2025

Testing in the .stackhpc env by adding "extra" nodes:

  1. Default configuration, i.e. just adding nodes into 2nd compute group:
$ git diff environments/.stackhpc/tofu/main.tf
diff --git i/environments/.stackhpc/tofu/main.tf w/environments/.stackhpc/tofu/main.tf
index 8d78401b..ea58d4b0 100644
--- i/environments/.stackhpc/tofu/main.tf
+++ w/environments/.stackhpc/tofu/main.tf
@@ -84,7 +84,7 @@ module "cluster" {
         # Normally-empty partition for testing:
         extra: {
             nodes: []
-            #nodes: ["extra-0", "extra-1"]
+            nodes: ["extra-0", "extra-1"]
             flavor: var.other_node_flavor
         }
     }
$ ansible login -a sinfo
RL9-login-0 | CHANGED | rc=0 >>
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
extra        up 60-00:00:0      2   idle RL9-extra-[0-1]
standard*    up 60-00:00:0      2   idle RL9-compute-[0-1]

Ok.

  2. With the above, changing to an explicitly-configured partition covering both compute groups:
$ git diff environments/.stackhpc/inventory/group_vars/all/openhpc.yml
diff --git i/environments/.stackhpc/inventory/group_vars/all/openhpc.yml w/environments/.stackhpc/inventory/group_vars/all/openhpc.yml
index 5aac5f8a..6dd18e9e 100644
--- i/environments/.stackhpc/inventory/group_vars/all/openhpc.yml
+++ w/environments/.stackhpc/inventory/group_vars/all/openhpc.yml
@@ -1,3 +1,8 @@
 openhpc_config_extra:
   SlurmctldDebug: debug
   SlurmdDebug: debug
+openhpc_user_partitions:
+  - name: hpc
+    nodegroups:
+      - standard
+      - extra
$ ansible-playbook ansible/slurm.yml --tags openhpc
$ ansible login -a sinfo
RL9-login-0 | CHANGED | rc=0 >>
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
hpc*         up 60-00:00:0      4   idle RL9-compute-[0-1],RL9-extra-[0-1]

OK.

  3. With overlapping partitions:
$ git diff environments/.stackhpc/inventory/group_vars/all/openhpc.yml
diff --git i/environments/.stackhpc/inventory/group_vars/all/openhpc.yml w/environments/.stackhpc/inventory/group_vars/all/openhpc.yml
index 5aac5f8a..9235f8e0 100644
--- i/environments/.stackhpc/inventory/group_vars/all/openhpc.yml
+++ w/environments/.stackhpc/inventory/group_vars/all/openhpc.yml
@@ -1,3 +1,11 @@
 openhpc_config_extra:
   SlurmctldDebug: debug
   SlurmdDebug: debug
+openhpc_user_partitions:
+  - name: normal
+    nodegroups:
+      - standard
+  - name: all
+    nodegroups:
+      - standard
+      - extra
$ ansible-playbook ansible/slurm.yml --tags openhpc
$ ansible login -a sinfo
RL9-login-0 | CHANGED | rc=0 >>
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal       up 60-00:00:0      2   idle RL9-compute-[0-1]
all*         up 60-00:00:0      4   idle RL9-compute-[0-1],RL9-extra-[0-1]

OK.

@sjpb sjpb force-pushed the feat/nodegroups-v1 branch 4 times, most recently from a7b5cc1 to 32fb617 on May 9, 2025 15:44
@sjpb sjpb force-pushed the feat/nodegroups-v1 branch from 32fb617 to e471458 on May 9, 2025 15:47

sjpb commented May 9, 2025

WIP: testing on Azimuth as slurm-v21; running, but need to redeploy Slurm after the last force-push.


@jovial jovial left a comment


Looks sensible to me. Just need to wait for the openhpc role changes to merge so that we can update requirements.yml.
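
For reference, once the role changes merge, the requirements.yml update would typically be a pinned Galaxy roles entry; a sketch, with the version shown as a placeholder rather than the real tag:

roles:
  - name: stackhpc.openhpc
    src: https://github.com/stackhpc/ansible-role-openhpc.git
    version: main  # placeholder: pin to the release containing stackhpc/ansible-role-openhpc#183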


sjpb commented May 13, 2025

Now fully tested on CaaS.

@sjpb sjpb marked this pull request as draft May 13, 2025 11:18

sjpb commented May 13, 2025

Converted to draft because #668 must merge first.
