
Update appliance for stackhpc.openhpc nodegroup/partition changes #666


Draft: sjpb wants to merge 11 commits into main from feat/nodegroups-v1

Conversation

@sjpb sjpb commented May 8, 2025

TODO: update the description below once the PR referenced below merges

Changes to cope with stackhpc/ansible-role-openhpc#183, which replaced openhpc_slurm_partitions with openhpc_nodegroups and openhpc_partitions:

  • common openhpc config
  • skeleton templating
  • caas infra and ansible
  • stackhpc environment openhpc overrides
  • rebuild config - now automatic
  • stackhpc.openhpc:validate.yml should get called from ansible/validate.yml (see the sketch below)
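
For the validation item above, a minimal sketch of how such a hook could look in ansible/validate.yml, assuming a play targeting an openhpc group (the play layout and task name are illustrative, not taken from this PR):

- hosts: openhpc
  gather_facts: false
  tasks:
    # Run only the role's validation tasks, not the full role
    - name: Validate stackhpc.openhpc configuration
      ansible.builtin.import_role:
        name: stackhpc.openhpc
        tasks_from: validate.yml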

Note that:

  • This removes the templating of partitions from OpenTofu; these are now all configured via Ansible, using a new cluster_compute_groups variable templated into the inventory/hosts.yml file.
  • A default openhpc_nodegroups is calculated in environments/common/inventory/group_vars/all/openhpc.yml. This creates a node group per key in the OpenTofu compute variable. This is usually what is required, but this variable will need overriding if custom node-level Slurm parameters/configuration are required (e.g. for GRES).
  • A default openhpc_partitions is calculated. This includes one partition per node group, plus a rebuild partition if the rebuild group is active. This should not generally be overridden. Instead, normal (non-rebuild) partitions should be modified using openhpc_user_partitions, which uses the same format (see the sketch after this list).
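
To make the split concrete, a minimal sketch follows. The hosts.yml fragment is purely illustrative (the exact structure written by the OpenTofu templating is assumed, not reproduced from this PR); the openhpc_user_partitions fragment uses the name/nodegroups format exercised in the testing comments below.

# Illustrative shape of the compute group data templated into inventory/hosts.yml
all:
  vars:
    cluster_compute_groups:
      - standard
      - extra

# environments/<env>/inventory/group_vars/all/openhpc.yml
# Shape the user-facing partitions without overriding the calculated defaults
openhpc_user_partitions:
  - name: hpc
    nodegroups:
      - standard
      - extra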

This PR replaces #665.


sjpb commented May 8, 2025

TODO: test this by adding extra nodes in the .stackhpc env
[edit:] done, see below.

@sjpb sjpb marked this pull request as ready for review May 9, 2025 13:32
@sjpb sjpb requested a review from a team as a code owner May 9, 2025 13:32
@sjpb sjpb changed the title from "PoC of automating partition/nodegroup config" to "Update appliance for stackhpc.openhpc nodegroup/partition changes" May 9, 2025

sjpb commented May 9, 2025

Testing in the .stackhpc env by adding "extra" nodes:

  1. Default configuration, i.e. just adding nodes into 2nd compute group:
$ git diff environments/.stackhpc/tofu/main.tf
diff --git i/environments/.stackhpc/tofu/main.tf w/environments/.stackhpc/tofu/main.tf
index 8d78401b..ea58d4b0 100644
--- i/environments/.stackhpc/tofu/main.tf
+++ w/environments/.stackhpc/tofu/main.tf
@@ -84,7 +84,7 @@ module "cluster" {
         # Normally-empty partition for testing:
         extra: {
             nodes: []
-            #nodes: ["extra-0", "extra-1"]
+            nodes: ["extra-0", "extra-1"]
             flavor: var.other_node_flavor
         }
     }
$ ansible login -a sinfo
RL9-login-0 | CHANGED | rc=0 >>
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
extra        up 60-00:00:0      2   idle RL9-extra-[0-1]
standard*    up 60-00:00:0      2   idle RL9-compute-[0-1]

Ok.

  2. With the above, changing to an explicitly-configured partition covering both compute groups:
$ git diff environments/.stackhpc/inventory/group_vars/all/openhpc.yml
diff --git i/environments/.stackhpc/inventory/group_vars/all/openhpc.yml w/environments/.stackhpc/inventory/group_vars/all/openhpc.yml
index 5aac5f8a..6dd18e9e 100644
--- i/environments/.stackhpc/inventory/group_vars/all/openhpc.yml
+++ w/environments/.stackhpc/inventory/group_vars/all/openhpc.yml
@@ -1,3 +1,8 @@
 openhpc_config_extra:
   SlurmctldDebug: debug
   SlurmdDebug: debug
+openhpc_user_partitions:
+  - name: hpc
+    nodegroups:
+      - standard
+      - extra
$ ansible-playbook ansible/slurm.yml --tags openhpc
$ ansible login -a sinfo
RL9-login-0 | CHANGED | rc=0 >>
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
hpc*         up 60-00:00:0      4   idle RL9-compute-[0-1],RL9-extra-[0-1]

OK.

  3. With overlapping partitions:
$ git diff environments/.stackhpc/inventory/group_vars/all/openhpc.yml
diff --git i/environments/.stackhpc/inventory/group_vars/all/openhpc.yml w/environments/.stackhpc/inventory/group_vars/all/openhpc.yml
index 5aac5f8a..9235f8e0 100644
--- i/environments/.stackhpc/inventory/group_vars/all/openhpc.yml
+++ w/environments/.stackhpc/inventory/group_vars/all/openhpc.yml
@@ -1,3 +1,11 @@
 openhpc_config_extra:
   SlurmctldDebug: debug
   SlurmdDebug: debug
+openhpc_user_partitions:
+  - name: normal
+    nodegroups:
+      - standard
+  - name: all
+    nodegroups:
+      - standard
+      - extra
$ ansible-playbook ansible/slurm.yml --tags openhpc
$ ansible login -a sinfo
RL9-login-0 | CHANGED | rc=0 >>
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal       up 60-00:00:0      2   idle RL9-compute-[0-1]
all*         up 60-00:00:0      4   idle RL9-compute-[0-1],RL9-extra-[0-1]

OK.

@sjpb sjpb force-pushed the feat/nodegroups-v1 branch 4 times, most recently from a7b5cc1 to 32fb617 on May 9, 2025 15:44
@sjpb sjpb force-pushed the feat/nodegroups-v1 branch from 32fb617 to e471458 on May 9, 2025 15:47

sjpb commented May 9, 2025

WIP: testing on Azimuth as slurm-v21; running, but need to redeploy Slurm after the last force-push.


@jovial jovial left a comment


Looks sensible to me. Just need to wait for the openhpc role changes to merge so that we can update requirements.yml.
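
For reference, once the role changes merge, the requirements.yml update would typically be a pinned Galaxy roles entry; a sketch, with the version shown as a placeholder rather than the real tag:

roles:
  - name: stackhpc.openhpc
    src: https://github.com/stackhpc/ansible-role-openhpc.git
    version: main  # placeholder: pin to the release containing stackhpc/ansible-role-openhpc#183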


sjpb commented May 13, 2025

Now fully tested on CaaS.

@sjpb sjpb marked this pull request as draft May 13, 2025 11:18

sjpb commented May 13, 2025

Converted to draft because #668 must merge first.
