Releases: stackhpc/ansible-slurm-appliance

v2.1

25 Jun 11:23
933dcf4

What's Changed

  • Pass templated fqdn to ansible by @sjpb in #702
  • Fix NHC downing nodes due to /efi being mounted twice by @sjpb in #719

Full Changelog: v2.0...v2.1

Images

There are no new images at this release.

v2.0

24 Jun 16:01
5f2f931

Key Changes

  • ⚠️ BACKWARDS-INCOMPATIBLE CHANGE ⚠️ To allow adding support for Multi-Instance GPUs in #656, the variable openhpc_slurm_partitions has been replaced by openhpc_nodegroups and openhpc_partitions. See #666 for full details and examples.
  • The tuned profile hpc-compute is now usable for nodes with hugepages enabled. See #672.
  • Support for LBNL's Node Health Checks was added in #654. See ansible/roles/nhc/README.md.
  • Support for configuring Multi-Instance GPUs (MIG) was added in #656. See docs/mig.md for full details.
  • Changes to the stackhpc.openhpc role for the above also mean that parameters can be removed from slurm.conf. See stackhpc/ansible-role-openhpc#184.
  • NVIDIA (open) drivers were upgraded to version 575.57.08 with CUDA 12.9.1 in #703.
  • Lustre bug LU-18085 affecting Rocky Linux 9 was fixed in #688.
  • The appliance now raises an error if Ansible Galaxy installs do not match requirements.yml. See #700.
  • OS packages were updated for both Rocky Linux 8 and 9 in #707, with the latter now having the last updates for Rocky Linux 9.5.

What's Changed

All PRs, oldest first:

  • Fix typo in comment by @priteau in #675
  • Update appliance for stackhpc.openhpc nodegroup/partition changes by @sjpb in #666
  • Bump CUDA to 12.9 and NVIDIA driver to 575 by @priteau in #687
  • Fix environment creation from skeleton by @priteau in #682
  • Make home volume creation optional by @sjpb in #673
  • Add a simple index in the docs README. by @MoteHue in #669
  • Make packer var image_disk_format functional by @sjpb in #694
  • Remove description of un-implemented dummy interface/default route by @sjpb in #689
  • Allow specifying instance root volume type by @sjpb in #693
  • Add fix for Lustre bug LU-18085 by @sjpb in #688
  • Fix docs for cacerts role to mention cacerts_cert_dir by @technowhizz in #696
  • Update operations.md typo by @technowhizz in #697
  • Change host definition for cacert play to allow builder by @technowhizz in #698
  • Add validation for OpenTofu compute and login variables by @sjpb in #674
  • Remove incorrect note re ondemand for demo deployment by @sjpb in #695
  • Fix nvidia build at open driver version 575.57.08 with cuda 12.9.1 by @jovial in #703
  • Fix tuned hpc-compute with hugepages and verify applied profile by @sjpb in #672
  • Support fixed IP addresses for nodes by @priteau in #643
  • Ensure Ansible Galaxy installs are up to date by @sjpb in #700
  • Bump all Pulp snapshots to latest versions in RL 8.x, RL 9.5 by @priteau in #707
  • Add support for Node Health Checks by @sjpb in #654
  • Add support for configuring Multi-Instance GPUs (MIG) by @jovial in #656
  • Bump fatimage after MIG PR656 merge by @sjpb in #716

Full Changelog: v1.161...test

Images

Two new images are available:

  • RL8: openhpc-RL8-250624-0854-75099868
  • RL9: openhpc-RL9-250624-0854-75099868
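
The image names listed in these release notes appear to follow a common pattern. The sketch below splits a name into its parts. It assumes (this is inferred from the names, not documented here) that the layout is openhpc-[variant-]OS-YYMMDD-HHMM-hash, where the final field is a short git commit hash. The `ImageName` type and `parse_image_name` function are hypothetical helpers, not part of the appliance.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class ImageName(NamedTuple):
    variant: Optional[str]  # e.g. "cuda"; None for the standard image
    os_release: str         # e.g. "RL8" or "RL9"
    built: datetime         # build timestamp (assumed YYMMDD-HHMM)
    commit: str             # assumed short git commit hash


# Assumed layout: openhpc-[<variant>-]<os>-<YYMMDD>-<HHMM>-<hash>
_PATTERN = re.compile(
    r"^openhpc-(?:(?P<variant>[a-z]+)-)?(?P<os>RL\d+)-"
    r"(?P<date>\d{6})-(?P<time>\d{4})-(?P<commit>[0-9a-f]{8})$"
)


def parse_image_name(name: str) -> ImageName:
    """Split an image name into its fields, or raise ValueError."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised image name: {name!r}")
    built = datetime.strptime(match["date"] + match["time"], "%y%m%d%H%M")
    return ImageName(match["variant"], match["os"], built, match["commit"])


image = parse_image_name("openhpc-RL8-250624-0854-75099868")
```

Sorting parsed names by the `built` field gives a simple way to pick the most recent image per OS release, provided the timestamp assumption holds.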

v1.161

15 May 16:17
980f108

What's Changed

Bumps Slurm versions to fix CVE-2025-43904:

  • Upgrade to OpenHPC/Slurm versions RL9=3.1.1/24.11.5 RL8=2.9.1/23.11.11 by @sjpb in #668
  • Perform Slurm database upgrade if necessary by @sjpb in #670
  • Automate image release by @sjpb in #671

Caution

This is a Slurm major version update for RockyLinux 9 (= OpenHPC v3) clusters.

These clusters will perform a Slurm database upgrade on slurmdbd startup. They will back up the entire state volume via a volume snapshot before performing the upgrade. See #670 and linked dependency PRs for full information.

Full Changelog: v1.160...v1.161

Images

Two new images are available:

  • RockyLinux 8: openhpc-RL8-250514-1502-5a923b2c
  • RockyLinux 9: openhpc-RL9-250514-1502-5a923b2c

v1.161-rc1

14 May 10:08
5a7608b
Pre-release

What's Changed

Bumps Slurm versions to fix CVE-2025-43904:

  • Upgrade to OpenHPC/Slurm versions RL9=3.1.1/24.11.5 RL8=2.9.1/23.11.11 by @sjpb in #668

Caution

This is a Slurm major version update for RockyLinux 9 (= OpenHPC v3) clusters.

These clusters will perform a Slurm database upgrade on slurmdbd startup. The startup timeout for that service has been increased to 45 minutes to allow for that. However, it is recommended that this database (in /var/lib/state/mysql on the control node) is backed up before starting slurmdbd, for example by snapshotting the $CLUSTER_NAME-state volume after the reimage (so the service is stopped) but before running the site.yml playbook.
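
As a rough illustration of the recommended backup step, the sketch below builds and runs an OpenStack CLI call that snapshots the state volume. It assumes the `openstack` client is installed and credentials are already set in the environment. The function names and the snapshot name are hypothetical. `--force` is included on the assumption that the volume is still attached to the control node.

```python
import subprocess
from typing import List


def state_snapshot_command(cluster_name: str, snapshot_name: str) -> List[str]:
    """Build the OpenStack CLI call that snapshots <cluster_name>-state."""
    return [
        "openstack", "volume", "snapshot", "create",
        "--volume", f"{cluster_name}-state",
        "--force",  # assumption: the volume is attached to the control node
        snapshot_name,
    ]


def snapshot_state_volume(cluster_name: str, snapshot_name: str) -> None:
    """Run the snapshot command; raises CalledProcessError on failure."""
    subprocess.run(state_snapshot_command(cluster_name, snapshot_name), check=True)
```

For example, `snapshot_state_volume("mycluster", "mycluster-state-pre-upgrade")` would be run after the reimage and before the site.yml playbook. Check that the snapshot has completed before starting slurmdbd.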

Full Changelog: v1.160...v1.161

Images

Two new images are available:

  • RockyLinux 8: openhpc-RL8-250513-1045-ca44f898
  • RockyLinux 9: openhpc-RL9-250513-1046-ca44f898

v1.160

09 May 13:10
01b5aa8

What's Changed

  • Allow enabling package installs for caas clusters via extravars by @sjpb in #667

Full Changelog: v1.159...v1.160

There are no new images for this release, see v1.159.

v1.159

08 May 10:10
611513c

What's Changed

In summary:

  • Updated OS dnf packages
  • Updated NVIDIA driver and CUDA packages, for sites building images including the cuda group
  • Updated grafana to v10
  • Various fixes (mostly for root-squashed NFS home directory mounts) and feature completion
  • Improved documentation
  • Fixed the Zenith proxy in CaaS clusters for RL9

  • Compute-init: cope with root-squashed nfs clients by @bertiethorpe in #627
  • Update terraform provider openstack to v3 by @sd109 in #578
  • Fix some typos by @priteau in #629
  • Add docs with sequence diagrams for operations by @sjpb in #456
  • Update nvidia drivers (to 570-open) CUDA packages (to 12.8.1-1) and samples playbook by @priteau in #628
  • Fix dropin directory creation by @jovial in #631
  • Test upgrade from latest release to current branch image in CI by @sjpb in #576
  • Compute-Init: wait for cloud-init before NFS mount by @JohnGarbutt in #635
  • Update dnf repos using latest Pulp timestamps (plus tooling) by @sjpb in #621
  • Ensure no_proxy entries are unique by @technowhizz in #633
  • Fix typos in docs by @priteau in #639
  • Correct vnic_types var name in skeleton variables by @MoteHue in #640
  • Document (and test) slurm controlled rebuild configuration and usage by @sjpb in #634
  • Fix site.yml hanging on initial deploy by @sjpb in #648
  • Fix cuda installs by @MoteHue in #652
  • Use checksum verification for CernVM-FS GPG key by @priteau in #641
  • fix nightly cleanup for duplicate server names by @bertiethorpe in #653
  • Add support for alertmanager by @sjpb in #649
  • Fix fatimage build without alertmanager secret by @sjpb in #655
  • Fix typos in docs by @priteau in #658
  • Change fat image build to create raw image for speed by @JohnGarbutt in #650
  • Allow empty items in extra package and user lists by @priteau in #637
  • Fix nightly-cleanup workflow by @bertiethorpe in #660
  • Fix creation of hpctests directory by @priteau in #659
  • Default hpctests_group to hpctests_user by @sjpb in #663
  • Fix caas zenith/hpctests/basic_users by @sjpb in #662
  • Update grafana to v10 using Ark rpms by @sjpb in #664
  • Allow modifying nodes fully-qualified name by @sjpb in #651

Full Changelog: v1.158...v1.159

Images

Two new images are available:

  • RockyLinux 8: openhpc-RL8-250506-1259-abb6394b
  • RockyLinux 9: openhpc-RL9-250506-1259-abb6394b

v1.158

19 Mar 08:49
120bcfc

What's Changed

New features

  • Support multiple networks in OpenTofu configurations by @sjpb in #548
  • Support attaching FIPs to login nodes by @sjpb in #572
  • Support for configuring chrony by @jovial in #575
  • Control default routes on boot by @sjpb in #617
  • Support mapping compute & login instances to Ironic nodes by @sjpb in #573
  • Add support for configuring CA certificates by @sjpb in #574

Important fixes and changes from previous release

  • Support lustre on Rocky 8 by @jovial in #566
  • Fix lustre IP route detection if there is no gateway by @jovial in #567
  • Support sshd password authentication on Rocky 8 by @jovial in #565
  • Ensure oddjobd is enabled/started by @jovial in #564
  • Add lustre_repo variable by @jovial in #563
  • Define login nodes using an opentofu module by @sjpb in #547
  • Lower hpl memory fraction to reduce stress from defaults by @sjpb in #591
  • Root-squash nfs exports by default by @sjpb in #599
  • Restrict all nfs shares to nfs group IPs by @sjpb in #607
  • Lustre: Harden mount options by @jovial in #618
  • Manila/CephFS and NFS: harden mounts to prevent setuid and devices by @sjpb in #619

Other changes

  • Read k3s_token from secrets.yml file by @sjpb in #540
  • Remove slurm_openstack_tools collection by @sjpb in #537
  • Rename terraform/ directories to tofu/ by @sjpb in #541
  • Fix squid/dnf ordering problem by @sjpb in #546
  • Optionally ignore image changes in TF by @bertiethorpe in #545
  • Change docs/ references from Terraform to OpenTofu by @bertiethorpe in #544
  • avoid tf updates to login/compute on control delete/recreate by @sjpb in #555
  • Set k3s node IP from access network IP by @sjpb in #556
  • docs: update README to use new network syntax by @priteau in #560
  • Support compute node rebuild/reboot via Slurm RebootProgram by @bertiethorpe in #553
  • Document compute-init image requirements by @sjpb in #569
  • Support tuned in compute-init by @sjpb in #570
  • Support memory limits and pam no-login in compute-init by @bertiethorpe in #568
  • docs: fix OpenTofu file names in README by @priteau in #562
  • Support sssd and sshd in compute-init by @bertiethorpe in #571
  • Reword recommendation about image by @priteau in #580
  • Fix link to Open OnDemand documentation by @priteau in #584
  • Fix some typos by @priteau in #583
  • Make no_proxy list more configurable by @sd109 in #579
  • Fix wrong path to Ansible inventory by @priteau in #587
  • Support setting PYTHON_VERSION by @priteau in #588
  • Disable compute-init by default & warn of security issue by @sjpb in #585
  • Fix basic_users not modifying default nfs-shared home correctly by @sjpb in #590
  • Support disabling port security by @sjpb in #592
  • Use bootstrap tokens provisioned by ansible for K3s instead of persistent tokens in cloud-init metadata by @wtripp180901 in #589
  • Fixed bootstrap tokens not being idempotent by @wtripp180901 in #597
  • Fix: Support networks not owned by openstack project by @bertiethorpe in #598
  • Remove support for setting VNIC binding profiles by @priteau in #586
  • Prevent nfs being mounted by tunnelling/forwarding through login node by @sjpb in #595
  • Enable lustre in compute-init by @bertiethorpe in #581
  • Fix OpenTofu execution as admin by @priteau in #582
  • FIX: Tofu attempts to apply security groups when port_security_enabled is false by @bertiethorpe in #601
  • Add file deletion to cleanup play by @sjpb in #600
  • Disable nightly builds by @bertiethorpe in #603
  • Fix chrony for nodes w/o network access (yet) by @sjpb in #605
  • Fix typo in variables.tf by @technowhizz in #609
  • Compute-init: Optimise dir copies + Numerical sort playbook + new nodes to existing cluster by @bertiethorpe in #611
  • Fix builds not in stackhpc env by @sjpb in #615
  • Fix documentation of sssd_install_ldap variable by @priteau in #613
  • docs: fix typo by @priteau in #623
  • Updated README so image consistent with codebase by @wtripp180901 in #610
  • Add image share script by @sjpb in #624
  • Enable creating users with local homedirs by @sjpb in #626

Full Changelog: v1.157...v1.158

New images

Two new images are available:

  • RockyLinux 8: openhpc-RL8-250312-1522-7e5c051d
  • RockyLinux 9: openhpc-RL9-250312-1435-7e5c051d

v1.157

15 Jan 13:29
5f7e48f

What's Changed

  • Update ceph to use ark packages and move RL9 to ceph reef by @wtripp180901 in #519
  • Add more information re. configuring production sites by @sjpb in #508
  • Change defaults so a cookiecutter environment is fully functional by @wtripp180901 in #473
  • Fix epel not using Ark repos for RL8 by @wtripp180901 in #526
  • Fix volume_backed_instances not working for compute nodes by @sjpb in #527
  • Generate and persist hostkeys for ondemand and login nodes by @wtripp180901 in #525
  • Support additional volumes on compute nodes by @sjpb in #528
  • Support SSSD and optionally LDAP by @sjpb in #438
  • Fix nightly cleanup to deal with duplicate server names by @bertiethorpe in #532
  • Fix various typos in documentation by @priteau in #530
  • Fix environment creation steps by @priteau in #531
  • Support and test "re-imageable" compute nodes via compute node metadata by @bertiethorpe in #518
  • Document required security groups by @priteau in #534
  • Bump Zenith client to latest from azimuth-cloud namespace by @m-bull in #437
  • Fix yaml formatting in operations docs by @sjpb in #535
  • Enable image builds to install extra packages by default by @sjpb in #536

Image Details

Two new images are available:

  • RL8: openhpc-RL8-250114-1627-bccc88b5
  • RL9: openhpc-RL9-250114-1626-bccc88b5

Full Changelog: v1.156...v1.157

v1.156

07 Jan 13:20
4def5ba

What's Changed

Due to the size of this release, PRs are grouped below. In brief:

  • This release addresses various breakages caused by changes to upstream repos. As a result, as of this release the StackHPC images (see below) ship with all dnf repos disabled. Building images now requires either credentials for StackHPC's ark server or a local Pulp server mirrored from ark.
  • OFED and CUDA are no longer shipped in StackHPC images and require an image build to add.
  • StackHPC images move to RockyLinux 9.5 and 8.10.
  • Added support for NVIDIA DOCA instead of OFED.
  • Added support for Lustre clients.
  • OpenHPC role supports using the same nodes in multiple partitions/groups.
  • Additional packages can be added via appliances_default_extra_packages.

Isolation from upstream dnf repos

New functionality

  • Support lustre client by @sjpb in #447
  • Install k3s cluster with ansible init by @wtripp180901 in #441
  • Make block device detection work on ESXi by @mkjpryor in #481
  • Add role to install NVIDIA DOCA on top of an existing "fat" image by @sjpb in #492
  • Fix DOCA install cleanup deleting /tmp by @sjpb in #494
  • Add list of additional package installs by @wtripp180901 in #499
  • EXPERIMENTAL: add machinery to allow compute nodes to rejoin cluster on reimage by @sjpb in #500
  • Ansible-init compute node script by @bertiethorpe in #476

Docs

  • Add missing bits re. initial setup to refactored README by @sjpb in #464
  • Add generic upgrade docs by @sjpb in #462
  • Add note about login node reboot when changing OOD servername by @sd109 in #510

Fixes

  • Remove local DNS as a dependency for k3s by @sjpb in #442
  • Fix adhoc/rebuild wait_for_connection race condition by @bertiethorpe in #483
  • Fix Lustre deleting rdma packages and bump to v2.15.6 for RL9.5 support by @wtripp180901 in #502

Upgrades

  • Upgrade RL8 ceph to quincy + trivy rate limit and OOD false positives fix by @wtripp180901 in #477
  • Bump openhpc role for slurm restart, templating and nodes in multiple groups by @sjpb in #488

Internal CI changes/fixes

  • Don't run trivy scan on nightly builds by @sjpb in #467
  • Unset signature_verified property from nightly/latest images by @sjpb in #474
  • Don't fail cluster cleanup when prefix not found by @bertiethorpe in #480
  • Fix nightly images getting timestamp/git hash by @sjpb in #493
  • Fix nightly build version (v2) by @sjpb in #495
  • Remove use of FIPs for leafcloud packer builds by @sjpb in #498

Image Details

Two new images are available (neither of which now contains OFED):

  • RL8: openhpc-RL8-250106-0916-f8603056
  • RL9: openhpc-RL9-250106-0916-f8603056

Full Changelog: v1.155...v1.156

v1.155

24 Oct 13:18
6f1554c

What's Changed

  • Prevent ansible-init running during packer build by @wtripp180901 in #439
  • Ensure podman copes with a hard reboot by @sjpb in #460
  • Add workflow to cleanup CI clusters by @sjpb in #451

Image Details

Three new images are available, all with OFED:

  • openhpc-RL8-241022-0441-a5affa58
  • openhpc-RL9-241022-0038-a5affa58
  • openhpc-cuda-RL9-241022-0441-a5affa58

Full Changelog: v1.154...v1.155