Releases: oracle-quickstart/oci-hpc-oke
OKE RDMA Quickstart Resource Manager template v26.3.0
What's Changed
- Monitoring update by @OguzPastirmaci in #136
- Fix interfaces in BM.GPU.B300.8 manifest by @OguzPastirmaci in #137
- Add MI355X.8 RCCL manifest and disable legacy IMDS endpoint by @subash-m in #135
- Add support to customize bootstrap by @robo-cap in #138
- Add support for identity domains by @OguzPastirmaci in #139
- Update OS images and NCCL tests images in the manifests by @OguzPastirmaci in #140
- Coredns addon fix by @OguzPastirmaci in #141
- add option to create bastion service by @shethdhvani in #143
- Install Kueue & RDMA topology labeler DS by @OguzPastirmaci in #145
- improve wait cloud-init conditions for the operator node by @robo-cap in #146
- Add missing via-operator flags to SSH key bundling condition by @OguzPastirmaci in #147
- Update SSH authorized keys handling in OKE workers by @OguzPastirmaci in #148
- [tests] Add ability to pass ssh_public_key in tfvars file by @robo-cap in #151
- Bastion service updates by @OguzPastirmaci in #152
- Confirm SSH public key has comment when using deployment via operator by @robo-cap in #153
- improve custom image selection by @robo-cap in #149
- Add GitHub Actions workflows for stack testing by @OguzPastirmaci in #154
- Bump dorny/paths-filter from 3.0.2 to 4.0.1 by @dependabot[bot] in #155
- Bump hashicorp/setup-terraform from 3.1.2 to 4.0.0 by @dependabot[bot] in #159
- Bump actions/github-script from 7.1.0 to 8.0.0 by @dependabot[bot] in #158
- Bump actions/checkout from 4.3.1 to 6.0.2 by @dependabot[bot] in #156
- Bump actions/cache from 4.3.0 to 5.0.4 by @dependabot[bot] in #157
- Add FSS testing workflows and replace actions that are not allowed to be used by @OguzPastirmaci in #160
- Bump actions/setup-go from 5.6.0 to 6.3.0 by @dependabot[bot] in #161
- Bump actions/cache from 4.3.0 to 5.0.4 by @dependabot[bot] in #162
- Bump github.com/stretchr/testify from 1.9.0 to 1.11.1 in /test by @dependabot[bot] in #163
- Bump actions/github-script from 7.1.0 to 8.0.0 by @dependabot[bot] in #164
- Bump actions/checkout from 4.3.1 to 6.0.2 by @dependabot[bot] in #166
- Bump github.com/gruntwork-io/terratest from 0.48.0 to 0.56.0 in /test by @dependabot[bot] in #165
- Fix GB validation issue by @OguzPastirmaci in #168
- Update hashicorp/kubernetes requirement from ~> 2.29.0 to ~> 3.0.1 in /terraform by @dependabot[bot] in #167
- Add support for hostexec by @robo-cap in #170
- Simplify CI workflows by @OguzPastirmaci in #172
- Update test readme by @OguzPastirmaci in #173
- Various fixes by @robo-cap in #174
- Various fixes by @OguzPastirmaci in #175
- Doc fixes by @OguzPastirmaci in #176
New Contributors
- @subash-m made their first contribution in #135
- @dependabot[bot] made their first contribution in #155
Full Changelog: v26.2.0...v26.3.0
v26.3.0-rc1
What's Changed
- Monitoring update by @OguzPastirmaci in #136
- Fix interfaces in BM.GPU.B300.8 manifest by @OguzPastirmaci in #137
- Add MI355X.8 RCCL manifest and disable legacy IMDS endpoint by @subash-m in #135
- Add support to customize bootstrap by @robo-cap in #138
- Add support for identity domains by @OguzPastirmaci in #139
- Update OS images and NCCL tests images in the manifests by @OguzPastirmaci in #140
- Coredns addon fix by @OguzPastirmaci in #141
New Contributors
- @subash-m made their first contribution in #135
Full Changelog: v26.2.0...v26.3.0-rc1
OKE RDMA Quickstart Resource Manager template v26.2.0
What's Changed
- Update readme by @OguzPastirmaci in #87
- Update path for OCI CLI in helm-deployment.tf by @OguzPastirmaci in #89
- Fix ubuntu repo by @robo-cap in #92
- Issue number: 90 - fss mount on all worker nodes by @subburamoracle in #91
- Install OKE node client packages from local repo if it exists by @OguzPastirmaci in #93
- Improve ons-webhook resiliency by @robo-cap in #94
- Add retry function to cloud init by @OguzPastirmaci in #95
- Module fixes and improvements by @robo-cap in #96
- Use NSGs instead of SLs for Lustre Service by @robo-cap in #100
- Update NPD values file by @OguzPastirmaci in #102
- Add NCCL tests manifest for BM.GPU.GB200-v3.4 and update the other manifests to use NCCL 2.29 by @OguzPastirmaci in #103
- Add Terratest tests by @OguzPastirmaci in #101
- Add the document for replacing the boot volumes of self-managed nodes by @OguzPastirmaci in #106
- Update NCCL/RCCL images by @OguzPastirmaci in #107
- Add check to wait until kubeconfig exists by @OguzPastirmaci in #108
- Add MI355 manifest and update other manifests by @OguzPastirmaci in #109
- Move GPU Fryer active health checks to Python by @OguzPastirmaci in #110
- Update BM.GPU.MI355X-v1.8.yaml by @OguzPastirmaci in #111
- added support for VM.DenseIO shapes by @shethdhvani in #114
- Update replacing node using BVR guide by @OguzPastirmaci in #115
- Fix pod logs mount by @robo-cap in #118
- Replace Nginx Ingress controller with Contour by @robo-cap in #117
- Fix: Set to retentionSize for Prometheus by @sam-andaluri in #119
- Update contour helm values by @robo-cap in #120
- Add NCCL tests manifest for BM.GPU.GB300.4 by @OguzPastirmaci in #121
- Update BM.GPU.GB300.4.yaml by @OguzPastirmaci in #122
- Add cloud-shell support to the BVR script by @robo-cap in #123
- Remove BV high storage class by @OguzPastirmaci in #126
- Add option to change services CIDR by @OguzPastirmaci in #127
- Add NCCL tests 2.29.3 images by @OguzPastirmaci in #124
- Update Node Problem Detector checks by @OguzPastirmaci in #130
- Add an option to the OKE stack to use an existing Dynamic Group by @subburamoracle in #105
- Bump chart versions by @OguzPastirmaci in #131
- Add per-pool kubernetes version, max pods, and node cycling by @OguzPastirmaci in #128
- Larger CIDR to accommodate more nodes by @OguzPastirmaci in #129
- BugFix: Fix alert webhook to reduce chances of duplicate alerts by @sam-andaluri in #133
- set kubeproxy to use ipvs & several small tweaks by @robo-cap in #132
- Increase DCGM Exporter memory limits by @OguzPastirmaci in #134
New Contributors
- @subburamoracle made their first contribution in #91
- @shethdhvani made their first contribution in #114
- @sam-andaluri made their first contribution in #119
Full Changelog: v25.11.0...v26.2.0
OKE RDMA Quickstart Resource Manager template v25.11.0
- Add option to install OCIR credential helper
- Fix for Metrics Server
- Add support to use image URIs
Full Changelog: v25.10.0...v25.11.0
OKE RDMA Quickstart Resource Manager template v25.10.0
- Kubernetes upgrade: Added support for Kubernetes v1.34
- Documentation: New guide — Deploying Prometheus & Grafana Stack with Dashboards and Alerts manually
- Health checks:
  - Added RCCL tests
  - Added ROCm Validation Suite (RVS) gst_single for AMD validation
- Grafana access link: Default domain updated to endpoint.oci-hpc.ai, configurable for custom domains
- Component updates: Refreshed dependencies and minor fixes across the stack
Full Changelog: v25.9.0...v25.10.0
OKE RDMA Quickstart Resource Manager template v25.9.0
- Option to provision a shared Lustre file system and a PV backed by the Lustre file system
- Fully private clusters using Resource Manager Private Endpoint for deployment
- Same dashboards and notifications as the Slurm stack
- Option to use Oracle Linux for non-RDMA pools
- Component updates
OKE RDMA Quickstart Resource Manager template v25.5.1
This is a hotfix release to fix the breaking Helm provider change.
More info about the change here: hashicorp/terraform-provider-helm#1637
OKE RDMA Quickstart Resource Manager template v25.5.0
- Added AMD Device Metrics Exporter
- Added AMD dashboards
OKE RDMA Quickstart Resource Manager template v25.4.0
- Added Kubernetes v1.32
- Changed the default number of maximum pods per node to 110
OKE RDMA Quickstart Resource Manager template v25.3.1
- OKE AMD GPU device plugin is enabled for BM.GPU.MI300X.8 shape
- OKE DCGM Exporter is disabled (upstream DCGM Exporter is deployed)
- Helm fix for Grafana load balancer not being deleted properly on Terraform destroy
- Updated the health checks for Node Problem Detector
- Updated Grafana dashboards
- Added the required policies for Oracle Cloud Agent GPU/RDMA monitoring