Autoscaler config and new runner for testing #38


Merged 2 commits on Jan 30, 2025
31 changes: 31 additions & 0 deletions ci/cluster/oci/autoscaler/README.MD
@@ -0,0 +1,31 @@
# Cluster autoscaler configuration

Configuration files for the cluster autoscaler on the OKE cluster that runs the
external GitHub Actions runners.

References:

- Working with the Cluster Autoscaler (step-by-step guide):
  https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengusingclusterautoscaler_topic-Working_with_the_Cluster_Autoscaler.htm#Working_with_the_Cluster_Autoscaler
- OKE Workload Identity: Greater control of access:
  https://blogs.oracle.com/cloud-infrastructure/post/oke-workload-identity-greater-control-access

## Step 1: Setting Up an Instance Principal or Workload Identity Principal to Enable Cluster Autoscaler Access to Node Pools

### Using instance principals to enable access to node pools

An instance principal was created.

### Create a new compartment-level dynamic group containing the worker nodes (compute instances) in the cluster:

https://cloud.oracle.com/identity/domains/ocid1.domain.oc1..aaaaaaaaqlvbp36i7exr5phcr4jy4o33fn7vw5vtd4h4rxmwzzfpf4dtylea/dynamic-groups/ocid1.dynamicgroup.oc1..aaaaaaaa7qbdtn3zbnph3yy62gjyr5i2ls7cvwe3pzoimmjckzg5cyki3bzq/application-roles?region=us-sanjose-1
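A typical matching rule for this kind of dynamic group, sketched with a placeholder compartment OCID (not the value used in this cluster):

```
ALL {instance.compartment.id = 'ocid1.compartment.oc1..<compartment-ocid>'}
```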

### Policy to allow worker nodes to manage node pools:

https://cloud.oracle.com/identity/domains/policies/ocid1.policy.oc1..aaaaaaaanawfi3j4otvdhlefhgf5fogr2wnhjzljpxmf4afjwufd3zknmk7q?region=us-sanjose-1
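The policy grants the dynamic group the permissions the autoscaler needs to resize node pools. A sketch following the pattern in the Oracle documentation; dynamic group and compartment names are placeholders:

```
Allow dynamic-group <dynamic-group-name> to manage cluster-node-pools in compartment <compartment-name>
Allow dynamic-group <dynamic-group-name> to manage instance-family in compartment <compartment-name>
Allow dynamic-group <dynamic-group-name> to use subnets in compartment <compartment-name>
Allow dynamic-group <dynamic-group-name> to read virtual-network-family in compartment <compartment-name>
Allow dynamic-group <dynamic-group-name> to use vnics in compartment <compartment-name>
Allow dynamic-group <dynamic-group-name> to inspect compartments in compartment <compartment-name>
```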

### Using workload identity principals to enable access to node pools

https://cloud.oracle.com/identity/domains/policies/ocid1.policy.oc1..aaaaaaaaqbjexxhyrjdjf2py2vchiz6dg7ewt4qburayq7n35k4fnuoirg7q?region=us-sanjose-1
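With workload identity, the policy is scoped to the autoscaler's Kubernetes service account rather than to the worker nodes. A sketch of the cluster-node-pools statement (compartment name and cluster OCID are placeholders; the other resource types above follow the same pattern):

```
Allow any-user to manage cluster-node-pools in compartment <compartment-name> where ALL {request.principal.type = 'workload', request.principal.namespace = 'kube-system', request.principal.service_account = 'cluster-autoscaler', request.principal.cluster_id = '<cluster-ocid>'}
```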

## Step 2: Copy and customize the Cluster Autoscaler configuration file
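The main values to customize are the autoscaler image matching the cluster's Kubernetes version and the `--nodes=<min>:<max>:<nodepool-ocid>` arguments in `cluster-autoscaler.yaml`. Once edited, the manifest can be applied and checked with kubectl, for example:

```
# Deploy the autoscaler into kube-system
kubectl apply -f ci/cluster/oci/autoscaler/cluster-autoscaler.yaml

# Follow the autoscaler logs to watch scale-up/scale-down decisions
kubectl -n kube-system logs -f deployment/cluster-autoscaler
```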

178 changes: 178 additions & 0 deletions ci/cluster/oci/autoscaler/cluster-autoscaler.yaml
@@ -0,0 +1,178 @@
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["events", "endpoints"]
    verbs: ["create", "patch"]
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["pods/status"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["endpoints"]
    resourceNames: ["cluster-autoscaler"]
    verbs: ["get", "update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["watch", "list", "get", "patch", "update"]
  - apiGroups: [""]
    resources:
      - "pods"
      - "services"
      - "replicationcontrollers"
      - "persistentvolumeclaims"
      - "persistentvolumes"
    verbs: ["watch", "list", "get"]
  - apiGroups: ["extensions"]
    resources: ["replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["watch", "list"]
  - apiGroups: ["apps"]
    resources: ["statefulsets", "replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses", "csinodes"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["batch", "extensions"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "patch"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create"]
  - apiGroups: ["coordination.k8s.io"]
    resourceNames: ["cluster-autoscaler"]
    resources: ["leases"]
    verbs: ["get", "update"]
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["watch", "list"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["csidrivers", "csistoragecapacities"]
    verbs: ["watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["cluster-autoscaler-status", "cluster-autoscaler-priority-expander"]
    verbs: ["delete", "get", "update", "watch"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8085'
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - image: phx.ocir.io/oracle/oci-cluster-autoscaler:1.31.0-1
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=oci
            - --max-node-provision-time=25m
            - --nodes=1:10:ocid1.nodepool.oc1.us-sanjose-1.aaaaaaaaxjbwe3w6qswmyflqesvj76cy2fhzwvm6ztamsqptpngnhkzehmha
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
            - --unremovable-node-recheck-timeout=5m
            - --balance-similar-node-groups
            - --balancing-ignore-label=displayName
            - --balancing-ignore-label=hostname
            - --balancing-ignore-label=internal_addr
            - --balancing-ignore-label=oci.oraclecloud.com/fault-domain
            - --skip-nodes-with-system-pods=false
          imagePullPolicy: "Always"
          env:
            - name: OKE_USE_INSTANCE_PRINCIPAL
              value: "true"
            - name: OCI_SDK_APPEND_USER_AGENT
              value: "oci-oke-cluster-autoscaler"
203 changes: 203 additions & 0 deletions ci/cluster/oci/autoscaler/deployment.yaml
@@ -0,0 +1,203 @@
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["events", "endpoints"]
    verbs: ["create", "patch"]
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["pods/status"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["endpoints"]
    resourceNames: ["cluster-autoscaler"]
    verbs: ["get", "update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["watch", "list", "get", "update"]
  - apiGroups: [""]
    resources:
      - "namespaces"
      - "pods"
      - "services"
      - "replicationcontrollers"
      - "persistentvolumeclaims"
      - "persistentvolumes"
    verbs: ["watch", "list", "get"]
  - apiGroups: ["extensions"]
    resources: ["replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["watch", "list"]
  - apiGroups: ["apps"]
    resources: ["statefulsets", "replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["storage.k8s.io"]
    resources:
      ["storageclasses", "csinodes", "csidrivers", "csistoragecapacities"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["batch", "extensions"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "patch"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create"]
  - apiGroups: ["coordination.k8s.io"]
    resourceNames: ["cluster-autoscaler"]
    resources: ["leases"]
    verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames:
      ["cluster-autoscaler-status", "cluster-autoscaler-priority-expander"]
    verbs: ["delete", "get", "update", "watch"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8085"
    spec:
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
      # Node affinity is used to force cluster-autoscaler to stick
      # to the master node. This allows the cluster to reliably downscale
      # to zero worker nodes when needed.
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-role.kubernetes.io/control-plane
                    operator: Exists
      serviceAccountName: cluster-autoscaler
      containers:
        - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.31.0
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          imagePullPolicy: Always
          env:
            - name: INSTALLED_CCM
              value: cloud-provider-equinix-metal
            - name: METAL_CONTROLLER_NODE_IDENTIFIER_LABEL
              value: node-role.kubernetes.io/control-plane
            - name: METAL_AUTH_TOKEN
              valueFrom:
                secretKeyRef:
                  name: cluster-autoscaler-equinixmetal
                  key: authtoken
          # You can take advantage of multiple nodepools by adding
          # extra arguments on the cluster-autoscaler command.
          # e.g. for pool1, pool2
          # --nodes=0:10:pool1
          # --nodes=0:10:pool2
          command:
            - ./cluster-autoscaler
            - --alsologtostderr
            - --cluster-name=cluster1
            - --cloud-config=/config/cloud-config
            - --cloud-provider=equinixmetal
            - --expander=price
            - --nodes=1:20:pool1
            - --nodes=1:10:pool2
            - --nodes=1:5:pool3
            - --scale-down-unneeded-time=1m0s
            - --scale-down-delay-after-add=1m0s
            - --scale-down-unready-time=1m0s
            - --v=2
          volumeMounts:
            - name: cloud-config
              mountPath: /config
              readOnly: true
      volumes:
        - name: cloud-config
          secret:
            secretName: cluster-autoscaler-cloud-config
77 changes: 77 additions & 0 deletions ci/cluster/oci/autoscaler/service-accounts.yaml
@@ -0,0 +1,77 @@
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler-role
rules:
  - apiGroups: [""]
    resources: ["events", "endpoints"]
    verbs: ["create", "patch"]
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["pods/status"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["endpoints"]
    resourceNames: ["cluster-autoscaler"]
    verbs: ["get", "update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["watch", "list", "get", "update"]
  - apiGroups: [""]
    resources:
      - "namespaces"
      - "pods"
      - "services"
      - "replicationcontrollers"
      - "persistentvolumeclaims"
      - "persistentvolumes"
    verbs: ["watch", "list", "get"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["watch", "list"]
  - apiGroups: ["apps"]
    resources: ["daemonsets", "replicasets", "statefulsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses", "csinodes"]
    verbs: ["watch", "list", "get"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["cluster-autoscaler-status", "cluster-autoscaler-priority-expander"]
    verbs: ["delete", "get", "update"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    resourceNames: ["cluster-autoscaler"]
    verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler-rolebinding
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler-role
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler-account
    namespace: kube-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-autoscaler-account
  namespace: kube-system
4 changes: 2 additions & 2 deletions ci/cluster/oci/runners/16cpu-64gb/argo.yaml
@@ -1,7 +1,7 @@
 apiVersion: argoproj.io/v1alpha1
 kind: Application
 metadata:
-  name: oci-16cpu-64gb
+  name: oracle-16cpu-64gb
   namespace: argocd
 spec:
   project: default
@@ -10,7 +10,7 @@ spec:
       repoURL: ghcr.io/actions/actions-runner-controller-charts
       targetRevision: 0.10.1
       helm:
-        releaseName: oci-16cpu-64gb
+        releaseName: oracle-16cpu-64gb
         valueFiles:
           - $values/ci/cluster/oci/runners/16cpu-64gb/values.yaml
     - repoURL: 'https://github.com/cncf/automation.git'
24 changes: 6 additions & 18 deletions ci/cluster/oci/runners/16cpu-64gb/values.yaml
@@ -13,9 +13,9 @@ controllerServiceAccount:
 ## maxRunners is the max number of runners the autoscaling runner set will scale up to.
 maxRunners: 100

-## minRunners is the min number of idle runners. The target number of runners created will be
-## calculated as a sum of minRunners and the number of jobs assigned to the scale set.
-minRunners: 1
+## minRunners min number of idle runners.
+## Target number of runners = minRunners + number of jobs assigned to scale set.
+minRunners: 1

 # runnerGroup: "default"

@@ -63,7 +63,7 @@ containerMode:
 #   annotations:

 ## template is the PodSpec for each listener Pod
-## For reference: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#PodSpec
+## https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#PodSpec
 listenerTemplate:
   spec:
     tolerations:
@@ -186,10 +186,10 @@ template:
         command: ["/home/runner/run.sh"]
         resources:
           requests:
-            memory: 56Gi
+            memory: 64Gi
             cpu: 16
           limits:
-            memory: 60Gi
+            memory: 68Gi
             cpu: 20
       - name: dind
         image: docker:dind
@@ -214,15 +214,3 @@ template:
     volumes:
       - name: work
         emptyDir: {}
-      # We need to assume the DIND socket volumes are being provided
-      # This is because Helm + Argo is busted :) The previous values won't work properly
-
-## Optional controller service account that needs to have required Role and RoleBinding
-## to operate this gha-runner-scale-set installation.
-## The helm chart will try to find the controller deployment and its service account at installation time.
-## In case the helm chart can't find the right service account, you can explicitly pass in the following value
-## to help it finish RoleBinding with the right service account.
-## Note: if your controller is installed to only watch a single namespace, you have to pass these values explicitly.
-# controllerServiceAccount:
-#   namespace: arc-system
-#   name: test-arc-gha-runner-scale-set-controller