feat: hpa with cpu + mem util scaling options #628

Open · wants to merge 1 commit into base: master

Conversation

@burnjake commented Mar 7, 2024

  • CHANGELOG.md updated - n/a?
  • Rebased/mergeable
  • Tests pass (see comment below)
  • Sign CLA (if not already signed)

We would like to scale the number of replicas based on usage, which is currently a slight pain: if we roll our own HPA resource, we have to set the deployment.spec.replicas field to none. There's also a pre-existing issue: #624.
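
For reference, a minimal sketch of what this implies for the deployment template, assuming the chart exposes the replica count as .Values.replicaCount (the actual diff may differ):

spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}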

$ helm version
version.BuildInfo{Version:"v3.14.2", GitCommit:"c309b6f0ff63856811846ce18f3bdc93d2b4d54b", GitTreeState:"clean", GoVersion:"go1.22.0"}

Setting autoscaling.enabled: true templates the following Deployment and HPA resources:

$ cat values.yaml | grep autoscaling -A10
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80
  targetMemoryUtilizationPercentage: 80
  behavior: {}

$ helm template ./ -s templates/deployment.yaml
---
# Source: telegraf/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: release-name-telegraf
  labels:
    helm.sh/chart: telegraf-1.8.43
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: telegraf
    app.kubernetes.io/instance: release-name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: telegraf
      app.kubernetes.io/instance: release-name
  template:
    metadata:
      labels:
        app.kubernetes.io/name: telegraf
        app.kubernetes.io/instance: release-name
      annotations:
        checksum/config: 11e7bc3db613c177911535018f65051a22f67ef0cf419dc2f19448d2a629282f
    spec:
      serviceAccountName: release-name-telegraf
      containers:
      - name: telegraf
        image: "docker.io/library/telegraf:1.29-alpine"
        imagePullPolicy: "IfNotPresent"
        resources:
          {}
        env:
        - name: HOSTNAME
          value: telegraf-polling-service
        volumeMounts:
        - name: config
          mountPath: /etc/telegraf
      volumes:
      - name: config
        configMap:
          name: release-name-telegraf

$ helm template ./ -s templates/horizontalpodautoscaler.yaml
---
# Source: telegraf/templates/horizontalpodautoscaler.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: release-name-telegraf
  labels:
    helm.sh/chart: telegraf-1.8.43
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: telegraf
    app.kubernetes.io/instance: release-name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: release-name-telegraf
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
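
The metrics block above is a direct mapping of the two utilization values; in the template it presumably looks roughly like the following sketch (not a quote from the diff):

  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetMemoryUtilizationPercentage }}
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}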

Setting autoscaling.enabled: false templates the following Deployment resource:

$ cat values.yaml | grep autoscaling -A10
autoscaling:
  enabled: false
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80
  targetMemoryUtilizationPercentage: 80
  behavior: {}

$ helm template ./ -s templates/deployment.yaml
---
# Source: telegraf/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: release-name-telegraf
  labels:
    helm.sh/chart: telegraf-1.8.43
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: telegraf
    app.kubernetes.io/instance: release-name
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: telegraf
      app.kubernetes.io/instance: release-name
  template:
    metadata:
      labels:
        app.kubernetes.io/name: telegraf
        app.kubernetes.io/instance: release-name
      annotations:
        checksum/config: 11e7bc3db613c177911535018f65051a22f67ef0cf419dc2f19448d2a629282f
    spec:
      serviceAccountName: release-name-telegraf
      containers:
      - name: telegraf
        image: "docker.io/library/telegraf:1.29-alpine"
        imagePullPolicy: "IfNotPresent"
        resources:
          {}
        env:
        - name: HOSTNAME
          value: telegraf-polling-service
        volumeMounts:
        - name: config
          mountPath: /etc/telegraf
      volumes:
      - name: config
        configMap:
          name: release-name-telegraf

$ helm template ./ -s templates/horizontalpodautoscaler.yaml
Error: could not find template templates/horizontalpodautoscaler.yaml in chart
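
This is expected: with autoscaling disabled the HPA template renders nothing, so helm template -s has no manifest to show. The whole file is presumably wrapped in a guard along these lines (a sketch; see the actual diff for details):

{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
# ... metadata and spec as rendered above ...
{{- end }}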

An example with autoscaling.behavior set:

$ cat values.yaml | grep autoscaling -A20
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80
  targetMemoryUtilizationPercentage: 80
  behavior:
    scaleDown:
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60
      - type: Percent
        value: 10
        periodSeconds: 60

$ helm template ./ -s templates/horizontalpodautoscaler.yaml
---
# Source: telegraf/templates/horizontalpodautoscaler.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: release-name-telegraf
  labels:
    helm.sh/chart: telegraf-1.8.43
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: telegraf
    app.kubernetes.io/instance: release-name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: release-name-telegraf
  minReplicas: 1
  maxReplicas: 5
  behavior:
    scaleDown:
      policies:
      - periodSeconds: 60
        type: Pods
        value: 4
      - periodSeconds: 60
        type: Percent
        value: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
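
The reordering of keys inside behavior (periodSeconds first) suggests the value is passed straight through with toYaml, roughly like this sketch (an assumption about the template, not a quote from the diff):

  {{- with .Values.autoscaling.behavior }}
  behavior:
    {{- toYaml . | nindent 4 }}
  {{- end }}

A side effect of using with is that the default behavior: {} renders nothing, which matches the first HPA output above.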

A contributor commented on the following lines of the HPA template:

        name: memory
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetMemoryUtilizationPercentage }}

Ok, so I understand that the autoscaler will launch additional telegraf pods if usage goes above a certain memory or CPU threshold, but what ensures that the first pod then sees reduced usage? Is there a load balancer or some other proxy in front that would round-robin the traffic?

Trying to understand the full use case and how a user would take advantage of this without needing to modify their config. Thanks!

@burnjake (author) commented Apr 5, 2024

Hi! Apologies, I've been away for a few days. Our use case is to use the opentelemetry input, aggregate with basicstats, and output with prometheus_client. We have a traffic pattern where the number of connections varies quite a lot within the day, so varying our replica count is prudent.

As the opentelemetry input expects connections via gRPC, we can't depend on normal load balancing via a k8s Service; instead we rely on an external LB which we've plumbed into the ingress of the cluster. It discovers the new replicas and spreads the traffic across them (by updating its connection pool, I think?). In short, we don't need extra configuration within telegraf for this to work, but our use case is indeed very specific!
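
For context, a hypothetical values.yaml sketch of that pipeline, assuming the chart's config section accepts an aggregators list the same way it accepts inputs and outputs (plugin options and ports here are illustrative, not from this PR):

config:
  inputs:
    - opentelemetry:
        service_address: ":4317"
  aggregators:
    - basicstats:
        period: "30s"
        drop_original: true
  outputs:
    - prometheus_client:
        listen: ":9273"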
