HPA and NGF Controller Conflicting #4007

@cmbankester

Description

Describe the bug

When autoscaling.enable: true is configured in the Helm chart, the NGF controller updates the deployment and modifies the spec.replicas field in conflict with the HPA. This causes the deployment to scale up and down in the same second, resulting in constant pod churn and preventing the HPA from scaling up or down consistently.

To Reproduce

  1. Deploy NGF with autoscaling enabled using these Helm values:
nginx:
  autoscaling:
    enable: true
    metrics:
      - external:
          metric:
            name: <some-external-metric-providing-connection-count-across-all-replicas>
          target:
            type: Value
            value: 20000
        type: External
    minReplicas: 1
    maxReplicas: 10
  2. Wait for HPA to trigger a scale-down event

  3. Observe scale events:

kubectl get events -n ngf --sort-by='.lastTimestamp' -o custom-columns='when:lastTimestamp,msg:message,reason:reason,obj:involvedObject.name,cmp:source.component' | grep -E "SuccessfulRescale|ScalingReplicaSet"
  4. Check who last updated the deployment replicas:
kubectl get deployment nginx-public-gateway-nginx -n ngf --show-managed-fields -o json | \
  jq '.metadata.managedFields[] | select(.fieldsV1."f:spec"."f:replicas") | {manager: .manager, operation: .operation, time: .time}'

Expected behavior

When autoscaling.enable: true, the NGF controller should:

  1. Create the HPA resource
  2. Not change the spec.replicas field after HPA is created
  3. Allow the HPA to be the sole controller managing the replica count (see the sketch after this list)
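
A minimal sketch of that expected behavior, assuming the controller builds the Deployment object in Go before applying it (the package, function, and parameter names below are hypothetical, not NGF's actual code): when autoscaling is enabled, leave spec.replicas unset so the controller never claims the field and the HPA remains its sole owner.

package provisioner

import (
	appsv1 "k8s.io/api/apps/v1"
)

// buildDeploymentSpec is a hypothetical helper illustrating the expected
// behavior: only set spec.replicas when no HPA manages the Deployment.
func buildDeploymentSpec(autoscalingEnabled bool, staticReplicas int32) appsv1.DeploymentSpec {
	spec := appsv1.DeploymentSpec{
		// selector, template, etc. elided
	}
	if !autoscalingEnabled {
		// No HPA: the controller owns the replica count.
		spec.Replicas = &staticReplicas
	}
	// With autoscaling enabled, spec.Replicas stays nil, so a server-side
	// apply omits f:spec.f:replicas and never reverts the HPA's value.
	return spec
}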

Your environment

  • Version of NGINX Gateway Fabric: 2.1.2 (commit: 877c415, date: 2025-09-25T19:31:07Z)
  • Kubernetes Version: v1.32.6
  • Platform: Azure Kubernetes Service (AKS)
  • Exposure method: Service type LoadBalancer
  • Helm Chart Version: nginx-gateway-fabric-2.1.2

Observed behavior

Events show the deployment scaling up and down within the same second:

> kubectl get events -n immy-routing --sort-by='.lastTimestamp' -o custom-columns='when:lastTimestamp,msg:message,reason:reason,obj:involvedObject.name,cmp:source.component' | grep -E "SuccessfulRescale|ScalingReplicaSet"
2025-10-02T18:17:53Z   New size: 10; reason: external metric datadogmetric@immy-routing:nginx-connection-count-connections(nil) above target      SuccessfulRescale   nginx-public-gateway-nginx                      horizontal-pod-autoscaler
2025-10-02T18:19:38Z   Scaled down replica set nginx-public-gateway-nginx-57b699c549 from 10 to 8                                                 ScalingReplicaSet   nginx-public-gateway-nginx                      deployment-controller
2025-10-02T18:19:38Z   Scaled up replica set nginx-public-gateway-nginx-57b699c549 from 8 to 10                                                   ScalingReplicaSet   nginx-public-gateway-nginx                      deployment-controller
2025-10-02T18:21:38Z   Scaled down replica set nginx-public-gateway-nginx-57b699c549 from 10 to 9                                                 ScalingReplicaSet   nginx-public-gateway-nginx                      deployment-controller
2025-10-02T18:21:38Z   Scaled up replica set nginx-public-gateway-nginx-57b699c549 from 9 to 10                                                   ScalingReplicaSet   nginx-public-gateway-nginx                      deployment-controller
2025-10-02T18:25:23Z   Scaled up replica set ngf-nginx-gateway-fabric-74db69c968 from 0 to 1                                                      ScalingReplicaSet   ngf-nginx-gateway-fabric                        deployment-controller
2025-10-02T18:25:26Z   Scaled down replica set ngf-nginx-gateway-fabric-7b99997d79 from 1 to 0                                                    ScalingReplicaSet   ngf-nginx-gateway-fabric                        deployment-controller
2025-10-02T18:25:39Z   New size: 9; reason: All metrics below target                                                                              SuccessfulRescale   nginx-public-gateway-nginx                      horizontal-pod-autoscaler
2025-10-02T18:51:42Z   Scaled down replica set nginx-public-gateway-nginx-57b699c549 from 9 to 8                                                  ScalingReplicaSet   nginx-public-gateway-nginx                      deployment-controller
2025-10-02T18:51:42Z   New size: 8; reason: All metrics below target                                                                              SuccessfulRescale   nginx-public-gateway-nginx                      horizontal-pod-autoscaler

Checking managed fields confirms that the NGF controller (the "gateway" field manager) modified spec.replicas in the same second as the HPA:

> kubectl get deployment nginx-public-gateway-nginx -n immy-routing --show-managed-fields -o json | \
  jq '.metadata.managedFields[] | select(.fieldsV1."f:spec"."f:replicas") | {manager: .manager, operation: .operation, time: .time}'
{
  "manager": "gateway",
  "operation": "Update",
  "time": "2025-10-02T18:51:42Z"
}

And the replica count set by the HPA has been overwritten back to the old value:

> kubectl get deployment nginx-public-gateway-nginx -n immy-routing -o json | jq '.spec.replicas'     
9

Additional context

Suspected root cause: the NGF controller updates the deployment, including the spec.replicas field, even when HPA is enabled, resulting in a race condition (a possible mitigation is sketched after the list):

  1. HPA decides to scale (e.g., 10 → 8 replicas)
  2. HPA updates deployment .spec.replicas: 8
  3. Deployment terminates relevant pods
  4. NGF controller reconciles and resets .spec.replicas back to the old value (e.g., 10)
  5. Deployment spins up pods again
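
If the controller keeps using a plain update rather than a server-side apply that omits the field, one possible mitigation (a hypothetical sketch, not NGF's actual code; all names are illustrative) is to copy the live .spec.replicas into the desired object whenever autoscaling is enabled, so a reconcile never resets the value the HPA just wrote:

package provisioner

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// preserveHPAReplicas is a hypothetical helper: before updating the
// Deployment, reuse the live replica count (owned by the HPA) instead of
// the controller's own value.
func preserveHPAReplicas(ctx context.Context, c client.Client, desired *appsv1.Deployment, autoscalingEnabled bool) error {
	if !autoscalingEnabled {
		return nil // no HPA: the controller keeps managing replicas
	}
	var live appsv1.Deployment
	err := c.Get(ctx, client.ObjectKeyFromObject(desired), &live)
	if apierrors.IsNotFound(err) {
		return nil // first creation: the chart's minReplicas applies once
	}
	if err != nil {
		return err
	}
	// Carry over whatever the HPA last set.
	desired.Spec.Replicas = live.Spec.Replicas
	return nil
}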

Impact on production:

  • Pods restart every 2 minutes (matching HPA scale-down period)
  • Thousands of websocket connections dropped on each restart
  • Connection storms after scale-downs cause metric spikes
  • HPA unable to effectively manage scaling due to constant interference
