NLB PodReadinessGate instability #4034

Open
ajax-suprun-r opened this issue Jan 24, 2025 · 7 comments
@ajax-suprun-r

Describe the bug

We are using the Argo Rollouts canary strategy with an NLB operated at the Service level. During deployment of a new version, the first pod is missing the readiness gate, while the second one scheduled in the same service does have it. It looks like the load balancer controller misses the injection for the first pod.

In the screenshot you can see that the first (older) pod comes up without a readiness gate, while the next one does have one.
[screenshot: pod list showing the first pod without a readiness gate]

Steps to reproduce

  1. Use ArgoCD and Argo Rollouts as the delivery system.
  2. Use 2 replicas.
  3. Configure a canary strategy with the following steps (a minimal Rollout manifest sketch follows this list):

steps:
  - setWeight: 50
  - pause:
      duration: 30s
  - setWeight: 100

  4. Make a deployment that changes the app version.
  5. Monitor the pod readiness gate status with: kubectl get pod -l 'app.kubernetes.io/name=<app_name>' -o wide -n <namespace>
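
For concreteness, a minimal Argo Rollouts manifest with this canary strategy might look like the sketch below; the names, labels, and image are illustrative, not taken from the reporter's setup:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app                      # illustrative name
  namespace: <namespace>
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: app
    spec:
      containers:
      - name: app
        image: <registry>/app:<version>   # change the tag to trigger a rollout
  strategy:
    canary:
      steps:
      - setWeight: 50
      - pause:
          duration: 30s
      - setWeight: 100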

Expected outcome
The readiness gate is applied to the first pod during the rollout process.

Environment

Services types are: LoadBalancer

Annotations and ports configuration:

[screenshot: Service annotations and port configuration]

As a result, 2 load balancers, 2 services, and 6 target group bindings are created.

  • AWS Load Balancer controller version: 2.10.1
  • Kubernetes version: 1.31
  • Using EKS (yes/no), if so version? Yes, eks.16


@zac-nixon zac-nixon self-assigned this Jan 24, 2025
@zac-nixon
Collaborator

zac-nixon commented Jan 24, 2025

How are you creating the namespace? It's very important that the order of resource creation goes:

  1. Namespace
  2. Label namespace
  3. Perform other deployment actions
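
For example, as a shell sketch (the namespace name and manifest file are placeholders):

  kubectl create namespace <namespace>
  kubectl label namespace <namespace> elbv2.k8s.aws/pod-readiness-gate-inject=enabled
  # only after the namespace is labeled, create the Service/Ingress and workloads
  kubectl apply -f workload.yaml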

@ajax-suprun-r
Author

ajax-suprun-r commented Jan 24, 2025

How are you creating the namespace? It's very important that the order of resource creation goes:

  1. Namespace
  2. Label namespace
  3. Perform other deployment actions

Yeah, we create the namespace before everything else, with the proper label elbv2.k8s.aws/pod-readiness-gate-inject: enabled, and only after that do we launch our workload.

In addition to that, we have a canary deployment with an ALB in the same namespace and everything works well, so it's definitely not a namespace issue.
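
A namespace manifest carrying that label would look roughly like this (the name is a placeholder):

apiVersion: v1
kind: Namespace
metadata:
  name: <namespace>
  labels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled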

@zac-nixon
Collaborator

What is the order in which you create the deployment / (service or ingress)?

Basically, if you create the deployment first, your initial pods can come up with no readiness gates, since at creation time they were not associated with a load balancer. If you create the ingress or service first and then create the deployment, then when your pods are created they are already associated with the load balancer and hence will have readiness gates attached.

These are the scenarios I tested with an NLB:

The initial pods came up with no readiness gates:

  1. Create deployment.
  2. Create SVC that references the deployment.

The initial pods came up with readiness gates:

  1. Create SVC that references the deployment from step (2).
  2. Create deployment.
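
To verify whether the gate was injected, the readiness gates can be read straight from the pod spec; a sketch, assuming pods labeled tier=frontend:

  kubectl get pod -l tier=frontend -n <namespace> \
    -o custom-columns='NAME:.metadata.name,READINESS_GATES:.spec.readinessGates[*].conditionType'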

@ajax-suprun-r
Author

@zac-nixon I've tested both approaches, and the results are the same; we are operating with ReplicaSets and SVCs.

The SVCs are created first and the ReplicaSets after that; on each new deployment of a service a new ReplicaSet is created, and the SVC does not change at all.

I tried both scenarios, and the result is the same: the first pod in the replica set is not injected with the readiness gate.

[screenshot: pod list showing the first pod in the new replica set without a readiness gate]

When I restart pods in the same replica set, all pods get the readiness gate applied; the problem comes up only when a new replica set is created, while all other objects (including the SVC) are unchanged and not recreated.

@zac-nixon
Collaborator

I think I see the issue, and it relates to how Kubernetes handles eventual consistency.

I can repro the same behavior:

  kubectl create namespace nlb-game-2048-5
  kubectl label namespace nlb-game-2048-5 elbv2.k8s.aws/pod-readiness-gate-inject=enabled
  kubectl apply -f /tmp/svc.yaml
  kubectl apply -f /tmp/rs.yaml

svc.yaml

apiVersion: v1
kind: Service
metadata:
  namespace: nlb-game-2048-5
  name: repro
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
spec:
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  type: LoadBalancer
  loadBalancerClass: service.k8s.aws/nlb
  selector:
    tier: frontend

rs.yaml

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: repro
  namespace: nlb-game-2048-5
  labels:
    app: guestbook
    tier: frontend
spec:
  # modify replicas according to your case
  replicas: 3
  selector:
    matchLabels:
      tier: frontend
  template:
    metadata:
      labels:
        tier: frontend
    spec:
      containers:
      - name: php-redis
        image: public.ecr.aws/l6m2t8p7/docker-2048:latest

The pods come up with no readiness gates attached.

The issue is due to eventual consistency within Kubernetes and the LBC. I was able to solve it by adding a sleep between each command, to ensure the LBC has a correctly warmed cache before processing each operation.

  kubectl create namespace nlb-game-2048-5
  sleep 5
  kubectl label namespace nlb-game-2048-5 elbv2.k8s.aws/pod-readiness-gate-inject=enabled
  sleep 5
  kubectl apply -f /tmp/svc.yaml
  sleep 5
  kubectl apply -f /tmp/rs.yaml

I know it's not a great solution, but there are some pretty serious architectural limitations at play. Can you try with some time between each of the operations?
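
A slightly more robust variant of the same workaround, sketched here as an assumption rather than a verified fix, is to poll instead of sleeping a fixed interval: wait until the controller has created a TargetGroupBinding for the Service before creating the ReplicaSet:

  kubectl create namespace nlb-game-2048-5
  kubectl label namespace nlb-game-2048-5 elbv2.k8s.aws/pod-readiness-gate-inject=enabled
  kubectl apply -f /tmp/svc.yaml
  # Poll until the controller has created a TargetGroupBinding in the namespace.
  # Assumption: its existence is used here as a proxy for the controller's cache
  # being warm enough to inject readiness gates.
  until kubectl get targetgroupbinding -n nlb-game-2048-5 --no-headers 2>/dev/null | grep -q .; do
    sleep 1
  done
  kubectl apply -f /tmp/rs.yaml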

@zac-nixon zac-nixon added the kind/bug Categorizes issue or PR as related to a bug. label Jan 27, 2025
@ajax-suprun-r
Author

@zac-nixon Hi, this workaround works fine; it's not the best solution, but it works.
You can close this issue if you don't plan to fix this behavior.

@zac-nixon
Collaborator

I'm in favor of leaving this open, as it's a legitimate issue, although I do not have the time to work on a proper fix at the moment.
