
400/502/504 errors while doing rollout restart or rolling update #1065

Closed
bmbferreira opened this issue Nov 6, 2019 · 17 comments

bmbferreira commented Nov 6, 2019

Hi, I'm getting errors while doing a rolling update, and I can reproduce the problem consistently with a rollout restart. I have already tried the recommendations from other issues related to this problem (#814), such as adding a preStop hook that sleeps for a few seconds so the pods can finish in-flight requests, but it doesn't solve the problem.

I have also changed the load balancer configuration so that the health check interval and threshold count are lower than what I'm setting for the pod's readiness probe, so the load balancer can stop sending requests to a pod before it receives the SIGTERM, but without success.
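For reference, these are the kinds of settings I mean. The values below are illustrative only, not my exact configuration, and the preStop sleep assumes the container image ships a shell:

# Ingress annotations: make the ALB health check react faster than the
# pod's readiness probe (illustrative values)
alb.ingress.kubernetes.io/healthcheck-interval-seconds: "5"
alb.ingress.kubernetes.io/healthy-threshold-count: "2"
alb.ingress.kubernetes.io/unhealthy-threshold-count: "2"

# Container spec: preStop hook so the pod keeps serving while the ALB
# deregisters it (assumes /bin/sh exists in the image)
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 30"]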

Currently this is the configuration for the ingress, service and deployment:

Ingress

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig":
      { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
    alb.ingress.kubernetes.io/certificate-arn: ...
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/scheme: internet-facing
    external-dns.alpha.kubernetes.io/hostname: app.dev.codacy.org, api.dev.codacy.org
    external-dns.alpha.kubernetes.io/scope: public
    kubernetes.io/ingress.class: alb
  labels:
    app.kubernetes.io/instance: codacy-api
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: codacy-api
    app.kubernetes.io/version: 4.93.0-SNAPSHOT.d94f47083
    helm.sh/chart: codacy-api-4.93.0-SNAPSHOT.d94f47083
  name: codacy-api
  namespace: codacy
spec:
  rules:
  - host: app.dev.codacy.org
    http:
      paths:
      - backend:
          serviceName: ssl-redirect
          servicePort: use-annotation
        path: /*
      - backend:
          serviceName: codacy-api
          servicePort: http
        path: /*
  - host: api.dev.codacy.org
    http:
      paths:
      - backend:
          serviceName: ssl-redirect
          servicePort: use-annotation
        path: /*
      - backend:
          serviceName: codacy-api
          servicePort: http
        path: /*

Service

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: 0.0.0.0/0
  labels:
    app.kubernetes.io/instance: codacy-api
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: codacy-api
    app.kubernetes.io/version: 4.94.0-SNAPSHOT.394a06196
    helm.sh/chart: codacy-api-4.94.0-SNAPSHOT.394a06196
  name: codacy-api
  namespace: codacy
spec:
  clusterIP: 172.20.101.186
  ports:
  - name: http
    nodePort: 30057
    port: 80
    targetPort: http
  selector:
    app.kubernetes.io/instance: codacy-api
    app.kubernetes.io/name: codacy-api
  type: NodePort

Deployment

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "202"
  labels:
    app.kubernetes.io/instance: codacy-api
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: codacy-api
    app.kubernetes.io/version: 4.94.0-SNAPSHOT.394a06196
    helm.sh/chart: codacy-api-4.94.0-SNAPSHOT.394a06196
  name: codacy-api
  namespace: codacy
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: codacy-api
      app.kubernetes.io/name: codacy-api
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        iam.amazonaws.com/role: ...
        kubectl.kubernetes.io/restartedAt: "2019-11-06T10:54:32Z"
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: codacy-api
        app.kubernetes.io/name: codacy-api
    spec:
      containers:
      - envFrom:
        - configMapRef:
            name: codacy-api
        - secretRef:
            name: codacy-api
        image: codacy/codacy-website:4.94.0-SNAPSHOT.394a06196
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: http
            scheme: HTTP
          initialDelaySeconds: 75
          periodSeconds: 20
          successThreshold: 1
          timeoutSeconds: 1
        name: codacy-api
        ports:
        - containerPort: 9000
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: http
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: docker-credentials
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 60

To replicate this issue I use fortio to call an endpoint on the application continuously for some time (like this: fortio load -a -c 8 -qps 500 -t 60s "https://app.dev.codacy.org/manual/user/project/dashboard?bid=123"), and meanwhile I run kubectl rollout restart deployment/codacy-api -n codacy to restart the pods.

At the end there are some errors caused by the rollout restart:

Fortio 1.3.1 running at 500 queries per second, 4->4 procs, for 1m0s: https://app.dev.codacy.org/manual/user/project/dashboard?bid=123
22:57:41 I httprunner.go:82> Starting http test for https://app.dev.codacy.org/manual/user/project/dashboard?bid=123 with 8 threads at 500.0 qps
22:57:41 W http_client.go:136> https requested, switching to standard go client
Starting at 500 qps with 8 thread(s) [gomax 4] for 1m0s : 3750 calls each (total 30000)
22:59:25 W periodic.go:487> T001 warning only did 257 out of 3750 calls before reaching 1m0s
22:59:25 I periodic.go:533> T001 ended after 1m0.072690186s : 257 calls. qps=4.278150340933027
22:59:25 W periodic.go:487> T002 warning only did 254 out of 3750 calls before reaching 1m0s
22:59:25 I periodic.go:533> T002 ended after 1m0.109367672s : 254 calls. qps=4.225630876471816
22:59:26 W periodic.go:487> T004 warning only did 258 out of 3750 calls before reaching 1m0s
22:59:26 I periodic.go:533> T004 ended after 1m0.133407105s : 258 calls. qps=4.290460368385607
22:59:26 W periodic.go:487> T006 warning only did 244 out of 3750 calls before reaching 1m0s
22:59:26 I periodic.go:533> T006 ended after 1m0.133490304s : 244 calls. qps=4.057639075438291
22:59:26 W periodic.go:487> T005 warning only did 249 out of 3750 calls before reaching 1m0s
22:59:26 I periodic.go:533> T005 ended after 1m0.265693232s : 249 calls. qps=4.131703903934942
22:59:26 W periodic.go:487> T007 warning only did 237 out of 3750 calls before reaching 1m0s
22:59:26 I periodic.go:533> T007 ended after 1m0.271662857s : 237 calls. qps=3.932196139374884
22:59:26 W periodic.go:487> T003 warning only did 255 out of 3750 calls before reaching 1m0s
22:59:26 I periodic.go:533> T003 ended after 1m0.272313488s : 255 calls. qps=4.230798276073633
22:59:27 W periodic.go:487> T000 warning only did 220 out of 3750 calls before reaching 1m0s
22:59:27 I periodic.go:533> T000 ended after 1m1.631718364s : 220 calls. qps=3.5695905588851025
Ended after 1m1.63173938s : 1974 calls. qps=32.029
Aggregated Sleep Time : count 1974 avg -26.514521 +/- 15.48 min -58.11073874 max -0.412190249 sum -52339.6648
# range, mid point, percentile, count
>= -58.1107 <= -0.41219 , -29.2615 , 100.00, 1974
# target 50% -29.2761
WARNING 100.00% of sleep were falling behind
Aggregated Function Time : count 1974 avg 0.24452006 +/- 0.2508 min 0.053575648 max 10.063352224 sum 482.682607
# range, mid point, percentile, count
>= 0.0535756 <= 0.06 , 0.0567878 , 0.30, 6
> 0.06 <= 0.07 , 0.065 , 0.51, 4
> 0.07 <= 0.08 , 0.075 , 0.56, 1
> 0.16 <= 0.18 , 0.17 , 1.01, 9
> 0.18 <= 0.2 , 0.19 , 12.61, 229
> 0.2 <= 0.25 , 0.225 , 83.84, 1406
> 0.25 <= 0.3 , 0.275 , 93.26, 186
> 0.3 <= 0.35 , 0.325 , 96.30, 60
> 0.35 <= 0.4 , 0.375 , 97.42, 22
> 0.4 <= 0.45 , 0.425 , 98.18, 15
> 0.45 <= 0.5 , 0.475 , 99.04, 17
> 0.5 <= 0.6 , 0.55 , 99.24, 4
> 0.6 <= 0.7 , 0.65 , 99.34, 2
> 0.8 <= 0.9 , 0.85 , 99.39, 1
> 1 <= 2 , 1.5 , 99.90, 10
> 2 <= 3 , 2.5 , 99.95, 1
> 10 <= 10.0634 , 10.0317 , 100.00, 1
# target 50% 0.226245
# target 75% 0.243794
# target 90% 0.282688
# target 99% 0.497824
# target 99.9% 2.026
Sockets used: 0 (for perfect keepalive, would be 8)
Code 200 : 1956 (99.1 %)
Code 400 : 13 (0.7 %)
Code 502 : 4 (0.2 %)
Code 504 : 1 (0.1 %)
Response Header Sizes : count 1974 avg 0 +/- 0 min 0 max 0 sum 0
Response Body/Total Sizes : count 1974 avg 41758.073 +/- 3379 min 138 max 42080 sum 82430436

I always get some errors when restarting pods during this test. This is causing trouble in our production application whenever we do rolling updates.

I noticed that the nginx ingress controller has the proxy-next-upstream configuration to specify the cases in which a request should be passed to the next server. Is there any way to do this with this load balancer? Should I use nginx instead?
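For comparison, this is roughly the nginx configuration I mean; just an illustrative sketch of the ingress-nginx annotations, not something I expect the ALB controller to support as-is:

# ingress-nginx only: retry a failed request on the next upstream
# (illustrative values)
nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout http_502 http_504"
nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "3"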

Thanks for the help.

@bmbferreira (Author)

Maybe related to #1064 and #976

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 4, 2020
@jebeaudet

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 4, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 4, 2020

ghost commented May 15, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 15, 2020

djreed commented Jul 10, 2020

+1 near-exact same situation

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 8, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 7, 2020
chancez (Contributor) commented Nov 7, 2020

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 7, 2020
kishorj (Collaborator) commented Nov 18, 2020

Both the v1 and v2 controllers support zero-downtime deployment. We need to document how to set it up.

@foriequal0

@kishorj Do you mean a sleep 30 in the preStop lifecycle hook?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 18, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 18, 2021
@janekbettinger

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 22, 2021

wu0407 commented Jun 10, 2021

when you deregister a target from your Network Load Balancer, it is expected to take 30-90 seconds to process the requested deregistration, after which it will no longer receive new connections. During this time the Elastic Load Balancing API will report the target in 'draining' state. The target will continue to receive new connections until the deregistration processing has completed. At the end of the configured deregistration delay, the target will not be included in the describe-target-health response for the Target Group, and will return 'unused' with reason 'Target.NotRegistered' when querying for the specific target.

You need to set at least a preStop sleep of 90 seconds.

#1064
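For example, something like this (values are illustrative; the termination grace period must outlast the preStop sleep, and the sleep assumes the image ships a shell):

spec:
  template:
    spec:
      # must be longer than the preStop sleep, or the pod is killed mid-sleep
      terminationGracePeriodSeconds: 120
      containers:
      - name: codacy-api
        lifecycle:
          preStop:
            exec:
              # keep serving while the target finishes deregistering
              command: ["/bin/sh", "-c", "sleep 90"]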


foriequal0 commented Jun 10, 2021

I tried to fix this in PR #1775, but I ended up with a separate package: https://github.com/foriequal0/pod-graceful-drain

kishorj (Collaborator) commented Jul 21, 2021

Filed a documentation issue for setting up zero-downtime deployment: #2131.
Closing this issue.

@kishorj kishorj closed this as completed Jul 21, 2021