
400/502/504 errors while doing rollout restart or rolling update #1065

Closed
bmbferreira opened this issue Nov 6, 2019 · 17 comments

bmbferreira commented Nov 6, 2019

Hi, I'm getting errors while doing a rolling update, and I can reproduce the problem consistently with a rollout restart. I have already tried the recommendations from other issues related to this problem (#814), such as adding a preStop hook that sleeps for a few seconds so the pods can finish in-flight requests, but it doesn't solve the problem.

I have also changed the load balancer configuration so that the health check interval and threshold count are lower than what I'm setting for the pod's readiness probe, so the load balancer can stop sending requests to a pod before it receives the SIGTERM, but without success.
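For reference, these are the kinds of settings I mean. The values below are illustrative only, not my exact configuration, and the preStop sleep assumes the container image ships a shell:

# Ingress annotations: make the ALB health check react faster than the
# pod's readiness probe (illustrative values)
alb.ingress.kubernetes.io/healthcheck-interval-seconds: "5"
alb.ingress.kubernetes.io/healthy-threshold-count: "2"
alb.ingress.kubernetes.io/unhealthy-threshold-count: "2"

# Container spec: preStop hook so the pod keeps serving while the ALB
# deregisters it (assumes /bin/sh exists in the image)
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 30"]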

Currently this is the configuration for the ingress, service and deployment:

Ingress

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig":
      { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
    alb.ingress.kubernetes.io/certificate-arn: ...
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/scheme: internet-facing
    external-dns.alpha.kubernetes.io/hostname: app.dev.codacy.org, api.dev.codacy.org
    external-dns.alpha.kubernetes.io/scope: public
    kubernetes.io/ingress.class: alb
  labels:
    app.kubernetes.io/instance: codacy-api
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: codacy-api
    app.kubernetes.io/version: 4.93.0-SNAPSHOT.d94f47083
    helm.sh/chart: codacy-api-4.93.0-SNAPSHOT.d94f47083
  name: codacy-api
  namespace: codacy
spec:
  rules:
  - host: app.dev.codacy.org
    http:
      paths:
      - backend:
          serviceName: ssl-redirect
          servicePort: use-annotation
        path: /*
      - backend:
          serviceName: codacy-api
          servicePort: http
        path: /*
  - host: api.dev.codacy.org
    http:
      paths:
      - backend:
          serviceName: ssl-redirect
          servicePort: use-annotation
        path: /*
      - backend:
          serviceName: codacy-api
          servicePort: http
        path: /*

Service

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: 0.0.0.0/0
  labels:
    app.kubernetes.io/instance: codacy-api
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: codacy-api
    app.kubernetes.io/version: 4.94.0-SNAPSHOT.394a06196
    helm.sh/chart: codacy-api-4.94.0-SNAPSHOT.394a06196
  name: codacy-api
  namespace: codacy
spec:
  clusterIP: 172.20.101.186
  ports:
  - name: http
    nodePort: 30057
    port: 80
    targetPort: http
  selector:
    app.kubernetes.io/instance: codacy-api
    app.kubernetes.io/name: codacy-api
  type: NodePort

Deployment

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "202"
  labels:
    app.kubernetes.io/instance: codacy-api
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: codacy-api
    app.kubernetes.io/version: 4.94.0-SNAPSHOT.394a06196
    helm.sh/chart: codacy-api-4.94.0-SNAPSHOT.394a06196
  name: codacy-api
  namespace: codacy
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: codacy-api
      app.kubernetes.io/name: codacy-api
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        iam.amazonaws.com/role: ...
        kubectl.kubernetes.io/restartedAt: "2019-11-06T10:54:32Z"
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: codacy-api
        app.kubernetes.io/name: codacy-api
    spec:
      containers:
      - envFrom:
        - configMapRef:
            name: codacy-api
        - secretRef:
            name: codacy-api
        image: codacy/codacy-website:4.94.0-SNAPSHOT.394a06196
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: http
            scheme: HTTP
          initialDelaySeconds: 75
          periodSeconds: 20
          successThreshold: 1
          timeoutSeconds: 1
        name: codacy-api
        ports:
        - containerPort: 9000
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: http
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: docker-credentials
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 60

To replicate this issue I use fortio to call an endpoint on the application continuously for some time (like this: fortio load -a -c 8 -qps 500 -t 60s "https://app.dev.codacy.org/manual/user/project/dashboard?bid=123"), and meanwhile I run kubectl rollout restart deployment/codacy-api -n codacy to restart the pods.

At the end there are some errors caused by the rollout restart:

Fortio 1.3.1 running at 500 queries per second, 4->4 procs, for 1m0s: https://app.dev.codacy.org/manual/user/project/dashboard?bid=123
22:57:41 I httprunner.go:82> Starting http test for https://app.dev.codacy.org/manual/user/project/dashboard?bid=123 with 8 threads at 500.0 qps
22:57:41 W http_client.go:136> https requested, switching to standard go client
Starting at 500 qps with 8 thread(s) [gomax 4] for 1m0s : 3750 calls each (total 30000)
22:59:25 W periodic.go:487> T001 warning only did 257 out of 3750 calls before reaching 1m0s
22:59:25 I periodic.go:533> T001 ended after 1m0.072690186s : 257 calls. qps=4.278150340933027
22:59:25 W periodic.go:487> T002 warning only did 254 out of 3750 calls before reaching 1m0s
22:59:25 I periodic.go:533> T002 ended after 1m0.109367672s : 254 calls. qps=4.225630876471816
22:59:26 W periodic.go:487> T004 warning only did 258 out of 3750 calls before reaching 1m0s
22:59:26 I periodic.go:533> T004 ended after 1m0.133407105s : 258 calls. qps=4.290460368385607
22:59:26 W periodic.go:487> T006 warning only did 244 out of 3750 calls before reaching 1m0s
22:59:26 I periodic.go:533> T006 ended after 1m0.133490304s : 244 calls. qps=4.057639075438291
22:59:26 W periodic.go:487> T005 warning only did 249 out of 3750 calls before reaching 1m0s
22:59:26 I periodic.go:533> T005 ended after 1m0.265693232s : 249 calls. qps=4.131703903934942
22:59:26 W periodic.go:487> T007 warning only did 237 out of 3750 calls before reaching 1m0s
22:59:26 I periodic.go:533> T007 ended after 1m0.271662857s : 237 calls. qps=3.932196139374884
22:59:26 W periodic.go:487> T003 warning only did 255 out of 3750 calls before reaching 1m0s
22:59:26 I periodic.go:533> T003 ended after 1m0.272313488s : 255 calls. qps=4.230798276073633
22:59:27 W periodic.go:487> T000 warning only did 220 out of 3750 calls before reaching 1m0s
22:59:27 I periodic.go:533> T000 ended after 1m1.631718364s : 220 calls. qps=3.5695905588851025
Ended after 1m1.63173938s : 1974 calls. qps=32.029
Aggregated Sleep Time : count 1974 avg -26.514521 +/- 15.48 min -58.11073874 max -0.412190249 sum -52339.6648
# range, mid point, percentile, count
>= -58.1107 <= -0.41219 , -29.2615 , 100.00, 1974
# target 50% -29.2761
WARNING 100.00% of sleep were falling behind
Aggregated Function Time : count 1974 avg 0.24452006 +/- 0.2508 min 0.053575648 max 10.063352224 sum 482.682607
# range, mid point, percentile, count
>= 0.0535756 <= 0.06 , 0.0567878 , 0.30, 6
> 0.06 <= 0.07 , 0.065 , 0.51, 4
> 0.07 <= 0.08 , 0.075 , 0.56, 1
> 0.16 <= 0.18 , 0.17 , 1.01, 9
> 0.18 <= 0.2 , 0.19 , 12.61, 229
> 0.2 <= 0.25 , 0.225 , 83.84, 1406
> 0.25 <= 0.3 , 0.275 , 93.26, 186
> 0.3 <= 0.35 , 0.325 , 96.30, 60
> 0.35 <= 0.4 , 0.375 , 97.42, 22
> 0.4 <= 0.45 , 0.425 , 98.18, 15
> 0.45 <= 0.5 , 0.475 , 99.04, 17
> 0.5 <= 0.6 , 0.55 , 99.24, 4
> 0.6 <= 0.7 , 0.65 , 99.34, 2
> 0.8 <= 0.9 , 0.85 , 99.39, 1
> 1 <= 2 , 1.5 , 99.90, 10
> 2 <= 3 , 2.5 , 99.95, 1
> 10 <= 10.0634 , 10.0317 , 100.00, 1
# target 50% 0.226245
# target 75% 0.243794
# target 90% 0.282688
# target 99% 0.497824
# target 99.9% 2.026
Sockets used: 0 (for perfect keepalive, would be 8)
Code 200 : 1956 (99.1 %)
Code 400 : 13 (0.7 %)
Code 502 : 4 (0.2 %)
Code 504 : 1 (0.1 %)
Response Header Sizes : count 1974 avg 0 +/- 0 min 0 max 0 sum 0
Response Body/Total Sizes : count 1974 avg 41758.073 +/- 3379 min 138 max 42080 sum 82430436

I always get some errors when restarting pods during this test. This is causing trouble in our production application whenever we do rolling updates.

I noticed that the nginx ingress controller has the proxy-next-upstream configuration to specify the cases in which a request should be passed to the next server. Is there any way to do this with this load balancer? Should I use nginx instead?
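For comparison, this is roughly the nginx configuration I mean; just an illustrative sketch of the ingress-nginx annotations, not something I expect the ALB controller to support as-is:

# ingress-nginx only: retry a failed request on the next upstream
# (illustrative values)
nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout http_502 http_504"
nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "3"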

Thanks for the help.

@bmbferreira (Author)

Maybe related to #1064 and #976

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 4, 2020
@jebeaudet

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 4, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 4, 2020

ghost commented May 15, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 15, 2020

djreed commented Jul 10, 2020

+1 near-exact same situation

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 8, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 7, 2020
chancez (Contributor) commented Nov 7, 2020

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 7, 2020
kishorj (Collaborator) commented Nov 18, 2020

Both the v1 and v2 controllers support zero-downtime deployment. We need to document how to set it up.

@foriequal0

@kishorj Do you mean a sleep 30 in the preStop lifecycle hook?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 18, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 18, 2021
@janekbettinger

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 22, 2021

wu0407 commented Jun 10, 2021

when you deregister a target from your Network Load Balancer, it is expected to take 30-90 seconds to process the requested deregistration, after which it will no longer receive new connections. During this time the Elastic Load Balancing API will report the target in 'draining' state. The target will continue to receive new connections until the deregistration processing has completed. At the end of the configured deregistration delay, the target will not be included in the describe-target-health response for the Target Group, and will return 'unused' with reason 'Target.NotRegistered' when querying for the specific target.

You need to set at least a preStop sleep of 90 seconds.

#1064
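For example, something like this (values are illustrative; the termination grace period must outlast the preStop sleep, and the sleep assumes the image ships a shell):

spec:
  template:
    spec:
      # must be longer than the preStop sleep, or the pod is killed mid-sleep
      terminationGracePeriodSeconds: 120
      containers:
      - name: codacy-api
        lifecycle:
          preStop:
            exec:
              # keep serving while the target finishes deregistering
              command: ["/bin/sh", "-c", "sleep 90"]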


foriequal0 commented Jun 10, 2021

I tried to fix this in PR #1775, but I ended up with a separate package: https://github.com/foriequal0/pod-graceful-drain

kishorj (Collaborator) commented Jul 21, 2021

Filed a documentation issue for setting up zero-downtime deployment: #2131.
Closing this issue.

@kishorj kishorj closed this as completed Jul 21, 2021