
Velero backup partial failure with error: /etcdserver: leader changed #8310

Open · dbebarta opened this issue Oct 16, 2024 · 2 comments

@dbebarta

We have an EKS cluster in prod where our Velero backup is failing with the following error:

Errors:
  Velero:    message: /Error listing resources error: /etcdserver: leader changed
  Cluster:    <none>
  Namespaces:
    <namespace-1>:   resource: /trafficsplits message: /Error listing items error: /etcdserver: leader changed

We connected with the AWS EKS team, and they said there was no etcd leader change, but they did identify a spike in ETCDRequestsReceived activity.

We have seen this issue in prod multiple times. Can someone please help us investigate it?

@blackpiglet
Contributor

This is still related to how the kube-apiserver handles Velero's requests.
Could you check the EKS control plane nodes' resource allocation?
We also need to know your backup scenario and the EKS cluster scale.

It would be better to collect a debug bundle with `velero debug` to help investigate this issue.
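For reference, a minimal sketch of collecting the bundle; the backup name is a placeholder and should be replaced with the partially failed backup reported by `velero backup get`:

```bash
# Find the partially failed backup (the name used below is a placeholder).
velero backup get

# Collect a debug tarball containing Velero logs and related resource definitions.
velero debug --backup <failed-backup-name>
```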

@dbebarta
Author

We have 4 clusters, each with 100+ namespaces, and we back up all the resources in every namespace. Each namespace has 3 PVs and around 21 pods.

The Velero backup is scheduled to run every 12 hours for each cluster (a rough sketch of such a schedule is below).

We have more than 150 worker nodes in each cluster.
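For context, roughly what the schedule looks like; the schedule name, namespace selection, and flags here are illustrative placeholders, not our exact prod configuration:

```bash
# Illustrative only: the name and flags are placeholders, not the exact prod config.
velero schedule create all-namespaces-12h \
  --schedule "0 */12 * * *" \
  --include-namespaces '*' \
  --snapshot-volumes
```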

Also wanted to point out what AWS told us when we checked with them at the time we saw the issue:

> Per our internal investigations, we analyzed ETCD health and identified that there was a spike in ETCDRequestsReceived activity (spiked to ~4.3K requests).
>
> However, overall control plane metrics show no evidence that an etcd leader change occurred. Additionally, we confirmed that no control plane scaling or recycling of control plane nodes occurred during the time frame.
I will share the `velero debug` tar file for the backup after verifying the data.
