
Velero backup partial failure with error: /etcdserver: leader changed #8310

Open · dbebarta opened this issue Oct 16, 2024 · 2 comments

@dbebarta

We have an EKS cluster in prod where our Velero backup is failing with the following error:

Errors:
  Velero:    message: /Error listing resources error: /etcdserver: leader changed
  Cluster:    <none>
  Namespaces:
    <namespace-1>:   resource: /trafficsplits message: /Error listing items error: /etcdserver: leader changed

We connected with the AWS EKS team, and they said there was no etcd leader change, but they did identify a spike in ETCDRequestsReceived activity.

We have seen this issue in prod multiple times. Can someone please help us investigate it?

@blackpiglet
Contributor

This is still related to how the kube-apiserver handles Velero's requests.
Could you check the EKS control plane nodes' resource allocation?
We also need to know your backup scenario and the EKS cluster scale.

It would be better to collect a debug bundle with `velero debug` to help investigate this issue.
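For reference, a minimal sketch of collecting the bundle; the backup name is a placeholder and should be replaced with the partially failed backup reported by `velero backup get`:

```bash
# Find the partially failed backup (the name used below is a placeholder).
velero backup get

# Collect a debug tarball containing Velero logs and related resource definitions.
velero debug --backup <failed-backup-name>
```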

@dbebarta
Author

We have 4 clusters, each with 100+ namespaces, and we back up all the resources in every namespace. Each namespace has 3 PVs and around 21 pods.

The Velero backup is scheduled to run every 12 hours for each cluster (a rough sketch of such a schedule is below).

We have more than 150 worker nodes in each cluster.
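For context, roughly what the schedule looks like; the schedule name, namespace selection, and flags here are illustrative placeholders, not our exact prod configuration:

```bash
# Illustrative only: the name and flags are placeholders, not the exact prod config.
velero schedule create all-namespaces-12h \
  --schedule "0 */12 * * *" \
  --include-namespaces '*' \
  --snapshot-volumes
```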

Also wanted to point out what AWS told us when we checked with them at the time we saw the issue:

> Per our internal investigations, we analyzed ETCD health and identified that there was a spike in ETCDRequestsReceived activity (spiked to ~4.3K requests).
>
> However, overall control plane metrics show no evidence that an etcd leader change occurred. Additionally, we confirmed that no control plane scaling or recycling of control plane nodes occurred during the time frame.
I will share the `velero debug` tar file for the backup after verifying the data.
