[BUG]Backup will fail when the backuppolicy label selector didn't catch up cluster role change #4962

ahjing99 · 2023-09-01T07:10:03Z

➜ ~ kbcli version
Kubernetes: v1.27.3-gke.100
KubeBlocks: 0.7.0-alpha.4
kbcli: 0.7.0-alpha.4

Due to issue #4959, the role of the cluster's only pod changed from leader to follower to none, and seems the backuppolicy also changed accordingly, but it does not change immediately when the role changed. which cause the backup will failed if do the operation during the gap period

The backuppolicy labelselector changed:

Backup during the gap period will fail

Warning  ApplyResourcesFailed      9m45s                 cluster-controller         error invoking binding mysql/leaveMember: rpc error: code = Unavailable desc = error reading from server: EOF
  Normal   BackupJobCreate           6m3s (x2 over 13m)    cluster-controller         Create backupJob/mysqltest2-mysql-scaling
  Warning  Unhealthy                 92s (x4 over 6m32s)   event-controller           Pod mysqltest2-mysql-0: Readiness probe failed: error: health rpc failed: rpc error: code = Unknown desc = {"event":"Success","originalRole":"candidate","role":"candidate"}
  Warning  BackupFailed              92s (x18 over 5m59s)  cluster-controller         backup for horizontalScaling failed: can not find any pod to backup by labelsSelector

The text was updated successfully, but these errors were encountered:

github-actions · 2023-10-02T00:16:35Z

This issue has been marked as stale because it has been open for 30 days with no activity

ldming · 2023-10-23T07:54:55Z

This is by design. The cluster role is not labeled, so the backup controller can not find the target pod to back up.

Once the cluster restores to normal and has the proper role label, backups can proceed normally. The current backup fault tolerance is insufficient - a retry mechanism needs to be added for failures going forward.

ahjing99 added the kind/bug Something isn't working label Sep 1, 2023

ahjing99 added this to the Release 0.7.0 milestone Sep 1, 2023

ahjing99 assigned ldming Sep 1, 2023

apecloud-bot added the bug label Sep 1, 2023

ahjing99 changed the title ~~[BUG]Backup will fail when the backuppolicy label selector didn't cache up cluster role change~~ [BUG]Backup will fail when the backuppolicy label selector didn't catch up cluster role change Sep 1, 2023

github-actions bot added the Stale label Oct 2, 2023

ldming closed this as completed Oct 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG]Backup will fail when the backuppolicy label selector didn't catch up cluster role change #4962

[BUG]Backup will fail when the backuppolicy label selector didn't catch up cluster role change #4962

ahjing99 commented Sep 1, 2023

github-actions bot commented Oct 2, 2023

ldming commented Oct 23, 2023

[BUG]Backup will fail when the backuppolicy label selector didn't catch up cluster role change #4962

[BUG]Backup will fail when the backuppolicy label selector didn't catch up cluster role change #4962

Comments

ahjing99 commented Sep 1, 2023

github-actions bot commented Oct 2, 2023

ldming commented Oct 23, 2023