Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]Backup will fail when the backuppolicy label selector didn't catch up cluster role change #4962

Closed
ahjing99 opened this issue Sep 1, 2023 · 2 comments
Assignees
Labels
bug kind/bug Something isn't working Stale
Milestone

Comments

@ahjing99
Copy link
Collaborator

ahjing99 commented Sep 1, 2023

➜ ~ kbcli version
Kubernetes: v1.27.3-gke.100
KubeBlocks: 0.7.0-alpha.4
kbcli: 0.7.0-alpha.4

Due to issue #4959, the role of the cluster's only pod changed from leader to follower to none, and seems the backuppolicy also changed accordingly, but it does not change immediately when the role changed. which cause the backup will failed if do the operation during the gap period

The backuppolicy labelselector changed:

img_v2_56065fc0-c4b5-47de-aa67-f53cf19fd9bg

img_v2_b5df7f46-f8d1-4443-9412-059d92be9d9g

Backup during the gap period will fail

Warning  ApplyResourcesFailed      9m45s                 cluster-controller         error invoking binding mysql/leaveMember: rpc error: code = Unavailable desc = error reading from server: EOF
  Normal   BackupJobCreate           6m3s (x2 over 13m)    cluster-controller         Create backupJob/mysqltest2-mysql-scaling
  Warning  Unhealthy                 92s (x4 over 6m32s)   event-controller           Pod mysqltest2-mysql-0: Readiness probe failed: error: health rpc failed: rpc error: code = Unknown desc = {"event":"Success","originalRole":"candidate","role":"candidate"}
  Warning  BackupFailed              92s (x18 over 5m59s)  cluster-controller         backup for horizontalScaling failed: can not find any pod to backup by labelsSelector
@ahjing99 ahjing99 added the kind/bug Something isn't working label Sep 1, 2023
@ahjing99 ahjing99 added this to the Release 0.7.0 milestone Sep 1, 2023
@ahjing99 ahjing99 changed the title [BUG]Backup will fail when the backuppolicy label selector didn't cache up cluster role change [BUG]Backup will fail when the backuppolicy label selector didn't catch up cluster role change Sep 1, 2023
@github-actions
Copy link

github-actions bot commented Oct 2, 2023

This issue has been marked as stale because it has been open for 30 days with no activity

@github-actions github-actions bot added the Stale label Oct 2, 2023
@ldming
Copy link
Collaborator

ldming commented Oct 23, 2023

This is by design. The cluster role is not labeled, so the backup controller can not find the target pod to back up.

Once the cluster restores to normal and has the proper role label, backups can proceed normally. The current backup fault tolerance is insufficient - a retry mechanism needs to be added for failures going forward.

@ldming ldming closed this as completed Oct 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug kind/bug Something isn't working Stale
Projects
None yet
Development

No branches or pull requests

3 participants