failed to fail-over resource kubernetes Version: v1.30.0 #59
Comments
This does not seem related to Kubernetes 1.30 or even the HA Controller. It looks like the node
At which point it did not encounter any issues. Failover times can be influenced by the amount of work the rest of the cluster has to handle. If a master node fails, failover times can be slower, as the k8s API in general gets slower.
Thanks for the reply. My cluster has 3 master nodes, and piraeus-operator was installed and configured following the docs. I simulated a node failure (shutdown, or unplugging the network cable; the k8s cluster stayed available), but Piraeus could not complete the failover: it keeps evicting indefinitely, already for more than 24 minutes. I will test with a worker node later.
root@master20:~# linstor v l
root@master21:~# poweroff
root@master20:~# kubectl get no
root@master22:~# linstor v l
root@master20:~# drbdadm status
root@master22:~# kubectl logs -n piraeus-datastore ha-controller-z59b2
When run on a node that is not a master, it cannot fail over either. Why? When I force-delete the Terminating pod, it can be scheduled to the Secondary node and start running!
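For reference, a force delete of a Pod stuck in Terminating can be issued like this; the pod name and namespace are taken from the message quoted later in this thread and are only placeholders:
# Force-delete a Pod stuck in Terminating so it can be rescheduled on another node.
# "default/test-sts-web-0" is used here purely as an example name.
kubectl -n default delete pod test-sts-web-0 --grace-period=0 --force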
You could try turning up the verbosity of the HA Controller to see what it tries to do. Edit the ...
spec:
  highAvailabilityController:
    podTemplate:
      spec:
        containers:
          - name: ha-controller
            args:
              - /agent
              - --v=3
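As a rough sketch of applying that change (assuming the piraeus-operator v2 LinstorCluster resource; the resource name used below is an assumption):
# Open the LinstorCluster resource and add the podTemplate snippet above
# under spec.highAvailabilityController (resource kind and name are assumptions).
kubectl edit linstorcluster linstorcluster

# The HA Controller pods should then restart with the extra verbosity flag.
kubectl -n piraeus-datastore get pods | grep ha-controller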
Pod 'default/test-sts-web-0' is exempt from eviction because of unsafe volumes
What does it mean?
Because the pod has a hostPath volume mounted, the HA Controller believes it can't fail over this volume. See https://github.com/piraeusdatastore/piraeus-ha-controller/blob/main/pkg/agent/reconcile_failover.go#L262-L296
Why? Because if you had a hostPath volume and you evicted the Pod and it started on another node, that volume would now have different content. At least that was the idea: only fail over Pods that have only "safe" volumes, i.e. DRBD volumes or other ephemeral volumes. Looks like in this case it would also be safe, as the /etc/localtime mount is readOnly... Perhaps we can improve that check. You can try running the ha controller with
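To see which volume triggers that check, the Pod's volume list can be inspected; a minimal sketch, reusing the pod name from the message above:
# Print the volumes of the affected Pod; any hostPath entry (here /etc/localtime)
# counts as "unsafe" for the HA Controller's fail-over check.
kubectl -n default get pod test-sts-web-0 -o jsonpath='{.spec.volumes}'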
Thank you! It can fail over when I remove the localtime volume! I unplug the network cable to simulate the server going down, and then plug it back in after a while to restore the network. I expect the Primary to become Secondary, but it doesn't. So I reboot the server, and then it becomes Secondary! How can it automatically change Primary to Secondary after the network is restored, without needing to reboot the server?
The HA Controller on the "old" Primary node should see that a Pod is stuck in suspend-io and force it to become secondary using
Sorry, should have been
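The exact command is missing above; purely as an illustration (an assumption about what the HA Controller would invoke, not a confirmed procedure), a manual recovery on the old Primary might look like this once the network is back:
# Check the DRBD resource state on the old Primary; the resource name is taken
# from the HA Controller logs further down.
drbdadm status pvc-5402447b-9617-4764-902b-93ae4cea6106

# Demote the stale Primary so it can resync with its peers. Whether a force
# option is required depends on the DRBD version (assumption).
drbdadm secondary pvc-5402447b-9617-4764-902b-93ae4cea6106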
Can it reconnect and recover automatically?
I0423 06:45:42.066690 1 agent.go:253] starting reconciliation
I0423 06:45:52.066708 1 agent.go:253] starting reconciliation
I0423 06:45:52.066824 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master22.host' has failed, evicting
W0423 06:46:05.985734 1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0423 06:46:05.985738 1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.VolumeAttachment ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0423 06:46:05.985770 1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0423 06:46:05.985738 1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.PersistentVolume ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
E0423 06:46:05.985948 1 reconcile_failover.go:141] "failed to fail-over resource" err="failed to apply node taint: Put "https://10.96.0.1:443/api/v1/nodes/master22.host?fieldManager=linstor.linbit.com%2Fhigh-availability-controller%2Fv2\": http2: client connection lost"
I0423 06:46:05.986006 1 agent.go:253] starting reconciliation
I0423 06:46:05.986111 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master22.host' has failed, evicting
E0423 06:46:05.997341 1 reconcile_failover.go:141] "failed to fail-over resource" err="failed force detach: volumeattachments.storage.k8s.io "csi-28b5875796ad4197fe5c795c0ce064930dc9536179e69c3d0edaaf92121ee99b" not found"
I0423 06:46:12.066698 1 agent.go:253] starting reconciliation
I0423 06:46:12.066840 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master22.host' has failed, evicting
I0423 06:46:22.067170 1 agent.go:253] starting reconciliation
I0423 06:46:22.067312 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master22.host' has failed, evicting
I0