
failed to fail-over resource kubernetes Version: v1.30.0 #59

Open
yeshl opened this issue Apr 23, 2024 · 12 comments
Comments

@yeshl

yeshl commented Apr 23, 2024

I0423 06:45:42.066690 1 agent.go:253] starting reconciliation
I0423 06:45:52.066708 1 agent.go:253] starting reconciliation
I0423 06:45:52.066824 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master22.host' has failed, evicting
W0423 06:46:05.985734 1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0423 06:46:05.985738 1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.VolumeAttachment ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0423 06:46:05.985770 1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0423 06:46:05.985738 1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.PersistentVolume ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
E0423 06:46:05.985948 1 reconcile_failover.go:141] "failed to fail-over resource" err="failed to apply node taint: Put "https://10.96.0.1:443/api/v1/nodes/master22.host?fieldManager=linstor.linbit.com%2Fhigh-availability-controller%2Fv2\": http2: client connection lost"
I0423 06:46:05.986006 1 agent.go:253] starting reconciliation
I0423 06:46:05.986111 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master22.host' has failed, evicting
E0423 06:46:05.997341 1 reconcile_failover.go:141] "failed to fail-over resource" err="failed force detach: volumeattachments.storage.k8s.io "csi-28b5875796ad4197fe5c795c0ce064930dc9536179e69c3d0edaaf92121ee99b" not found"
I0423 06:46:12.066698 1 agent.go:253] starting reconciliation
I0423 06:46:12.066840 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master22.host' has failed, evicting
I0423 06:46:22.067170 1 agent.go:253] starting reconciliation
I0423 06:46:22.067312 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master22.host' has failed, evicting
I0

@WanzenBug
Member

unable to decode an event from the watch stream: http2: client connection lost

This does not seem related to Kubernetes 1.30 or even the HA Controller. It looks like the node master22.host went away, which was probably also hosting the Kubernetes control plane. The HA Controller will simply retry later, which it indeed did:

I0423 06:46:12.066698 1 agent.go:253] starting reconciliation
I0423 06:46:12.066840 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master22.host' has failed, evicting
I0423 06:46:22.067170 1 agent.go:253] starting reconciliation
I0423 06:46:22.067312 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master22.host' has failed, evicting

At that point it did not encounter any issues. Failover times are influenced by how much work the rest of the cluster has to handle; if a master node fails, failover can take longer, as the Kubernetes API in general gets slower.

@yeshl
Author

yeshl commented Apr 24, 2024

Thanks for the reply. My cluster has 3 master nodes, and piraeus-operator was installed and configured following the docs. I simulated a node failure (shutdown or unplugging the network cable; the k8s cluster remains available), but Piraeus cannot complete the failover: it keeps evicting endlessly, now for more than 24 minutes. I will test with a worker node later.

root@master20:~# linstor v l
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| Node          | Resource                                 | StoragePool          | VolNr | MinorNr | DeviceName    | Allocated | InUse  |      State |
|=====================================================================================================================================================|
| master20.host | pvc-5402447b-9617-4764-902b-93ae4cea6106 | DfltDisklessStorPool |     0 |    1000 | /dev/drbd1000 |           | Unused | TieBreaker |
| master21.host | pvc-5402447b-9617-4764-902b-93ae4cea6106 | pool-01              |     0 |    1000 | /dev/drbd1000 | 65.21 MiB | InUse  |   UpToDate |
| master22.host | pvc-5402447b-9617-4764-902b-93ae4cea6106 | pool-01              |     0 |    1000 | /dev/drbd1000 | 65.34 MiB | Unused |   UpToDate |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+

root@master21:~# poweroff

root@master20:~# kubectl get no
NAME            STATUS     ROLES           AGE     VERSION
master20.host   Ready      control-plane   2d12h   v1.30.0
master21.host   NotReady   control-plane   2d11h   v1.30.0
master22.host   Ready      control-plane   2d11h   v1.30.0

root@master22:~# linstor v l
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| Node          | Resource                                 | StoragePool          | VolNr | MinorNr | DeviceName    | Allocated | InUse  |      State |
|=====================================================================================================================================================|
| master20.host | pvc-5402447b-9617-4764-902b-93ae4cea6106 | DfltDisklessStorPool |     0 |    1000 | /dev/drbd1000 |           | Unused | TieBreaker |
| master21.host | pvc-5402447b-9617-4764-902b-93ae4cea6106 | pool-01              |     0 |    1000 | /dev/drbd1000 |  2.00 GiB |        |    Unknown |
| master22.host | pvc-5402447b-9617-4764-902b-93ae4cea6106 | pool-01              |     0 |    1000 | /dev/drbd1000 | 65.34 MiB | Unused |   UpToDate |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+

root@master20:~# drbdadm status
pvc-5402447b-9617-4764-902b-93ae4cea6106 role:Secondary
  disk:Diskless
  master21.host connection:Connecting
  master22.host role:Secondary
    peer-disk:UpToDate

root@master22:~# drbdadm status
pvc-5402447b-9617-4764-902b-93ae4cea6106 role:Secondary
  disk:UpToDate
  master20.host role:Secondary
    peer-disk:Diskless
  master21.host connection:Connecting
root@master22:~# kubectl get pod -n piraeus-datastore -o wide
NAME                  READY   STATUS    RESTARTS      AGE   IP             NODE            NOMINATED NODE   READINESS GATES
ha-controller-7bdmq   1/1     Running   4 (16h ago)   35h   10.244.2.138   master22.host
ha-controller-g2hlz   1/1     Running   3 (14h ago)   35h   10.244.1.26    master21.host
ha-controller-z59b2   1/1     Running   1 (20h ago)   35h   10.244.0.250   master20.host
root@master22:~# kubectl logs -n piraeus-datastore ha-controller-7bdmq
I0423 23:57:52.307596 1 agent.go:253] starting reconciliation
I0423 23:58:02.307055 1 agent.go:253] starting reconciliation
I0423 23:58:12.307170 1 agent.go:253] starting reconciliation
I0423 23:58:22.307161 1 agent.go:253] starting reconciliation
I0423 23:58:22.307298 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0423 23:58:32.307523 1 agent.go:253] starting reconciliation
I0423 23:58:32.307664 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0423 23:58:42.307155 1 agent.go:253] starting reconciliation
I0423 23:58:42.307310 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
...(omitted)
I0424 00:21:52.307448 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0424 00:22:02.307098 1 agent.go:253] starting reconciliation
I0424 00:22:02.307237 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0424 00:22:12.306998 1 agent.go:253] starting reconciliation
I0424 00:22:12.307112 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting

root@master22:~# kubectl logs -n piraeus-datastore ha-controller-z59b2
I0423 23:58:02.067259 1 agent.go:253] starting reconciliation
I0423 23:58:12.067417 1 agent.go:253] starting reconciliation
I0423 23:58:22.067711 1 agent.go:253] starting reconciliation
I0423 23:58:22.067900 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0423 23:58:32.067219 1 agent.go:253] starting reconciliation
I0423 23:58:32.067359 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0423 23:58:42.066787 1 agent.go:253] starting reconciliation
I0423 23:58:42.066917 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0423 23:58:52.066917 1 agent.go:253] starting reconciliation
I0423 23:58:52.067053 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0423 23:59:02.067629 1 agent.go:253] starting reconciliation
...(omitted)
I0424 00:24:02.067200 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0424 00:24:12.067010 1 agent.go:253] starting reconciliation
I0424 00:24:12.067144 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0424 00:24:22.067629 1 agent.go:253] starting reconciliation
I0424 00:24:22.067752 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting
I0424 00:24:32.067570 1 agent.go:253] starting reconciliation
I0424 00:24:32.067823 1 reconcile_failover.go:137] resource 'pvc-5402447b-9617-4764-902b-93ae4cea6106' on node 'master21.host' has failed, evicting

@yeshl
Author

yeshl commented Apr 25, 2024

When it runs on a non-master node, it cannot fail over either. Why? When I force-delete the Terminating pod, it gets scheduled to the secondary node and starts running!
I0425 07:09:31.328542 1 agent.go:253] starting reconciliation
I0425 07:09:31.328717 1 reconcile_failover.go:137] resource 'pvc-a28ab865-bf44-4c53-9408-303694756133' on node 'node50.host' has failed, evicting
I0425 07:09:41.328786 1 agent.go:253] starting reconciliation
I0425 07:09:41.328921 1 reconcile_failover.go:137] resource 'pvc-a28ab865-bf44-4c53-9408-303694756133' on node 'node50.host' has failed, evicting
I0425 07:09:51.328672 1 agent.go:253] starting reconciliation
I0425 07:09:51.328824 1 reconcile_failover.go:137] resource 'pvc-a28ab865-bf44-4c53-9408-303694756133' on node 'node50.host' has failed, evicting
I0425 07:10:01.329536 1 agent.go:253] starting reconciliation
I0425 07:10:01.329677 1 reconcile_failover.go:137] resource 'pvc-a28ab865-bf44-4c53-9408-303694756133' on node 'node50.host' has failed, evicting
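
For reference, force-deleting a Pod that is stuck in Terminating, as described above, is usually done with the standard kubectl flags (the pod name here is only an example from this setup):

kubectl delete pod test-sts-web-0 --grace-period=0 --force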

@yeshl
Author

yeshl commented Apr 25, 2024

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: sc-piraeus-r2-ha
provisioner: linstor.csi.linbit.com
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
parameters:
  csi.storage.k8s.io/fstype: xfs
  linstor.csi.linbit.com/storagePool: pool-01
  linstor.csi.linbit.com/placementCount: "2"
  linstor.csi.linbit.com/allowRemoteVolumeAccess: "false"
  property.linstor.csi.linbit.com/DrbdOptions/auto-quorum: suspend-io
  property.linstor.csi.linbit.com/DrbdOptions/Resource/on-no-data-accessible: suspend-io
  property.linstor.csi.linbit.com/DrbdOptions/Resource/on-suspended-primary-outdated: force-secondary
  property.linstor.csi.linbit.com/DrbdOptions/Net/rr-conflict: retry-connect
---
apiVersion: v1
kind: Service
metadata:
  name: test-svc-web
spec:
  ports:
    - port: 80
      name: web
  clusterIP: None
  selector:
    app: web
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: test-sts-web
spec:
  selector:
    matchLabels:
      app: web
  serviceName: "test-svc-web"
  replicas: 1
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.25.5-alpine
          ports:
            - containerPort: 80
          volumeMounts:
            - name: pvc
              mountPath: /mnt/data
  volumeClaimTemplates:
    - metadata:
        name: pvc
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: sc-piraeus-r2-ha
        resources:
          requests:
            storage: 2Gi
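
The DrbdOptions properties set in this StorageClass should be visible on the LINSTOR objects created for the PVC. A sketch of checking them on the resource definition with the standard LINSTOR client, using the resource name from the previous comment (depending on the CSI driver version, the properties may instead sit on the resource group for the storage class):

linstor resource-definition list-properties pvc-a28ab865-bf44-4c53-9408-303694756133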

@WanzenBug
Member

You could try turning up the verbosity of the HA Controller to see what it tries to do. Edit the LinstorCluster resource to contain:

...
spec:
  highAvailabilityController:
    podTemplate:
      spec:
        containers:
        - name: ha-controller
          args:
          - /agent
          - --v=3   
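
For reference, a sketch of the full resource, assuming the piraeus-operator v2 defaults (API group piraeus.io/v1 and a single LinstorCluster named linstorcluster); it can be applied with kubectl apply -f or pasted in via kubectl edit linstorcluster linstorcluster:

apiVersion: piraeus.io/v1
kind: LinstorCluster
metadata:
  name: linstorcluster
spec:
  highAvailabilityController:
    podTemplate:
      spec:
        containers:
        - name: ha-controller
          args:
          - /agent
          - --v=3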

@yeshl
Author

yeshl commented Apr 25, 2024

I now get "Pod 'default/test-sts-web-0' is exempt from eviction because of unsafe volumes". What does it mean?

containers:
        - name: c-web-server
          image: busybox
          imagePullPolicy: IfNotPresent #default Always
          env:
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: SVC_NAME
              value: "svc-headless"
            - name: DEFAULT_TZ
              value: "Asia/Shanghai"
          command:
            - sh
            - '-c'
            - |-
              trap 'exit 0' SIGTERM
              #rm /mnt/data/index.html
              while true; do
                echo [$(date "+%Y-%m-%d %T")] - $HOSTNAME - $POD_IP '<br>'  |tee -a /mnt/data/index.html
                #touch  /mnt/data/f-$(date +"%Y-%m-%d_%H-%M-%S").txt
                sleep 10
              done
          volumeMounts:
            - name: localtime
              mountPath: /etc/localtime
              readOnly: true
            - name: pvc
              mountPath: /mnt/data
        - name: nginx
          image: nginx:1.25.5-alpine
          env:
            - name: TZ
              value: "Asia/Shanghai"
          ports:
            - containerPort: 80
          volumeMounts:
            - name: conf
              mountPath: /etc/nginx/conf.d/default.conf
              subPath: fileserver.conf
            - name: pvc
              mountPath: /mnt/data
      volumes:
        - name: localtime
          hostPath:
            type: File
            path: /etc/localtime
        - name: conf
          configMap:
            name: test-cm-nginx
#            defaultMode: 0755
  volumeClaimTemplates:
    - metadata:
        name: pvc
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: sc-piraeus-r2-ha
        resources:
          requests:
            storage: 2Gi

@WanzenBug
Member

Because the Pod has a hostPath volume mounted, the HA Controller believes it can't safely fail it over. See https://github.com/piraeusdatastore/piraeus-ha-controller/blob/main/pkg/agent/reconcile_failover.go#L262-L296

Why? Because if the Pod had a hostPath volume and it was evicted and started on another node, that volume would now have different content. At least that was the idea: only fail over Pods that only have "safe" volumes, i.e. DRBD volumes or other ephemeral volumes.

It looks like this case would also be safe, as /etc/localtime is mounted readOnly... Perhaps we can improve that check.

You can try running the HA Controller with --fail-over-unsafe-pods and see if it works then.
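
A sketch of passing that flag through the same LinstorCluster podTemplate as in the verbosity example above (only the extra argument differs; the flag name is taken from this comment):

spec:
  highAvailabilityController:
    podTemplate:
      spec:
        containers:
        - name: ha-controller
          args:
          - /agent
          - --fail-over-unsafe-pods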

@yeshl
Author

yeshl commented Apr 25, 2024

Thank you! It can fail over when I remove the localtime volume! I unplugged the network cable to simulate the server going down, then plugged it back in after a while to restore the network. I expected the primary to become secondary, but it didn't. So I rebooted the server, and then it became secondary! How can the primary automatically change to secondary after the network is restored, without rebooting the server?

+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| Node          | Resource                                 | StoragePool          | VolNr | MinorNr | DeviceName    | Allocated | InUse  |      State |
|=====================================================================================================================================================|
| master20.host | pvc-f7d167b9-f486-4cc8-8281-4e7d304819c4 | pool-01              |     0 |    1000 | /dev/drbd1000 |  2.62 MiB | InUse  |   UpToDate |
| master21.host | pvc-f7d167b9-f486-4cc8-8281-4e7d304819c4 | DfltDisklessStorPool |     0 |    1000 | /dev/drbd1000 |           | Unused | TieBreaker |
| master22.host | pvc-f7d167b9-f486-4cc8-8281-4e7d304819c4 | pool-01              |     0 |    1000 | /dev/drbd1000 | 64.80 MiB |        |    Unknown |

@WanzenBug
Member

The HA Controller on the "old" Primary node should see that a Pod is stuck in suspend-io and force it to become secondary using drbdadm secondary --force.

@yeshl
Author

yeshl commented Apr 26, 2024

+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| Node          | Resource                                 | StoragePool          | VolNr | MinorNr | DeviceName    | Allocated | InUse  |      State |
|=====================================================================================================================================================|
| master20.host | pvc-2028ef6c-82a1-4e7f-8bdf-4179cee1bbd9 | DfltDisklessStorPool |     0 |    1000 | /dev/drbd1000 |           | Unused | TieBreaker |
| node50.host   | pvc-2028ef6c-82a1-4e7f-8bdf-4179cee1bbd9 | pool-01              |     0 |    1000 | /dev/drbd1000 | 64.80 MiB |        |    Unknown |
| node51.host   | pvc-2028ef6c-82a1-4e7f-8bdf-4179cee1bbd9 | pool-01              |     0 |    1000 | /dev/drbd1000 |  2.62 MiB | InUse  |   UpToDate |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
root@master20:~# kubectl get po -n piraeus-datastore -o wide|grep ha
ha-controller-6zlcf                                    1/1     Running   0          27m   10.244.2.250   master22.host   <none>           <none>
ha-controller-7fjnl                                    1/1     Running   0          27m   10.244.4.83    node51.host     <none>           <none>
ha-controller-7p82f                                    1/1     Running   0          27m   10.244.0.118   master20.host   <none>           <none>
ha-controller-bb47w                                    1/1     Running   0          27m   10.244.3.130   node50.host     <none>           <none>
ha-controller-ltjjt                                    1/1     Running   0          27m   10.244.1.126   master21.host   <none>           <none>
root@master20:~# kubectl  -n piraeus-datastore exec ha-controller-bb47w -- drbdadm status
pvc-2028ef6c-82a1-4e7f-8bdf-4179cee1bbd9 role:Secondary suspended:quorum
  disk:UpToDate quorum:no blocked:upper
  master20.host connection:Connecting
  node51.host connection:Connecting
root@master20:~# kubectl  -n piraeus-datastore exec ha-controller-bb47w -- drbdadm secondary --force pvc-2028ef6c-82a1-4e7f-8bdf-4179cee1bbd9
no resources defined!
command terminated with exit code 1

@WanzenBug
Member

Sorry, should have been drbdsetup secondary --force
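
Applied to the command above, that would look roughly like this (same pod and resource name as before, just with drbdsetup in place of drbdadm):

root@master20:~# kubectl -n piraeus-datastore exec ha-controller-bb47w -- drbdsetup secondary --force pvc-2028ef6c-82a1-4e7f-8bdf-4179cee1bbd9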

@yeshl
Author

yeshl commented Apr 27, 2024

Can the reconnection and recovery be done automatically?
