Hi,

We have a few SigNoz setups running on a self-hosted Kubernetes cluster. Overall it works nicely and we have no problems with SigNoz itself. We decided to verify that our setup tolerates a worker node failure. Sadly, the SigNoz pods that have an attached PV (clickhouse, zookeeper, alert manager, query service) get stuck in the Terminating state and are never recreated/started on a healthy node in the cluster; they basically wait until the "failed" node comes back. That might be 5 minutes (planned downtime), but it might also be 5 hours (an actual problem that takes time to solve), during which SigNoz may not collect any data sent to it from any endpoint...
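For context, this is roughly how we simulate the node failure (the exact method is approximate here; the point is that we take the node down uncleanly rather than draining it):

```sh
# Take a worker node down uncleanly (no cordon/drain), so Kubernetes
# only notices via the node going NotReady.
ssh <worker-node> 'sudo systemctl stop kubelet'
# (or simply power off the VM backing the node)

# After the eviction timeout the node is NotReady and its pods show
# up as Terminating:
kubectl get nodes
kubectl get pods -n <namespace> -o wide
```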
When checking from Rancher (all of the pods mentioned above show the same info), this is what we see in Events and Conditions:

[Rancher screenshot of the pod's Events and Conditions; not reproduced]
We don't have any other logs; `kubectl describe pod <pod name> -n <namespace>` doesn't say anything else.
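For completeness, these are roughly the commands we used while looking for more detail (standard kubectl, nothing SigNoz-specific); none of them show more than the events above:

```sh
# Standard inspection commands for a stuck pod.
kubectl describe pod <pod-name> -n <namespace>
kubectl get events -n <namespace> --sort-by=.lastTimestamp
kubectl get pod <pod-name> -n <namespace> -o yaml   # check status/finalizers
```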
How can we solve this issue? It seems that the pods with an attached PV can't be recreated for some reason. Other pods end up in a dual state (Running on a different worker node and Terminating on the "failed" node); those keep working fine and get sorted out once the "failed" node comes back.
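To be clear about what we are trying to avoid: we assume the manual escape hatch is a force delete along the lines of the sketch below, but we want the cluster to recover without an operator stepping in (and depending on what backs the PV, this alone may not even be enough):

```sh
# Force-remove a pod stuck in Terminating on an unreachable node so
# its controller can recreate it. This skips graceful shutdown, so it
# is only safe when the node is genuinely down; otherwise two pods
# could end up writing to the same volume.
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force
```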
We didn't really change anything in the YAML configuration (we only changed ports on the otel-collector, disabled the k8s-infra pod, and increased the size of the ClickHouse PV), so I won't paste the whole thing.
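The shape of our values override is roughly as follows (key names are approximate, quoted from memory rather than from the actual file):

```yaml
# Approximate shape of our Helm values override (names from memory,
# not verbatim from our setup).
k8s-infra:
  enabled: false        # we disabled the k8s-infra agent

clickhouse:
  persistence:
    size: 100Gi         # increased from the chart default

otelCollector:
  # we remapped some of the collector ports here (details omitted)
  ports: {}
```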
Thank you for any help.
Cheers