Hi,

We have a few SigNoz setups running on a self-hosted Kubernetes cluster. Overall it works nicely and we have no problems with SigNoz itself. We decided to verify that our setup tolerates a worker node failure. Sadly, the SigNoz pods that have an attached PV (clickhouse, zookeeper, alert manager, query service) get stuck in the Terminating state and are never recreated/started on a healthy node in the cluster; they basically wait until the "failed" node comes back. That might be 5 minutes (planned downtime), but it might also be 5 hours (an actual problem that takes time to solve), during which SigNoz may not collect any data sent to it from any endpoint...
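For context, this is roughly how we simulate the node failure (the exact method is approximate here; the point is that we take the node down uncleanly rather than draining it):

```sh
# Take a worker node down uncleanly (no cordon/drain), so Kubernetes
# only notices via the node going NotReady.
ssh <worker-node> 'sudo systemctl stop kubelet'
# (or simply power off the VM backing the node)

# After the eviction timeout the node is NotReady and its pods show
# up as Terminating:
kubectl get nodes
kubectl get pods -n <namespace> -o wide
```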
When checking from Rancher (all of the pods mentioned above show the same info), this is what we see in Events and Conditions:

[Rancher screenshot of the pod's Events and Conditions; not reproduced]
We don't have any other logs; `kubectl describe pod <pod name> -n <namespace>` doesn't say anything else.
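For completeness, these are roughly the commands we used while looking for more detail (standard kubectl, nothing SigNoz-specific); none of them show more than the events above:

```sh
# Standard inspection commands for a stuck pod.
kubectl describe pod <pod-name> -n <namespace>
kubectl get events -n <namespace> --sort-by=.lastTimestamp
kubectl get pod <pod-name> -n <namespace> -o yaml   # check status/finalizers
```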
How can we solve this issue? It seems that the pods with an attached PV can't be recreated for some reason. Other pods end up in a dual state (Running on a different worker node and Terminating on the "failed" node); those keep working fine and get sorted out once the "failed" node comes back.
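To be clear about what we are trying to avoid: we assume the manual escape hatch is a force delete along the lines of the sketch below, but we want the cluster to recover without an operator stepping in (and depending on what backs the PV, this alone may not even be enough):

```sh
# Force-remove a pod stuck in Terminating on an unreachable node so
# its controller can recreate it. This skips graceful shutdown, so it
# is only safe when the node is genuinely down; otherwise two pods
# could end up writing to the same volume.
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force
```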
We didn't really change anything in the YAML configuration (we only changed ports on the otel-collector, disabled the k8s-infra pod, and increased the size of the ClickHouse PV), so I won't paste the whole thing.
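The shape of our values override is roughly as follows (key names are approximate, quoted from memory rather than from the actual file):

```yaml
# Approximate shape of our Helm values override (names from memory,
# not verbatim from our setup).
k8s-infra:
  enabled: false        # we disabled the k8s-infra agent

clickhouse:
  persistence:
    size: 100Gi         # increased from the chart default

otelCollector:
  # we remapped some of the collector ports here (details omitted)
  ports: {}
```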
Thank you for any help.
Cheers