Description
Hi,
I'm using clickhouse-operator version 0.24.0 and I've encountered the following issue:
When applying a new change to a ClickHouseKeeper cluster, the operator does not ensure that a restarted ClickHouseKeeper pod is running before it proceeds with restarting the next pod (even though the previous one is still being created).
Let's look at the status of the pods after I changed the ClickHouseKeeperInstallation:
The cluster is applying the new change:
NAME READY STATUS RESTARTS AGE
chk-extended-cluster1-0-0-0 1/1 Terminating 0 6m10s
chk-extended-cluster1-0-1-0 1/1 Running 0 6m10s
chk-extended-cluster1-0-2-0 1/1 Running 0 5m25s
(...)
NAME READY STATUS RESTARTS AGE
chk-extended-cluster1-0-0-0 0/1 ContainerCreating 0 65s
chk-extended-cluster1-0-1-0 1/1 Running 0 7m19s
chk-extended-cluster1-0-2-0 1/1 Running 0 6m34s
(...)
NAME READY STATUS RESTARTS AGE
chk-extended-cluster1-0-0-0 1/1 Running 0 78s
chk-extended-cluster1-0-1-0 1/1 Running 0 7m32s
chk-extended-cluster1-0-2-0 1/1 Running 0 6m47s
So far so good, but let's see what happens next:
NAME READY STATUS RESTARTS AGE
chk-extended-cluster1-0-0-0 1/1 Running 0 81s
chk-extended-cluster1-0-1-0 1/1 Terminating 0 7m35s
chk-extended-cluster1-0-2-0 1/1 Running 0 6m50s
(...)
NAME READY STATUS RESTARTS AGE
chk-extended-cluster1-0-0-0 1/1 Running 0 82s
chk-extended-cluster1-0-2-0 1/1 Running 0 6m51s
And here is the problem:
NAME READY STATUS RESTARTS AGE
chk-extended-cluster1-0-0-0 1/1 Running 0 86s
chk-extended-cluster1-0-1-0 0/1 ContainerCreating 0 1s
chk-extended-cluster1-0-2-0 1/1 Terminating 0 6m55s
As you can see, pod cluster1-0-1-0 is still in the ContainerCreating state, but the operator has already decided to terminate pod cluster1-0-2-0.
This caused the cluster to lose quorum for a short time, which ClickHouse did not like, resulting in the following error:
error": "(CreateMemoryTableQueryOnCluster) Error when executing query: code: 999, message: All connection tries failed while connecting to ZooKeeper. nodes: 10.233.71.16:9181, 10.233.81.20:9181, 10.233.70.35:9181\nCode: 999. Coordination::Exception: Keeper server rejected the connection during the handshake. Possibly it's overloaded, doesn't see leader or is stale: while receiving handshake from ZooKeeper. (KEEPER_EXCEPTION) (version 24.8.2.3 (official build)), 10.233.71.16:9181\nPoco::Exception. Code: 1000, e.code() = 111, Connection refused (version 24.8.2.3 (official build))
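The quorum loss can be checked with simple arithmetic. The sketch below is only an illustration: it assumes the usual majority rule (more than half of the Keeper nodes must be available) and treats a Terminating pod as unavailable, which is my assumption rather than something taken from the operator. The helper name `has_quorum` is hypothetical.

```python
def has_quorum(ready: int, total: int) -> bool:
    """Return True if strictly more than half of the ensemble is ready."""
    return ready > total // 2


# Pod state from the last listing above. Assumption: a pod that is
# Terminating or ContainerCreating does not count towards quorum.
ready_pods = {
    "chk-extended-cluster1-0-0-0": True,   # Running
    "chk-extended-cluster1-0-1-0": False,  # ContainerCreating
    "chk-extended-cluster1-0-2-0": False,  # Terminating
}

ready = sum(ready_pods.values())
print(has_quorum(ready, len(ready_pods)))  # False: 1 of 3 is not a majority
```

With three nodes, at most one pod can be down at any time, so restarting 0-2-0 before 0-1-0 is ready leaves a single node and no majority.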
I was expecting the ClickHouse Keeper cluster to apply new changes without any disruption to the ClickHouse cluster.
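To make the expected behavior concrete, here is a minimal sketch of a rolling restart that waits for each pod to become ready before touching the next one. This is not the operator's code: `rolling_restart`, `restart`, and `is_ready` are hypothetical names, and the callbacks are simulated in memory instead of calling the Kubernetes API.

```python
import time
from typing import Callable, Iterable


def rolling_restart(
    pods: Iterable[str],
    restart: Callable[[str], None],
    is_ready: Callable[[str], bool],
    timeout_s: float = 300.0,
    poll_s: float = 1.0,
) -> None:
    """Restart pods one at a time, waiting for readiness in between."""
    for pod in pods:
        restart(pod)
        deadline = time.monotonic() + timeout_s
        # Do not move on to the next pod until this one is ready again.
        while not is_ready(pod):
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{pod} did not become ready; stopping rollout")
            time.sleep(poll_s)


# Simulated cluster: each pod reports "not ready" twice after a restart.
events = []
pending = {}


def fake_restart(pod: str) -> None:
    events.append(("restart", pod))
    pending[pod] = 2


def fake_is_ready(pod: str) -> bool:
    if pending[pod] > 0:
        pending[pod] -= 1
        return False
    events.append(("ready", pod))
    return True


pods = [
    "chk-extended-cluster1-0-0-0",
    "chk-extended-cluster1-0-1-0",
    "chk-extended-cluster1-0-2-0",
]
rolling_restart(pods, fake_restart, fake_is_ready, poll_s=0)
for event in events:
    print(event)
```

In this ordering every "restart" event is followed by a "ready" event for the same pod before the next restart begins, so at most one of the three Keeper pods is down at a time.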