Problem
We find that the cass operator might fail to update the LastServerNodeStarted field of the CassandraDatacenter CR if it crashes in the middle of labelServerPodStarting() and then restarts, which in turn affects the cooldownTime.
labelServerPodStarting() is only invoked when the cassandra.datastax.com/node-state label of the Cassandra pod is Ready-to-Start (inside startOneNodePerRack):
func (rc *ReconciliationContext) startOneNodePerRack(...) (string, error) {
	...
	for _, pod := range rc.dcPods {
		...
		if !isServerReadyToStart(pod) || !mgmtApiUp {
			continue
		}
		...
		// startCassandra calls labelServerPodStarting
		if err := rc.startCassandra(endpointData, pod); err != nil {
			return "", err
		}
		...
	}
	...
}
Inside labelServerPodStarting(), the cass operator does two things:
1. update the cassandra.datastax.com/node-state label of the Cassandra pod to Starting
2. update the LastServerNodeStarted of the CassandraDatacenter CR with the current timestamp
If the cass operator crashes between steps 1 and 2, it leaves an intermediate state where the cassandra.datastax.com/node-state label of the Cassandra pod is Starting but LastServerNodeStarted is not set yet. Later, when the operator restarts and enters startOneNodePerRack() again, it finds that the label of the Cassandra pod is not Ready-to-Start, so it does not invoke labelServerPodStarting() again. LastServerNodeStarted therefore remains unset, which can further affect the computation of cooldownTime.
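The crash window described above can be reproduced with a small simulation. This is only a sketch under simplifying assumptions: the `pod` and `datacenter` types, the `reconcile` function, and the `crashAfterLabel` flag are hypothetical stand-ins, not the operator's real types or signatures, and the timestamp is modeled as a plain string.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the pod label and the CR status field.
type pod struct{ nodeState string }
type datacenter struct{ lastServerNodeStarted string } // "" means unset

const (
	stateReadyToStart = "Ready-to-Start"
	stateStarting     = "Starting"
)

// labelServerPodStarting mimics the two separate writes from the issue.
// crashAfterLabel simulates the operator dying between step 1 and step 2.
func labelServerPodStarting(p *pod, dc *datacenter, now string, crashAfterLabel bool) {
	p.nodeState = stateStarting // step 1: update the pod label
	if crashAfterLabel {
		return // the operator crashes here; step 2 never runs
	}
	dc.lastServerNodeStarted = now // step 2: update the CR timestamp
}

// reconcile mimics the guard in startOneNodePerRack: only pods labeled
// Ready-to-Start reach labelServerPodStarting.
func reconcile(p *pod, dc *datacenter, now string, crashAfterLabel bool) {
	if p.nodeState != stateReadyToStart {
		return
	}
	labelServerPodStarting(p, dc, now, crashAfterLabel)
}

func main() {
	p := &pod{nodeState: stateReadyToStart}
	dc := &datacenter{}

	reconcile(p, dc, "t1", true)  // first pass: crash between the two writes
	reconcile(p, dc, "t2", false) // after restart: the guard skips the pod

	fmt.Printf("label=%s lastServerNodeStarted=%q\n", p.nodeState, dc.lastServerNodeStarted)
	// prints: label=Starting lastServerNodeStarted=""
}
```

The second pass never reaches step 2 because the label written in step 1 already disqualifies the pod, which is the stuck state reported here.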
Fix
There are several possible ways to fix this. For example, we can update the if condition in startOneNodePerRack so that it also covers pods labeled Starting. We are willing to help fix the issue.
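One way the widened condition could look is sketched below. It is an illustration, not the operator's code: `needsStartBookkeeping`, `reconcile`, and the simplified `pod` and `datacenter` types are hypothetical, and the extra check that the timestamp is still unset is an assumption added here so that a pod that is legitimately Starting does not get its timestamp rewritten on every pass.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the pod label and the CR status field.
type pod struct{ nodeState string }
type datacenter struct{ lastServerNodeStarted string } // "" means unset

const (
	stateReadyToStart = "Ready-to-Start"
	stateStarting     = "Starting"
)

// needsStartBookkeeping is a hypothetical widened guard: it accepts
// Ready-to-Start pods, and also Starting pods whose timestamp write was lost.
func needsStartBookkeeping(p *pod, dc *datacenter) bool {
	switch p.nodeState {
	case stateReadyToStart:
		return true
	case stateStarting:
		return dc.lastServerNodeStarted == "" // assumption: unset means the write was lost
	default:
		return false
	}
}

// reconcile performs the two writes; crashAfterLabel simulates a crash
// between the label update and the timestamp update.
func reconcile(p *pod, dc *datacenter, now string, crashAfterLabel bool) {
	if !needsStartBookkeeping(p, dc) {
		return
	}
	p.nodeState = stateStarting // step 1 (a no-op if already Starting)
	if crashAfterLabel {
		return
	}
	dc.lastServerNodeStarted = now // step 2
}

func main() {
	p := &pod{nodeState: stateReadyToStart}
	dc := &datacenter{}

	reconcile(p, dc, "t1", true)  // crash between the two writes
	reconcile(p, dc, "t2", false) // after restart: the missing timestamp is repaired
	reconcile(p, dc, "t3", false) // later passes leave the timestamp alone

	fmt.Printf("label=%s lastServerNodeStarted=%q\n", p.nodeState, dc.lastServerNodeStarted)
	// prints: label=Starting lastServerNodeStarted="t2"
}
```

In the real operator LastServerNodeStarted is shared by the whole datacenter, so an unset check like this one would only cover the first node started; a real fix would need a per-pod way to tell that the second write was lost.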
┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: CASS-44
sync-by-unitobot changed the title from "K8SSAND-1362 ⁃ [BUG] cass operator fails to correctly update the LastServerNodeStarted timestamp if crash in the middle of labelServerPodStarting()" to "[BUG] cass operator fails to correctly update the LastServerNodeStarted timestamp if crash in the middle of labelServerPodStarting()" on Oct 11, 2024