[BUG] cass operator fails to correctly update the LastServerNodeStarted timestamp if crash in the middle of labelServerPodStarting() #306

sync-by-unito · 2022-03-29T22:01:50Z

Problem

We find that the cass operator can fail to update the LastServerNodeStarted field of the CassandraDatacenter CR if it crashes in the middle of labelServerPodStarting() and then restarts, which in turn affects the cooldownTime computation.

The labelServerPodStarting() is only invoked when the cassandra.datastax.com/node-state label of the Cassandra pod is Ready-to-Start (inside startOneNodePerRack):

func (rc *ReconciliationContext) startOneNodePerRack(...) (string, error) {
	...
	for _, pod := range rc.dcPods {
		...
		if !isServerReadyToStart(pod) || !mgmtApiUp {
			continue
		}
		...
		// startCassandra calls labelServerPodStarting
		if err := rc.startCassandra(endpointData, pod); err != nil {
			return "", err
		}
		...
	}
	...
}

Inside labelServerPodStarting(), the cass operator does two things:

1. update the cassandra.datastax.com/node-state label of the Cassandra pod to Starting
2. update the LastServerNodeStarted of the CassandraDatacenter CR with the current timestamp

as the code shows:

func (rc *ReconciliationContext) labelServerPodStarting(pod *corev1.Pod) error {
	...
	pod.Labels[api.CassNodeState] = stateStarting
	err := rc.Client.Patch(ctx, pod, podPatch)
	...
	dc.Status.LastServerNodeStarted = metav1.Now()
	err = rc.Client.Status().Patch(rc.Ctx, dc, statusPatch)
	...
}

If the cass operator crashes between the two updates (after the pod label patch but before the status patch), it leaves an intermediate state in which the cassandra.datastax.com/node-state label of the Cassandra pod is Starting but LastServerNodeStarted has not been set. When the operator restarts and enters startOneNodePerRack() again, it finds that the pod's label is no longer Ready-to-Start and therefore does not invoke labelServerPodStarting() again. LastServerNodeStarted remains unset, which can in turn affect the cooldownTime computation.
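
To make the crash window concrete, here is a minimal, self-contained sketch of the sequence. It does not use the operator's real types: fakePod, fakeDC, reconcile and the crashBetween flag are hypothetical stand-ins, and the crash is modelled as an early error return rather than a real process exit.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical stand-ins for the pod label values used by the operator.
const (
	stateReadyToStart = "Ready-to-Start"
	stateStarting     = "Starting"
)

// fakePod and fakeDC are illustrative only; they are not cass-operator types.
type fakePod struct{ nodeState string }

type fakeDC struct{ lastServerNodeStarted time.Time }

var errCrash = errors.New("simulated operator crash")

// labelServerPodStarting mirrors the two writes described in the issue.
// crashBetween simulates the operator dying after the first write.
func labelServerPodStarting(pod *fakePod, dc *fakeDC, crashBetween bool) error {
	pod.nodeState = stateStarting // write 1: pod label
	if crashBetween {
		return errCrash
	}
	dc.lastServerNodeStarted = time.Now() // write 2: CR status
	return nil
}

// reconcile mirrors the guard in startOneNodePerRack: only pods labelled
// Ready-to-Start reach labelServerPodStarting.
func reconcile(pod *fakePod, dc *fakeDC, crashBetween bool) error {
	if pod.nodeState != stateReadyToStart {
		return nil // pod is skipped
	}
	return labelServerPodStarting(pod, dc, crashBetween)
}

func main() {
	pod := &fakePod{nodeState: stateReadyToStart}
	dc := &fakeDC{}

	_ = reconcile(pod, dc, true)  // first pass: crash between the two writes
	_ = reconcile(pod, dc, false) // after restart: the guard skips the pod

	// The label was updated but the timestamp was never written.
	fmt.Println(pod.nodeState, dc.lastServerNodeStarted.IsZero())
}
```

Under these assumptions the second pass never reaches the status update, so the timestamp stays at its zero value however many times reconciliation runs.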

Fix

There are multiple potential fixes. For example, we can update the if condition in startOneNodePerRack to also include pods in the Starting state. We are willing to help fix the issue.

┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: CASS-44


adejanovski added the zh:Icebox Issues in the ZenHub pipeline 'Icebox' label Jul 26, 2022

adejanovski moved this to To Groom in K8ssandra Nov 8, 2022

adejanovski added this to K8ssandra Nov 8, 2022

adejanovski moved this from To Groom to Icebox in K8ssandra Apr 27, 2023

adejanovski added the help-wanted Extra attention is needed label Jan 24, 2024
