[BUG] cass operator fails to correctly update the LastServerNodeStarted timestamp if crash in the middle of labelServerPodStarting() #306

Open
sync-by-unito bot opened this issue Mar 29, 2022 · 0 comments
Labels
help-wanted Extra attention is needed zh:Icebox Issues in the ZenHub pipeline 'Icebox'

Problem

We found that the cass operator can fail to update the LastServerNodeStarted field of the CassandraDatacenter CR if it crashes in the middle of labelServerPodStarting() and then restarts, which in turn affects the cooldownTime.

labelServerPodStarting() is invoked only when the cassandra.datastax.com/node-state label of the Cassandra pod is Ready-to-Start (see the check inside startOneNodePerRack):

func (rc *ReconciliationContext) startOneNodePerRack(...) (string, error) {
	...
	for _, pod := range rc.dcPods {
		...
		if !isServerReadyToStart(pod) || !mgmtApiUp {
			continue
		}
		...
		// startCassandra calls labelServerPodStarting
		if err := rc.startCassandra(endpointData, pod); err != nil {
			return "", err
		}
		...
	}
	...
}

Inside labelServerPodStarting(), the cass operator does two things:

  1. update the cassandra.datastax.com/node-state label of the Cassandra pod to Starting
  2. update the LastServerNodeStarted of the CassandraDatacenter CR with the current timestamp

as the code shows:

func (rc *ReconciliationContext) labelServerPodStarting(pod *corev1.Pod) error {
	...
	pod.Labels[api.CassNodeState] = stateStarting
	err := rc.Client.Patch(ctx, pod, podPatch)
	...
	dc.Status.LastServerNodeStarted = metav1.Now()
	err = rc.Client.Status().Patch(rc.Ctx, dc, statusPatch)
	...
}

If the cass operator crashes between steps 1 and 2, it leaves an intermediate state in which the cassandra.datastax.com/node-state label of the Cassandra pod is Starting but LastServerNodeStarted has not been set. When the operator restarts and enters startOneNodePerRack() again, it finds that the pod's label is no longer Ready-to-Start, so it skips the pod and never invokes labelServerPodStarting() again. LastServerNodeStarted therefore remains unset, which can in turn affect the computation of cooldownTime.
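The sequence above can be reproduced with a minimal, self-contained simulation. This is only a sketch: the pod and datacenter structs, the crashBetween flag, and the reconcile function are hypothetical stand-ins for the real Kubernetes objects, the operator crash, and the guard in startOneNodePerRack. It does not use the operator's actual client calls.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	stateReadyToStart = "Ready-to-Start"
	stateStarting     = "Starting"
)

// pod and datacenter are simplified, hypothetical stand-ins for the
// Cassandra pod and the CassandraDatacenter CR.
type pod struct{ nodeState string }
type datacenter struct{ lastServerNodeStarted time.Time }

var errCrash = errors.New("simulated operator crash")

// labelServerPodStarting mirrors the two-step update from the issue.
// crashBetween simulates the operator dying after the label patch
// (step 1) but before the status patch (step 2).
func labelServerPodStarting(p *pod, dc *datacenter, crashBetween bool) error {
	p.nodeState = stateStarting // step 1: patch the pod label
	if crashBetween {
		return errCrash
	}
	dc.lastServerNodeStarted = time.Now() // step 2: patch the CR status
	return nil
}

// reconcile mirrors the guard in startOneNodePerRack: only pods labeled
// Ready-to-Start are handed to labelServerPodStarting. It reports whether
// the pod was processed.
func reconcile(p *pod, dc *datacenter, crashBetween bool) (bool, error) {
	if p.nodeState != stateReadyToStart {
		return false, nil
	}
	return true, labelServerPodStarting(p, dc, crashBetween)
}

func main() {
	p := &pod{nodeState: stateReadyToStart}
	dc := &datacenter{}

	// First pass: the operator crashes between step 1 and step 2.
	_, err := reconcile(p, dc, true)
	fmt.Println("first pass error:", err)

	// Second pass after restart: the pod is skipped because its label
	// is already Starting, so the timestamp is never written.
	processed, _ := reconcile(p, dc, false)
	fmt.Println("second pass processed pod:", processed)
	fmt.Println("label:", p.nodeState, "| timestamp unset:", dc.lastServerNodeStarted.IsZero())
}
```

In this simulation, the second pass skips the pod and the timestamp stays at its zero value, which is the stuck state described above.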

Fix

There are multiple potential ways to fix this. For example, the if condition in startOneNodePerRack could be updated to also cover pods whose label is Starting. We are willing to help fix the issue.

┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: CASS-44

@adejanovski adejanovski added the zh:Icebox Issues in the ZenHub pipeline 'Icebox' label Jul 26, 2022
@adejanovski adejanovski moved this to To Groom in K8ssandra Nov 8, 2022
@adejanovski adejanovski moved this from To Groom to Icebox in K8ssandra Apr 27, 2023
@adejanovski adejanovski added the help-wanted Extra attention is needed label Jan 24, 2024
@sync-by-unito sync-by-unito bot changed the title from "K8SSAND-1362 ⁃ [BUG] cass operator fails to correctly update the LastServerNodeStarted timestamp if crash in the middle of labelServerPodStarting()" to "[BUG] cass operator fails to correctly update the LastServerNodeStarted timestamp if crash in the middle of labelServerPodStarting()" Oct 11, 2024