Kubernetes Day2 SOP #2

Balaji632 · 2025-08-21T06:00:32Z

Kubernetes Day2 SOP

avinash-platformatory

Since this document covers only Confluent Platform Day 2 operations, it should be renamed accordingly. Ideally there would be a generic Kubernetes SOP plus Kafka workload-specific SOPs, so that the Kubernetes SOP can be reused for any Kafka workload, not just Confluent.
Format the commands using code blocks with the language specified. This lets the operator copy the command with a button and highlights the command according to the specified language. In most cases, it would be bash.
Example -

3. Check pod health across namespaces. No pods should be stuck in CrashLoopBackOff or Pending state.
```bash
kubectl get pods --all-namespaces
```

would become

The document seems too generic and not specific to StreamTime. My suggestion would be to deploy a cluster and perform these activities through StreamTime and manually using kubectl, documenting the actions performed. I have not reviewed it thoroughly since it is too generic.
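
As an illustration of a more specific health check, the pod listing can be filtered down to the two problem states named in the step. This is a minimal sketch: `unhealthy_pods` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods --all-namespaces --no-headers` (NAMESPACE, NAME, READY, STATUS, ...).

```bash
#!/usr/bin/env bash
# Print only pods whose STATUS column is CrashLoopBackOff or Pending.
# Reads the output of `kubectl get pods --all-namespaces --no-headers`
# on stdin; STATUS is assumed to be the fourth column.
unhealthy_pods() {
  awk '$4 == "CrashLoopBackOff" || $4 == "Pending" { print $1 "/" $2 " " $4 }'
}

# Usage (requires cluster access):
#   kubectl get pods --all-namespaces --no-headers | unhealthy_pods
```

Empty output means no pod is in either state, which gives the operator a clear pass/fail signal.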

avinash-platformatory · 2025-08-21T07:51:27Z

operations/day2sop.md

@@ -0,0 +1,521 @@
+---
+title: Kubernetes Day2 Standard Operating Procedures (SOP)


Remove the title so that the page does not show up in the navbar. We will still be able to access it at /operations/day2sop.html.

avinash-platformatory · 2025-08-21T07:52:15Z

operations/day2sop.md

+---
+title: Kubernetes Day2 Standard Operating Procedures (SOP)
+nav_order: 4
+layout: Operations


It should be `parent: Operations`.

avinash-platformatory · 2025-08-21T09:11:34Z

operations/day2sop.md

+**Rollback**
+* Not applicable; this SOP is for verification only.
+
+### 2. Namespace Creation & Resource Quota Adjustment


Namespace Creation will not be part of Day2 operations since the namespaces would already be created during the creation of the cluster. Rollback mentions deleting the namespace - not sure about the use case where this is helpful from a Day2 perspective.

Section deleted. It's not required for clusters created via fleet manager.

avinash-platformatory · 2025-08-21T09:39:59Z

operations/day2sop.md

+**Prerequisites**
+
+* kubectl access with cluster-admin privileges  
+* Maintenance window approved in change management system


The prerequisites should include adding a new node or ensuring sufficient capacity on the existing nodes so that workloads on this node do not go unscheduled. If the workload is a Kafka pod, checking capacity alone might not suffice because of pod anti-affinity rules.
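
The checks described above could be sketched roughly as follows. The function names are hypothetical helpers, not part of the SOP; the `kubectl` calls themselves are standard.

```bash
#!/usr/bin/env bash
# List every pod scheduled on the node that is about to be drained.
pods_on_node() {
  local node="$1"
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=${node}"
}

# Show the pod anti-affinity rules of one pod (empty output means none are set).
pod_anti_affinity() {
  local namespace="$1" pod="$2"
  kubectl get pod "$pod" -n "$namespace" -o jsonpath='{.spec.affinity.podAntiAffinity}'
}

# Show requested vs. allocatable resources on a candidate target node.
node_allocation() {
  local node="$1"
  kubectl describe node "$node" | grep -A 8 'Allocated resources'
}

# Usage (requires cluster access):
#   pods_on_node <node-name>
#   pod_anti_affinity <namespace> <kafka-pod>
#   node_allocation <other-node-name>
```

If the anti-affinity rules require one broker per node, spare CPU and memory on the remaining nodes is not enough and a replacement node has to be added before draining.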

Updated section.

avinash-platformatory · 2025-08-21T09:40:58Z

operations/day2sop.md

+**Procedure**
+1. Identify broker pods:  
+   ```$> kubectl get pods -n <cfk-namespace\> -l app=kafka```
+2. Restart a broker (one at a time):  


If the goal is to restart all brokers sequentially, a StatefulSet rollout is a better option, since the rollout is then controlled by Kubernetes rather than done manually.
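
A minimal sketch of such a rollout, assuming the brokers run in a StatefulSet whose name is passed in by the operator. `rolling_restart` is a hypothetical helper name and the 30-minute default timeout is an arbitrary illustration value.

```bash
#!/usr/bin/env bash
# Restart all pods of a StatefulSet via a controlled rollout and wait
# until the rollout has finished (or the timeout expires).
rolling_restart() {
  local namespace="$1" statefulset="$2" timeout="${3:-30m}"
  kubectl rollout restart "statefulset/${statefulset}" -n "$namespace" &&
    kubectl rollout status "statefulset/${statefulset}" -n "$namespace" --timeout="$timeout"
}

# Usage (requires cluster access):
#   rolling_restart <cfk-namespace> <statefulset-name>
```

With the default RollingUpdate strategy, Kubernetes replaces one pod at a time and waits for it to become Ready before moving on, which is the sequencing the manual procedure tries to reproduce by hand.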

avinash-platformatory · 2025-08-21T09:49:49Z

operations/day2sop.md

+1. Create or update Kubernetes Secret:
+```kubectl create secret generic <secret-name> --from-literal=username=<user> --from-literal=password=<password> -n <namespace>``` 
+2. Patch the deployment or StatefulSet to mount the updated secret.  
+3. Restart affected pods if they don’t pick up new secrets automatically.


No need to restart pods for adding a new SASL/PLAIN user in CFK operator deployed clusters.
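
For reference, updating the credentials Secret in place could look roughly like the sketch below. It assumes the broker-side user list lives under the Secret key `plain-users.json` and that the Secret name and users file are supplied by the operator; both should be checked against the CFK documentation for the deployed version. `update_plain_users` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Re-create the credentials Secret from a local users file and apply it in place.
# ASSUMPTION: the Secret stores the broker-side user list under the key
# "plain-users.json"; verify the key name for your CFK version.
update_plain_users() {
  local namespace="$1" secret="$2" users_file="$3"
  kubectl create secret generic "$secret" -n "$namespace" \
    --from-file="plain-users.json=${users_file}" \
    --dry-run=client -o yaml |
    kubectl apply -f -
}

# Usage (requires cluster access):
#   update_plain_users <namespace> <secret-name> ./plain-users.json
```

The `--dry-run=client -o yaml | kubectl apply -f -` idiom updates an existing Secret without deleting it first, so there is no window in which the Secret is missing.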

avinash-platformatory · 2025-08-21T09:51:26Z

operations/day2sop.md

+* Destroy the cluster using the same tool (terraform destroy, eksctl delete cluster, etc.) if provisioning fails.
+
+
+### 10. Credential Addition


This section needs to be specific to the cluster type and authentication mechanism, i.e., adding a SASL/PLAIN user for CFK Kafka is different from adding a basic auth user for CFK Schema Registry, which in turn is different from adding SASL/SCRAM users for Redpanda or any other cluster type or authentication mechanism. As written, it is too generic to be helpful.

avinash-platformatory · 2025-08-21T09:53:14Z

operations/day2sop.md

+1. Review release notes for the target CFK version (check compatibility with Kubernetes and CP)
+2. Backup existing CFK configuration:
+```$> kubectl get confluent -n <namespace> -o yaml > cfk-all-backup.yaml```
+3. Upgrade the CFK Operator Helm chart or manifest:  


Operator upgrades involve more steps, such as stopping reconciliation before the upgrade and resuming it afterwards. The Confluent documentation is the best resource for this. We will support this through StreamTime soon.

Section updated for operator upgrades.

avinash-platformatory · 2025-08-21T09:54:37Z

operations/day2sop.md

+   * For YAML manifests: apply the updated manifest
+4. Monitor Operator logs to ensure no failures:
+```kubectl logs -n <namespace> deploy/confluent-operator```
+5. Sequentially upgrade platform components (Kafka, Zookeeper, Connect, SR, etc.) using the updated CRDs.


This is best done through StreamTime but the steps for doing it manually should also be documented i.e., updating the image versions in the CRD, if it is not a major version upgrade.
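
The manual path could be sketched as a merge patch on the component's custom resource. This assumes the image is set at `spec.image.application` on the CR; the init container image at `spec.image.init` usually needs the same change. `set_component_image` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Point a Confluent component CR at a new application image.
# ASSUMPTION: the CR exposes the image at spec.image.application.
set_component_image() {
  local namespace="$1" kind="$2" name="$3" image="$4"
  kubectl patch "$kind" "$name" -n "$namespace" --type merge \
    -p "{\"spec\":{\"image\":{\"application\":\"${image}\"}}}"
}

# Usage (requires cluster access):
#   set_component_image <namespace> kafka <kafka-cr-name> <image>:<tag>
```

The operator then rolls the component to the new image, so the components should be patched one at a time and each rollout verified before moving to the next.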

Section updated for operator upgrades.

avinash-platformatory · 2025-08-21T09:56:32Z

operations/day2sop.md

+**Procedure**
+1. Download the updated operator manifest from Confluent’s repository.  
+2. Apply the updated manifest:  
+```$> kubectl apply -f <cfk-operator-manifest>.yaml```


Upgrades should be done through Helm and follow the procedure documented in the Confluent documentation. Operator upgrades involve more steps, such as stopping reconciliation before the upgrade and resuming it afterwards. We will support this through StreamTime soon.
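
The pause and resume steps around the Helm upgrade could look roughly like this. It assumes CFK honors the `platform.confluent.io/block-reconcile` annotation on its custom resources, which should be confirmed against the Confluent upgrade guide for the target version. The helper names are hypothetical.

```bash
#!/usr/bin/env bash
# ASSUMPTION: CFK stops reconciling a custom resource that carries the
# "platform.confluent.io/block-reconcile=true" annotation.
pause_reconcile() {
  local namespace="$1" kind="$2" name="$3"
  kubectl annotate "$kind" "$name" -n "$namespace" \
    platform.confluent.io/block-reconcile=true --overwrite
}

# Remove the annotation (trailing "-") so the operator reconciles again.
resume_reconcile() {
  local namespace="$1" kind="$2" name="$3"
  kubectl annotate "$kind" "$name" -n "$namespace" \
    platform.confluent.io/block-reconcile-
}

# Usage (requires cluster access):
#   pause_reconcile <namespace> kafka <kafka-cr-name>
#   ... upgrade the operator with helm, following the Confluent documentation ...
#   resume_reconcile <namespace> kafka <kafka-cr-name>
```

Pausing has to be repeated for every Confluent custom resource in the namespace, not only Kafka, before the operator chart is upgraded.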

Section updated as per CFK documentation.

Kubernetes Day2 SOP

79e3919

Balaji632 requested review from Sathishkumar0404 and avinash-platformatory August 21, 2025 06:00

Balaji632 added 3 commits August 21, 2025 11:33

Minor correction with formatting. Removed Table of contents

a426c69

Corrected formatting

e7b0668

Corrected formatting in Cluster Health Verification

8ba58b5

avinash-platformatory requested changes Aug 21, 2025

Updated day2 sop with review comments.

4b76d45
