Balaji632

Kubernetes Day2 SOP

Contributor

avinash-platformatory left a comment

  • Since this document is just for Confluent Platform Day 2 operations, it should be renamed accordingly. A generic Kubernetes SOP and Kafka workload-specific SOPs would be ideal so that the Kubernetes-specific SOP can be reused for any Kafka workload, not just Confluent.
  • Format the commands using code blocks with the language specified. This lets the operator copy the command with a button and formats the command based on the language specified. In most cases, it would be bash.
    Example -
3. Check pod health across namespaces. No pods should be stuck in CrashLoopBackOff or Pending state.
```bash
kubectl get pods --all-namespaces
```

would become
[Image: Screenshot_20250821_132448]

  • The document seems too generic and not specific to StreamTime. My suggestion would be to deploy a cluster and perform these activities through StreamTime and manually using kubectl, documenting the actions performed. I have not reviewed it thoroughly since it is too generic.

@@ -0,0 +1,521 @@
---
title: Kubernetes Day2 Standard Operating Procedures (SOP)
Contributor

Remove the title so that it does not show up in the navbar. We will still be able to access the page using /operations/day2sop.html

Author

Done.

---
title: Kubernetes Day2 Standard Operating Procedures (SOP)
nav_order: 4
layout: Operations
Contributor

It should be parent: Operations

Author

Done

**Rollback**
* Not applicable; this SOP is for verification only.

### 2. Namespace Creation & Resource Quota Adjustment
Contributor

Namespace Creation will not be part of Day2 operations since the namespaces would already be created during the creation of the cluster. Rollback mentions deleting the namespace - not sure about the use case where this is helpful from a Day2 perspective.

Author

Section deleted. It's not required for clusters created via fleet manager.

**Prerequisites**

* kubectl access with cluster-admin privileges
* Maintenance window approved in change management system
Contributor

Prerequisites should include adding a new node or ensuring sufficient capacity on the remaining nodes so that workloads on this node do not go unscheduled. If the workload is a Kafka pod, just checking for capacity might not suffice due to pod anti-affinity rules.
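
As a rough illustration of the checks this prerequisite implies, the sketch below wraps three standard kubectl queries in shell functions. The function names and the example arguments are hypothetical and not taken from the SOP; treat it as a starting point under those assumptions rather than the agreed procedure.

```bash
# Sketch only: function names and arguments are illustrative placeholders.

# List the pods currently scheduled on the node that is about to be drained.
pods_on_node() {
  local node="$1"
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=${node}"
}

# Print the pod anti-affinity rules of a pod (empty output means none are set).
pod_anti_affinity() {
  local namespace="$1" pod="$2"
  kubectl get pod "${pod}" -n "${namespace}" -o jsonpath='{.spec.affinity.podAntiAffinity}'
}

# Show allocatable CPU and memory per node to judge remaining capacity.
node_allocatable() {
  kubectl get nodes -o custom-columns='NODE:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory'
}
```

For example, `pods_on_node <node-name>` followed by `pod_anti_affinity <cfk-namespace> <broker-pod>` would show whether a broker on the node can be rescheduled onto a node that does not already host another broker.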

Author

Updated section.

**Procedure**
1. Identify broker pods:
```$> kubectl get pods -n <cfk-namespace\> -l app=kafka```
2. Restart a broker (one at a time):
Contributor

If the goal is to restart all brokers sequentially, doing a statefulset rollout is a better option since the rollout will be controlled by Kubernetes and not done manually.
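
A minimal sketch of what such a rollout could look like is shown here. It assumes the brokers run in a StatefulSet named `kafka` (a hypothetical default; check with `kubectl get statefulset -n <cfk-namespace>`), and the function name and the 30-minute timeout are illustrative choices.

```bash
# Sketch only: the StatefulSet name "kafka" and the timeout are assumptions.

# Trigger a controlled rolling restart and wait until every broker pod is ready again.
rolling_restart_brokers() {
  local namespace="$1" statefulset="${2:-kafka}"
  kubectl rollout restart "statefulset/${statefulset}" -n "${namespace}" &&
    kubectl rollout status "statefulset/${statefulset}" -n "${namespace}" --timeout=30m
}
```

Calling `rolling_restart_brokers <cfk-namespace>` lets the StatefulSet controller replace the pods one at a time, waiting for each to become ready before moving on, instead of the operator deleting pods by hand.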

Author

Updated

1. Create or update Kubernetes Secret:
```kubectl create secret generic <secret-name> --from-literal=username=<user> --from-literal=password=<password> -n <namespace>```
2. Patch the deployment or StatefulSet to mount the updated secret.
3. Restart affected pods if they don’t pick up new secrets automatically.
Contributor

No need to restart pods for adding a new SASL/PLAIN user in CFK operator deployed clusters.

Author

Updated

* Destroy the cluster using the same tool (terraform destroy, eksctl delete cluster, etc.) if provisioning fails.


### 10. Credential Addition
Contributor

This section needs to be specific to the cluster type and authentication mechanism i.e., adding a SASL/PLAIN user for CFK kafka is different from adding a basic auth user for CFK Schema Registry and is different from adding SASL/SCRAM users for redpanda or any other cluster type or authentication mechanism. This is too generic and not helpful.

Author

Updated

1. Review release notes for the target CFK version (check compatibility with Kubernetes and CP)
2. Backup existing CFK configuration:
```$> kubectl get confluent -n <namespace> -o yaml > cfk-all-backup.yaml```
3. Upgrade the CFK Operator Helm chart or manifest:
Contributor

Operator upgrades involve more steps, such as stopping reconciliation before the upgrade and resuming it afterwards. The Confluent documentation is the best resource for this. We will support this through StreamTime soon.

Author

Section updated for operator upgrades.

* For YAML manifests: apply the updated manifest
4. Monitor Operator logs to ensure no failures:
```kubectl logs -n <namespace> deploy/confluent-operator```
5. Sequentially upgrade platform components (Kafka, Zookeeper, Connect, SR, etc.) using the updated CRDs.
Contributor

This is best done through StreamTime but the steps for doing it manually should also be documented i.e., updating the image versions in the CRD, if it is not a major version upgrade.
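
One possible shape for the manual step is sketched below. It assumes the component custom resource exposes its images under `spec.image.application` and `spec.image.init`, which should be verified against the deployed CFK version; the function name and all arguments are hypothetical.

```bash
# Sketch only: the spec.image field layout is an assumption to verify against the CFK CRDs.

# Point one Confluent component at new application and init container images.
set_component_image() {
  local kind="$1" name="$2" namespace="$3" app_image="$4" init_image="$5"
  kubectl patch "${kind}" "${name}" -n "${namespace}" --type merge \
    -p "{\"spec\":{\"image\":{\"application\":\"${app_image}\",\"init\":\"${init_image}\"}}}"
}
```

A call such as `set_component_image kafka kafka <namespace> <application-image> <init-image>` changes only the image fields; the operator is then expected to roll the pods itself, so no manual pod deletion should be needed.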

Author

Section updated for operator upgrades.

**Procedure**
1. Download the updated operator manifest from Confluent’s repository.
2. Apply the updated manifest:
```$> kubectl apply -f <cfk-operator-manifest>.yaml```
Contributor

Upgrades should be done through Helm and follow the procedure documented in the Confluent documentation. Operator upgrades involve more steps, such as stopping reconciliation before the upgrade and resuming it afterwards. We will support this through StreamTime soon.
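
To make the extra steps concrete, here is a hedged sketch of the pause, upgrade, resume sequence. The annotation `platform.confluent.io/block-reconcile`, the chart name `confluentinc/confluent-for-kubernetes`, and the release name `confluent-operator` are recalled from the Confluent documentation and should be confirmed there before use; the function names are hypothetical, and the CRD update and version-specific steps from the official procedure are deliberately left out.

```bash
# Sketch only: annotation, chart, and release names are assumptions to confirm in the Confluent docs.

# Pause or resume operator reconciliation for a single Confluent custom resource.
set_block_reconcile() {
  local action="$1" kind="$2" name="$3" namespace="$4"
  case "${action}" in
    pause)
      kubectl annotate "${kind}" "${name}" -n "${namespace}" --overwrite \
        platform.confluent.io/block-reconcile=true ;;
    resume)
      # A trailing "-" removes the annotation.
      kubectl annotate "${kind}" "${name}" -n "${namespace}" \
        platform.confluent.io/block-reconcile- ;;
    *)
      echo "usage: set_block_reconcile pause|resume <kind> <name> <namespace>" >&2
      return 1 ;;
  esac
}

# Upgrade the operator release through Helm.
upgrade_cfk_operator() {
  local namespace="$1" release="${2:-confluent-operator}"
  helm repo update &&
    helm upgrade --install "${release}" confluentinc/confluent-for-kubernetes --namespace "${namespace}"
}
```

The intended order would be `set_block_reconcile pause` for every component resource, then `upgrade_cfk_operator <namespace>`, then `set_block_reconcile resume` for each resource once the new operator pod is healthy.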

Author

Section updated as per CFK documentation.
