Kubernetes Day2 SOP #2
- Since this document covers only Confluent Platform Day 2 operations, it should be renamed accordingly. Ideally there would be a generic Kubernetes SOP plus Kafka workload-specific SOPs, so that the Kubernetes SOP can be reused for any Kafka workload, not just Confluent.
- Format the commands using code blocks with the language specified. This lets the operator copy the command with a button and highlights the command according to the language specified. In most cases, it would be bash.
Example -
3. Check pod health across namespaces. No pods should be stuck in CrashLoopBackOff or Pending state.
```bash
kubectl get pods --all-namespaces
```
- The document seems too generic and not specific to StreamTime. My suggestion would be to deploy a cluster and perform these activities through StreamTime and manually using kubectl, documenting the actions performed. I have not reviewed it thoroughly since it is too generic.
operations/day2sop.md
Outdated
@@ -0,0 +1,521 @@
---
title: Kubernetes Day2 Standard Operating Procedures (SOP)
Remove the title so that it does not show up in the navbar. We will still be able to access the page at /operations/day2sop.html
Done.
operations/day2sop.md
Outdated
---
title: Kubernetes Day2 Standard Operating Procedures (SOP)
nav_order: 4
layout: Operations
It should be parent: Operations
Done
operations/day2sop.md
Outdated
**Rollback**
* Not applicable; this SOP is for verification only.

### 2. Namespace Creation & Resource Quota Adjustment
Namespace creation is not a Day 2 operation, since the namespaces are already created when the cluster is created. The Rollback step mentions deleting the namespace; I am not sure in which Day 2 use case that would be helpful.
Section deleted. It's not required for clusters created via fleet manager.
operations/day2sop.md
Outdated
**Prerequisites**

* kubectl access with cluster-admin privileges
* Maintenance window approved in change management system
The prerequisites should include adding a new node, or ensuring sufficient capacity on the existing nodes, so that workloads on this node do not go unscheduled. If the workload is a Kafka pod, checking capacity alone might not suffice because of pod anti-affinity rules.
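A minimal sketch of how this prerequisite check might look before draining a node. The function name `check_drain_readiness` and its two arguments (node name and namespace) are hypothetical, and the output still has to be interpreted by the operator; it only gathers the information needed to judge capacity and anti-affinity.

```bash
# Sketch only: collect the information needed to decide whether a node can be drained.
# check_drain_readiness is a hypothetical helper; <node> and <namespace> are supplied by the operator.
check_drain_readiness() {
  local node="$1" ns="$2"

  # Pods in the namespace that are currently scheduled on the node to be drained.
  kubectl get pods -n "$ns" -o wide --field-selector "spec.nodeName=${node}"

  # Allocated requests and limits per node, to judge spare capacity on the remaining nodes.
  kubectl describe nodes

  # Pod anti-affinity rules per pod; spare capacity alone is not enough if these
  # rules prevent two broker pods from sharing a node.
  kubectl get pods -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.affinity.podAntiAffinity}{"\n"}{end}'
}
```

For example, `check_drain_readiness <node-name> <cfk-namespace>` would be run before `kubectl drain`. If the anti-affinity rules require one broker per node, a replacement node has to be added first.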
Updated section.
**Procedure**
1. Identify broker pods:
```$> kubectl get pods -n <cfk-namespace\> -l app=kafka```
2. Restart a broker (one at a time):
If the goal is to restart all brokers sequentially, doing a statefulset rollout is a better option since the rollout will be controlled by Kubernetes and not done manually.
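A minimal sketch of the rollout-based restart, assuming the brokers run in a StatefulSet that uses the RollingUpdate strategy. The function name `rolling_restart_brokers`, its arguments, and the 30 minute timeout are illustrative choices, not values from the SOP.

```bash
# Sketch only: let Kubernetes restart the broker pods one at a time.
# rolling_restart_brokers is a hypothetical helper; pass the namespace and StatefulSet name.
rolling_restart_brokers() {
  local ns="$1" sts="$2"

  # Trigger a controlled rolling restart of every pod in the StatefulSet.
  kubectl rollout restart "statefulset/${sts}" -n "$ns" || return 1

  # Block until the rollout finishes or the timeout (an assumed value) expires.
  kubectl rollout status "statefulset/${sts}" -n "$ns" --timeout=30m
}
```

Usage would look like `rolling_restart_brokers <cfk-namespace> <statefulset-name>`. The StatefulSet name can be found with `kubectl get statefulset -n <cfk-namespace>`.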
Updated
operations/day2sop.md
Outdated
1. Create or update Kubernetes Secret:
```kubectl create secret generic <secret-name> --from-literal=username=<user> --from-literal=password=<password> -n <namespace>```
2. Patch the deployment or StatefulSet to mount the updated secret.
3. Restart affected pods if they don’t pick up new secrets automatically.
No need to restart pods for adding a new SASL/PLAIN user in CFK operator deployed clusters.
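A minimal sketch of updating the user list in place. It assumes the SASL/PLAIN users are kept in a Secret under the `plain-users.json` key, as I understand the CFK documentation to describe; verify the key and file format against your deployment. The function name `update_plain_users` and its arguments are hypothetical.

```bash
# Sketch only: replace the SASL/PLAIN user list held in an existing Secret.
# Assumption: the Secret stores the users under the key plain-users.json.
# update_plain_users is a hypothetical helper.
update_plain_users() {
  local ns="$1" secret="$2" users_file="$3"

  # Render the Secret manifest locally, then apply it so an existing Secret is updated
  # rather than failing with "already exists".
  kubectl create secret generic "$secret" -n "$ns" \
    --from-file=plain-users.json="$users_file" \
    --dry-run=client -o yaml | kubectl apply -f -
}
```

Usage would look like `update_plain_users <namespace> <secret-name> ./plain-users.json`, with no pod restart afterwards.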
Updated
operations/day2sop.md
Outdated
* Destroy the cluster using the same tool (terraform destroy, eksctl delete cluster, etc.) if provisioning fails.

### 10. Credential Addition
This section needs to be specific to the cluster type and authentication mechanism. For example, adding a SASL/PLAIN user for CFK Kafka is different from adding a basic auth user for CFK Schema Registry, which in turn is different from adding SASL/SCRAM users for Redpanda or any other cluster type or authentication mechanism. As written, it is too generic to be helpful.
Updated
operations/day2sop.md
Outdated
1. Review release notes for the target CFK version (check compatibility with Kubernetes and CP)
2. Backup existing CFK configuration:
```$> kubectl get confluent -n <namespace> -o yaml > cfk-all-backup.yaml```
3. Upgrade the CFK Operator Helm chart or manifest:
Operator upgrades involve more steps, such as stopping reconciliation before the upgrade and resuming it afterwards. The Confluent documentation is the best resource for this. We will support this through StreamTime soon.
Section updated for operator upgrades.
operations/day2sop.md
Outdated
* For YAML manifests: apply the updated manifest
4. Monitor Operator logs to ensure no failures:
```kubectl logs -n <namespace> deploy/confluent-operator```
5. Sequentially upgrade platform components (Kafka, Zookeeper, Connect, SR, etc.) using the updated CRDs.
This is best done through StreamTime but the steps for doing it manually should also be documented i.e., updating the image versions in the CRD, if it is not a major version upgrade.
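A minimal sketch of the manual path, assuming the Kafka custom resource exposes its images under `spec.image.application` and `spec.image.init`. The function name `set_kafka_images` and its arguments are hypothetical, and the image references must come from the Confluent release notes for the target version.

```bash
# Sketch only: point a CFK Kafka custom resource at new image versions.
# Assumption: images live under spec.image.application and spec.image.init.
# set_kafka_images is a hypothetical helper.
set_kafka_images() {
  local ns="$1" name="$2" app_image="$3" init_image="$4"

  # A merge patch changes only the two image fields and leaves the rest of the spec untouched.
  kubectl patch kafka "$name" -n "$ns" --type merge \
    -p "{\"spec\":{\"image\":{\"application\":\"${app_image}\",\"init\":\"${init_image}\"}}}"
}
```

Usage would look like `set_kafka_images <namespace> <kafka-cr-name> <application-image> <init-image>`. The operator then rolls the pods to the new images, and the same pattern would be repeated for each component in turn.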
Section updated for operator upgrades.
operations/day2sop.md
Outdated
**Procedure**
1. Download the updated operator manifest from Confluent’s repository.
2. Apply the updated manifest:
```$> kubectl apply -f <cfk-operator-manifest>.yaml```
Upgrades should be done through Helm and follow the procedure in the Confluent documentation. Operator upgrades involve more steps, such as stopping reconciliation before the upgrade and resuming it afterwards. We will support this through StreamTime soon.
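A rough sketch of the overall shape of such an upgrade, assuming reconciliation is paused with the `platform.confluent.io/block-reconcile` annotation as I understand the Confluent documentation to describe. The function name `upgrade_cfk_operator` and its arguments are hypothetical, and steps such as updating the CRDs and checking version compatibility are deliberately left out; the Confluent procedure remains the authority.

```bash
# Sketch only: pause reconciliation, upgrade the operator with Helm, resume reconciliation.
# Assumption: the platform.confluent.io/block-reconcile annotation pauses reconciliation.
# upgrade_cfk_operator is a hypothetical helper; CRD updates are NOT covered here.
upgrade_cfk_operator() {
  local ns="$1" release="$2" chart="$3"
  shift 3
  local cr

  # Pause reconciliation on each custom resource, passed as TYPE/NAME (e.g. kafka/<name>).
  for cr in "$@"; do
    kubectl annotate "$cr" -n "$ns" platform.confluent.io/block-reconcile=true --overwrite || return 1
  done

  # Upgrade the operator release to the chart version currently available in the Helm repo.
  helm upgrade "$release" "$chart" -n "$ns" || return 1

  # Resume reconciliation by removing the annotation (trailing "-" deletes the key).
  for cr in "$@"; do
    kubectl annotate "$cr" -n "$ns" platform.confluent.io/block-reconcile- || return 1
  done
}
```

Usage would look like `upgrade_cfk_operator <namespace> <release-name> <chart> kafka/<name> zookeeper/<name>`, listing every custom resource the operator manages.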
Section updated as per CFK documentation.