From 79e3919613b4f14af82491780850c4fd1101c9d1 Mon Sep 17 00:00:00 2001 From: Balaji632 Date: Thu, 21 Aug 2025 11:29:46 +0530 Subject: [PATCH 1/5] Kubernetes Day2 SOP --- operations/day2sop.md | 542 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 542 insertions(+) create mode 100644 operations/day2sop.md diff --git a/operations/day2sop.md b/operations/day2sop.md new file mode 100644 index 0000000..5a2225a --- /dev/null +++ b/operations/day2sop.md @@ -0,0 +1,542 @@ +--- +title: Kubernetes Day2 Standard Operating Procedures (SOP) +nav_order: 4 +layout: Operations + +--- + +**Table of Contents** + +[ Introduction](#heading=) +[Scope](#heading=) +     [In Scope](#2.1-in-scope) +     [Out of Scope](#2.2-out-of-scope) + +[Day 2 Operations Overview](#heading=) +[Support Model & SLA5](#4.-support-model-&-sla) +     [Support Model Structure](#heading=) +     [Communication Channels](#heading=) +     [SLA Definitions](#heading=) +[Kubernetes Day 2 SOPs](#5.-kubernetes-day-2-sops) +     [1 Cluster Health Verification](#5.1-cluster-health-verification) +     [2 Namespace Creation & Resource Quota Adjustment7](#5.2-namespace-creation-&-resource-quota-adjustment) +     [5.3 Node Maintenance & Drain](#5.3-node-maintenance-&-drain) +     [5.4 Restart Kafka Brokers (CFK)](#5.4-restart-kafka-brokers-\(cfk\)) +     [5.5 Retrieve Logs (within 1 \- 2 hour SLA)](#5.5-retrieve-logs-\(within-1---2-hour-sla\)) +     [5.6 System Status Assessment](#5.6-system-status-assessment) +     [5.7 JMX Metrics Retrieval (Kafka & CFK)](#5.7-jmx-metrics-retrieval-\(kafka-&-cfk\)) +     [5.8 Enhanced / Extended Alerts Setup](#5.8-enhanced-/-extended-alerts-setup) +     [5.9 Cluster Creation](#5.9-cluster-creation) +     [5.10 Credential Addition](#5.10-credential-addition) +     [5.11 Confluent Platform (CP) Upgrades](#5.11-confluent-platform-\(cp\)-upgrades) +     [5.12 Operator Upgrades](#5.12-operator-upgrades) +     [5.13 Rollback Procedure](#5.13-rollback-procedure) +     [5.14 Download Logs](#5.14-download-logs) + +# Introduction + +This document serves as the Day 2 Operations Runbook for managing Kubernetes clusters deployed with Confluent for Kubernetes (CFK) in a production environment. It contains detailed Standard Operating Procedures (SOPs) for operational tasks, incident handling, and lifecycle management to ensure the cluster and Confluent Platform components remain secure, stable, and performant after initial deployment. + +**Purpose** + +* Provide a single reference for daily operational tasks, troubleshooting, and change management. +* Reduce operational risk by defining consistent processes for common maintenance and incident scenarios. +* Enable faster resolution times through documented step-by-step procedures and validation checks. + +**Audience** + +This runbook is intended for: + +* Site Reliability Engineers (SREs) +* DevOps Teams +* Incident Response Teams + +# Scope + +This runbook defines the operational processes, maintenance routines, and incident-handling procedures for Kubernetes and Confluent Platform in a production setting. It covers Day 2 operations, all activities performed after the initial cluster deployment and to ensure continuous availability, scalability, and security. 
+ +##   In Scope + +* **Kubernetes Day 2 Operations** + * Cluster maintenance and upgrades + * Node lifecycle management (add/remove/drain/maintenance) + * Namespace and resource quota adjustments + * Persistent volume management + * Network policy changes + * Monitoring, alerting, and logging setup + * Backup and restore procedures (*if applicable*) + * Security patching (*if applicable*) + + +* **CFK / Confluent Platform Day 2 Operations** + * Broker restart procedures + * System status assessment + * Log retrieval and retention + * JMX metrics collection + * Enhanced/extended alerting setup + * Cluster creation and BAU support + * Credential and RBAC management + * Confluent Platform upgrades + * Operator upgrades + * Connector lifecycle management + * Capacity planning and scaling + * Security updates for Kafka components +* **Incident Response** + * Severity classification (P1, BAU) + * Escalation and communication protocol + * On-call roster management +* **Change Management** + * Request submission, approval workflow + * Audit trail and rollback procedures + +##   Out of Scope + +* Initial design and deployment of Kubernetes clusters (Day 0/Day 1 activities) +* Application-level configuration or business logic changes +* Vendor-specific SLA enforcement outside the agreed operational scope +* Direct management of non-Kubernetes infrastructure + +# Day 2 Operations Overview + +**Definition** + +Day 2 operations refer to all post-deployment activities required to keep the Kubernetes cluster and Confluent Platform workloads running efficiently. Where Day 0 is planning and Day 1 is deployment, Day 2 is about ongoing care, feeding, and evolution of the system. + +**Key Objectives** + +* Maintain cluster health and application availability. +* Ensure security and compliance through patches, upgrades, and RBAC enforcement. +* Enable scalability as workloads and traffic grow. +* Detect and resolve incidents quickly to minimize downtime. +* Provide operational actions for audit purpose + +**Primary Activities in Day 2** + +* **Monitoring & Health Checks** – Continuous observation of cluster and Kafka components, with proactive remediation for anomalies. +* **Capacity & Performance Management** – Scaling nodes, adjusting resource quotas, and tuning workloads to prevent bottlenecks. +* **Incident Handling** – Classifying issues (P1, BAU), executing predefined SOPs, and ensuring communication to stakeholders. +* **Maintenance Tasks** – Controlled restarts, version upgrades, security patching, and operator updates. +* **Change Management** – Applying new configurations or features in a controlled and reversible manner. +* **Security Operations** – Managing credentials, RBAC roles, network policies, and security scans. +* **Backup & Recovery** – Ensuring business continuity through tested recovery procedures. + +# Support Model & SLA + +A well-defined Support Model ensures that incidents are handled efficiently, with clear escalation paths and measurable Service Level Agreements (SLAs) for different severity. This section outlines how the support team will operate, who responds, and how quickly issues are resolved. + +### **Support Model Structure** + +* **Engineering / Vendor Support** + * Addresses product defects, advanced performance tuning, and deep-root cause analysis. + * Works with Confluent or Kubernetes vendors as needed. 
+ +### **Communication Channels** + +* **Primary:** Incident Tracker +* **Secondary:** + * WhatsApp group + * Slack Channel + +### **SLA Definitions** + +| Severity | Description | Response Time | +| ----- | :---- | ----- | +| **Critical** | Complete outage or severe degradation impacting production with no workaround. | 30 minutes | +| **Others** | Functionality loss or performance impact with a workaround available | 60 mins \- 2 Hrs | + +* Engineering Team/Vendor acknowledges the incident in Incident Tracker within SLA response time. +* Keep the client updated at agreed communication intervals. + +# Kubernetes Day 2 SOPs + +This section documents step-by-step procedures for core Kubernetes operational tasks. Each SOP follows the structure: + +* **Purpose** – Why the task is needed. +* **Prerequisites** – Access, permissions, or tools required. +* **Procedure** – Step-by-step commands and actions. +* **Validation** – How to confirm successful execution. +* **Rollback** – How to revert changes if needed. + +### 1. Cluster Health Verification + +**Purpose** + To confirm that the Kubernetes cluster is functioning correctly, all nodes are ready, and workloads are running as expected. + +**Prerequisites** +* kubectl access with cluster-admin privileges +* Access to monitoring tools (Prometheus/Grafana or equivalent) + +**Procedure** +1. Check node readiness and nsure all nodes have STATUS=Ready. + ```kubectl get nodes -o wide``` + +2. Check control plane components and verify etcd, scheduler, and controller-manager are healthy. + ```$> kubectl get componentstatuses``` + +3. Check pod health across namespaces. No pods should be stuck in CrashLoopBackOff or Pending state. + ```$> kubectl get pods --all-namespaces``` + +4. Check persistent volumes. All PVs should be in Bound state. + ```$> kubectl get pv``` +5. Check cluster events + ```$> kubectl get events --sort-by=.metadata.creationTimestamp``` + +**Validation** +* All nodes show Ready status. +* All pods in operational namespaces are running without restarts beyond acceptable thresholds. +* No critical events are reported. + +**Rollback** +* Not applicable; this SOP is for verification only. + +### 2. Namespace Creation & Resource Quota Adjustment + +**Purpose** + To create new Kubernetes namespaces or modify quotas for existing workloads while maintaining fair resource allocation. + +**Prerequisites** +* kubectl access with appropriate RBAC permissions +* Approved change request + +**Procedure** +1. Create a new namespace + ```$\> kubectl create namespace \``` +2. Apply a resource quota + ``` + cat \<\ + namespace: \ + spec: + hard: + cpu: "10" + memory: 32Gi + pods: "50" + EOF + ``` +**Update quota if needed** +```$> kubectl edit resourcequota -n ``` + +**Validation** +```$> kubectl describe namespace ``` +```$> kubectl describe resourcequota -n ``` +Check that the quota matches approved values. + +**Rollback** +* To delete namespace: +```$> kubectl delete namespace ``` +* To revert quota changes, restore from the previous YAML definition. + +### 3. Node Maintenance & Drain {#5.3-node-maintenance-&-drain} + +**Purpose** + To safely perform maintenance on a Kubernetes node (e.g., kernel patching, hardware replacement) without impacting running workloads. + +**Prerequisites** + +* kubectl access with cluster-admin privileges +* Maintenance window approved in change management system + +**Procedure** + +1. Mark node as unschedulable: +```$> kubectl cordon ``` + +2. 
Drain workloads from the node: +```$> kubectl drain --ignore-daemonsets --delete-emptydir-data``` + +3. Perform maintenance (OS updates, hardware replacement, etc.). +4. Bring node back into scheduling: + $\> kubectl uncordon \ + +**Validation** + +* *‘*kubectl get nodes*’* shows nodes in Ready state. +* No workloads stuck in Pending state after uncordoning. + +**Rollback** +* If maintenance fails, uncordon the node to resume workload scheduling. + +### 4. Restart Kafka Brokers (CFK) + +**Purpose** +To restart Kafka broker pods running under Confluent for Kubernetes without causing downtime. + +**Prerequisites** +* kubectl access with namespace permissions for CFK +* Confirm rolling restart strategy in Kafka configuration + +**Procedure** +1. Identify broker pods: + ```$> kubectl get pods -n -l app=kafka``` +2. Restart a broker (one at a time): +```$> kubectl delete pod -n ``` +3. Wait for the pod to be recreated and reach Running state before restarting the next broker. + +**Validation** + +* Verify all brokers are in Running state. +```kubectl get pods``` +* Kafka client tests confirm no data loss or downtime. + +**Rollback** +* If restart causes issues, redeploy from last known good configuration or scale from backup node pool. + +### 5. Download Logs + +**Purpose** +To collect relevant logs for troubleshooting incidents within agreed SLA timelines. + +**Prerequisites** + +* kubectl access to affected namespaces +* Storage location for saving logs + +**Procedure** + +1. Get logs for a specific pod (complete logs): +```$> kubectl logs -n ``` +2. Get logs for last 2 hrs for a specific pod: +```$> kubectl lgos -n --since=2h``` +3. Get logs for last 1 hr for a specific pod: +```$> kubectl lgos -n --since=1h``` +2. For a previous container instance (if it restarted): +```$> kubectl logs -n --previous``` +3. Export logs to a file: +```$> kubectl logs -n > /tmp/.log``` +4. Compress and send logs to the incident tracker or client: +```$> tar -czvf logs.tar.gz /tmp/*.log``` + +**Validation** + +* Logs are complete for the specified time period. +* Files are accessible to relevant teams. + +**Rollback** +* Not applicable; log retrieval is read-only. + +### 6. System Status Assessment + +**Purpose** +To perform a complete health assessment of the Kubernetes cluster and Confluent Platform components during incidents or routine audits. + +**Prerequisites** +* kubectl access with read permissions to all relevant namespaces +* Access to monitoring dashboards (Grafana, Prometheus, Confluent Control Center) + +**Procedure** +1. Check Kubernetes node health: +```$> kubectl get nodes -o wide``` +2. Check pod status for all namespaces: +```$> kubectl get pods --all-namespaces``` +3. Check Kafka broker status in Confluent Control Center or via CLI: +```$> kafka-broker-api-versions --bootstrap-server ``` +4. Check Kafka Controller Quorum (KRaft mode): +```kubectl exec -it -n -- kafka-metadata-quorum.sh describe --status``` +5. Review cluster events for anomalies: +```$> kubectl get events --sort-by=.metadata.creationTimestamp``` + +**Validation** +* All nodes are Ready. +* No critical pods in CrashLoopBackOff or Pending. +* Kafka brokers and Zookeeper nodes report healthy status. + +**Rollback** +* Not applicable; assessment is read-only. + +### 7. JMX Metrics Retrieval (Kafka & CFK) + +**Purpose** + To gather JMX metrics from Kafka brokers for performance and health analysis. 
+ +**Prerequisites** +* JMX enabled in Kafka configuration +* Access to JMX exporter endpoint or port-forward capability + +**Procedure** +1. Identify the broker pod to collect metrics from: +```$> kubectl get pods -n -l app=kafka ``` +2. Port-forward JMX port (e.g., 5555\) to local machine: +```$> kubectl port-forward -n 5555:5555``` +3. Use JMX client (e.g., jconsole, jmxterm) to connect: +```jconsole localhost:5555``` +4. Navigate through MBeans to retrieve metrics such as: + ```sh + kafka.server:type=BrokerTopicMetrics + kafka.network:type=RequestMetrics + kafka.controller:type=KafkaController + ``` + +**Validation** +* Metrics are accessible without errors. +* Data matches expectations based on workload. + +**Rollback** +* Close port-forward session and disconnect JMX client. + +### 8. Enhanced / Extended Alerts Setup + +**Purpose** +To configure and validate monitoring alerts for proactive issue detection. + +**Prerequisites** +* Access to Prometheus and Alertmanager configuration +* Access to Slack, email, or PagerDuty for notifications + +**Procedure** +1. Review existing alert rules in Prometheus: +```$> kubectl get configmap prometheus-server -n -o yaml``` +2. Add new alert rules (e.g., high broker CPU, partition under-replicated): +```Update Prometheus rule files with thresholds. ``` +3. Reload Prometheus configuration: +```$> kubectl delete pod -n ``` +4. Test alert by simulating conditions (e.g., scale down a broker). +5. Verify alert delivery to configured channels (Slack/PagerDuty). + +**Validation** +* Alert triggers when threshold is breached. +* Notifications are received in the correct channel. + +**Rollback** +* Restore previous alert rule configuration from backup. + +### 9. Cluster Creation + +**Purpose** +To provision new Kubernetes clusters or namespaces for CFK workloads and ensure smooth daily operations. + +**Prerequisites** +* Approved cluster sizing plan +* Access to infrastructure provisioning tool (e.g., Terraform, eksctl, gcloud CLI, kubectl) +* Network and security configurations approved + +**Procedure** + +1. Provision the cluster using the approved method (Terraform, eksctl, gcloud CLI, etc.). +2. Configure RBAC roles and service accounts for application teams. +3. Deploy CFK Operator and required CRDs. +4. Deploy Kafka, Zookeeper, Schema Registry, Connect, and other Confluent components. +5. Verify deployments by checking pod status and running test workloads. +6. Document cluster details and update BAU runbook. + +**Validation** +* All components are deployed and running. +* Cluster passes initial health checks. + +**Rollback** +* Destroy the cluster using the same tool (terraform destroy, eksctl delete cluster, etc.) if provisioning fails. + + +### 10. Credential Addition + +**Purpose** + To securely add or rotate credentials for Kafka, Schema Registry, Connect, or Kubernetes RBAC accounts. + +**Prerequisites** + +* Approval from change management +* Access to secret management system (Kubernetes Secrets, HashiCorp Vault, etc.) + +**Procedure** +1. Create or update Kubernetes Secret: +```kubectl create secret generic --from-literal=username= --from-literal=password= -n ``` +2. Patch the deployment or StatefulSet to mount the updated secret. +3. Restart affected pods if they don’t pick up new secrets automatically. + +**Validation** +* Applications authenticate successfully using new credentials. + +**Rollback** +* Restore previous secret from backup or Vault. + +### 11. 
Confluent Platform (CP) Upgrades + +**Purpose** + To upgrade CFK-managed Confluent components to a newer supported version. + +**Prerequisites** +* Compatibility check completed +* Backup of data and configurations +* Maintenance window approved + +**Procedure** +1. Review release notes for the target CFK version (check compatibility with Kubernetes and CP) +2. Backup existing CFK configuration: +```$\> kubectl get confluent \-n \ \-o yaml \> cfk-all-backup.yaml``` +3. Upgrade the CFK Operator Helm chart or manifest: + * For Helm: + ```$> helm upgrade cfk confluentinc/confluent-for-kubernetes --version ``` + * For YAML manifests: apply the updated manifest +4. Monitor Operator logs to ensure no failures: +```kubectl logs -n deploy/confluent-operator``` +5. Sequentially upgrade platform components (Kafka, Zookeeper, Connect, SR, etc.) using the updated CRDs. +6. Validate: + * Check cluster health via: +```$> kubectl get pods``` + * Verify Kafka broker status (logs, metrics, JMX). + * Run smoke tests (produce/consume to a test topic). + +**Expected Outcome**: +* Operator and all Confluent Platform pods run with the new version. +* No disruption to workloads beyond expected rolling restart downtime. + +**Validation** +* All components run the new version. +* No errors in logs after upgrade. + +### 12. Operator Upgrades + +**Purpose** +To upgrade the Confluent for Kubernetes (CFK) operator to the latest stable version. + +**Prerequisites** +* Review release notes for breaking changes +* Backup existing CRDs and configurations + +**Procedure** +1. Download the updated operator manifest from Confluent’s repository. +2. Apply the updated manifest: +```$> kubectl apply -f .yaml``` +3. Monitor operator pod restart and logs. + +**Validation** +* Operator pod runs the new version. +* Managed resources are unaffected. + +**Rollback** +* Reapply previous operator manifest from backup. + +### 13. Rollback Procedure +**Purpose** +Safely revert to a previous stable CFK version if upgrade fails. + +**Procedure** +1. Identify previous working version of CFK. +2. Roll back Operator: + * Helm: ```helm rollback cfk ``` + * YAML: apply the last known good manifest. +3. Restore backup CRDs and custom resources from pre-upgrade YAMLs. +4. Restart affected pods if needed: + ```$> kubectl rollout restart statefulset ``` +5. Verify Confluent Platform cluster stability: + * Brokers should form quorum. + * Schema Registry, Connect, ksqlDB should serve requests. +6. Communicate rollback action to stakeholders. + +**Expected Outcome**: +* Cluster returns to the last known working version. +* No lingering upgrade artifacts or partial deployments. + +**Monitoring and Validation** + +* **During upgrade/rollback**: + * Watch `kubectl get pods -w` for rolling restart progress. + * Monitor Kafka broker logs for rebalancing or ISR shrinkage. + * Track Prometheus/Grafana dashboards (CPU, memory, partitions, lag). + +* **Post change**: + * Confirm health checks + * Validate Confluent Control Center dashboards (if deployed) + * Run integration tests (producers/consumers, connector tasks) From a426c69b71a86c303365d28a3da84f627224e757 Mon Sep 17 00:00:00 2001 From: Balaji632 Date: Thu, 21 Aug 2025 11:33:34 +0530 Subject: [PATCH 2/5] Minor correction with formatting. 
Removed Table of contents --- operations/day2sop.md | 29 +---------------------------- 1 file changed, 1 insertion(+), 28 deletions(-) diff --git a/operations/day2sop.md b/operations/day2sop.md index 5a2225a..ff88f0e 100644 --- a/operations/day2sop.md +++ b/operations/day2sop.md @@ -5,33 +5,6 @@ layout: Operations --- -**Table of Contents** - -[ Introduction](#heading=) -[Scope](#heading=) -     [In Scope](#2.1-in-scope) -     [Out of Scope](#2.2-out-of-scope) - -[Day 2 Operations Overview](#heading=) -[Support Model & SLA5](#4.-support-model-&-sla) -     [Support Model Structure](#heading=) -     [Communication Channels](#heading=) -     [SLA Definitions](#heading=) -[Kubernetes Day 2 SOPs](#5.-kubernetes-day-2-sops) -     [1 Cluster Health Verification](#5.1-cluster-health-verification) -     [2 Namespace Creation & Resource Quota Adjustment7](#5.2-namespace-creation-&-resource-quota-adjustment) -     [5.3 Node Maintenance & Drain](#5.3-node-maintenance-&-drain) -     [5.4 Restart Kafka Brokers (CFK)](#5.4-restart-kafka-brokers-\(cfk\)) -     [5.5 Retrieve Logs (within 1 \- 2 hour SLA)](#5.5-retrieve-logs-\(within-1---2-hour-sla\)) -     [5.6 System Status Assessment](#5.6-system-status-assessment) -     [5.7 JMX Metrics Retrieval (Kafka & CFK)](#5.7-jmx-metrics-retrieval-\(kafka-&-cfk\)) -     [5.8 Enhanced / Extended Alerts Setup](#5.8-enhanced-/-extended-alerts-setup) -     [5.9 Cluster Creation](#5.9-cluster-creation) -     [5.10 Credential Addition](#5.10-credential-addition) -     [5.11 Confluent Platform (CP) Upgrades](#5.11-confluent-platform-\(cp\)-upgrades) -     [5.12 Operator Upgrades](#5.12-operator-upgrades) -     [5.13 Rollback Procedure](#5.13-rollback-procedure) -     [5.14 Download Logs](#5.14-download-logs) # Introduction @@ -464,7 +437,7 @@ To provision new Kubernetes clusters or namespaces for CFK workloads and ensure **Procedure** 1. Review release notes for the target CFK version (check compatibility with Kubernetes and CP) 2. Backup existing CFK configuration: -```$\> kubectl get confluent \-n \ \-o yaml \> cfk-all-backup.yaml``` +```$> kubectl get confluent -n -o yaml > cfk-all-backup.yaml``` 3. Upgrade the CFK Operator Helm chart or manifest: * For Helm: ```$> helm upgrade cfk confluentinc/confluent-for-kubernetes --version ``` From e7b06682c745b9b61572334c6d61ecfb6e69a340 Mon Sep 17 00:00:00 2001 From: Balaji632 Date: Thu, 21 Aug 2025 11:38:09 +0530 Subject: [PATCH 3/5] Corrected formatting --- operations/day2sop.md | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/operations/day2sop.md b/operations/day2sop.md index ff88f0e..bd7a453 100644 --- a/operations/day2sop.md +++ b/operations/day2sop.md @@ -141,18 +141,18 @@ This section documents step-by-step procedures for core Kubernetes operational t **Procedure** 1. Check node readiness and nsure all nodes have STATUS=Ready. - ```kubectl get nodes -o wide``` +```kubectl get nodes -o wide``` 2. Check control plane components and verify etcd, scheduler, and controller-manager are healthy. - ```$> kubectl get componentstatuses``` +```$> kubectl get componentstatuses``` 3. Check pod health across namespaces. No pods should be stuck in CrashLoopBackOff or Pending state. - ```$> kubectl get pods --all-namespaces``` +```$> kubectl get pods --all-namespaces``` 4. Check persistent volumes. All PVs should be in Bound state. - ```$> kubectl get pv``` +```$> kubectl get pv``` 5. 
Check cluster events - ```$> kubectl get events --sort-by=.metadata.creationTimestamp``` +```$> kubectl get events --sort-by=.metadata.creationTimestamp``` **Validation** * All nodes show Ready status. @@ -173,7 +173,7 @@ This section documents step-by-step procedures for core Kubernetes operational t **Procedure** 1. Create a new namespace - ```$\> kubectl create namespace \``` +```$\> kubectl create namespace \``` 2. Apply a resource quota ``` cat \<\ kubectl edit resourcequota -n ``` **Validation** + ```$> kubectl describe namespace ``` ```$> kubectl describe resourcequota -n ``` + Check that the quota matches approved values. **Rollback** * To delete namespace: ```$> kubectl delete namespace ``` + * To revert quota changes, restore from the previous YAML definition. ### 3. Node Maintenance & Drain {#5.3-node-maintenance-&-drain} From 8ba58b5dddd199d298b4cbe9b0aeebda3d3f1836 Mon Sep 17 00:00:00 2001 From: Balaji632 Date: Thu, 21 Aug 2025 11:44:27 +0530 Subject: [PATCH 4/5] Corrected formatting in Cluster Health Verification --- operations/day2sop.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/operations/day2sop.md b/operations/day2sop.md index bd7a453..811440d 100644 --- a/operations/day2sop.md +++ b/operations/day2sop.md @@ -140,7 +140,7 @@ This section documents step-by-step procedures for core Kubernetes operational t * Access to monitoring tools (Prometheus/Grafana or equivalent) **Procedure** -1. Check node readiness and nsure all nodes have STATUS=Ready. +1. Check node readiness and ensure all nodes have STATUS=Ready. ```kubectl get nodes -o wide``` 2. Check control plane components and verify etcd, scheduler, and controller-manager are healthy. @@ -151,6 +151,7 @@ This section documents step-by-step procedures for core Kubernetes operational t 4. Check persistent volumes. All PVs should be in Bound state. ```$> kubectl get pv``` + 5. Check cluster events ```$> kubectl get events --sort-by=.metadata.creationTimestamp``` @@ -172,8 +173,8 @@ This section documents step-by-step procedures for core Kubernetes operational t * Approved change request **Procedure** -1. Create a new namespace -```$\> kubectl create namespace \``` +1. Create a new namespace: +```$> kubectl create namespace ``` 2. Apply a resource quota ``` cat \<\ kubectl delete namespace ``` * To revert quota changes, restore from the previous YAML definition. @@ -226,7 +228,7 @@ Check that the quota matches approved values. 3. Perform maintenance (OS updates, hardware replacement, etc.). 4. Bring node back into scheduling: - $\> kubectl uncordon \ + $> kubectl uncordon **Validation** From 4b76d455ed2b4aca6b801611f8308143814b6924 Mon Sep 17 00:00:00 2001 From: balaji Date: Wed, 17 Sep 2025 15:32:27 +0530 Subject: [PATCH 5/5] Updated day2 sop with review comments. --- operations/day2sop.md | 645 +++++++++++++++++++++++++++++++----------- 1 file changed, 479 insertions(+), 166 deletions(-) diff --git a/operations/day2sop.md b/operations/day2sop.md index 811440d..f60e39a 100644 --- a/operations/day2sop.md +++ b/operations/day2sop.md @@ -1,11 +1,10 @@ --- -title: Kubernetes Day2 Standard Operating Procedures (SOP) +#title: Confluent Platform Day2 Standard Operating Procedures (SOP) nav_order: 4 -layout: Operations +parent: Operations --- - # Introduction This document serves as the Day 2 Operations Runbook for managing Kubernetes clusters deployed with Confluent for Kubernetes (CFK) in a production environment. 
It contains detailed Standard Operating Procedures (SOPs) for operational tasks, incident handling, and lifecycle management to ensure the cluster and Confluent Platform components remain secure, stable, and performant after initial deployment. @@ -141,19 +140,29 @@ This section documents step-by-step procedures for core Kubernetes operational t **Procedure** 1. Check node readiness and ensure all nodes have STATUS=Ready. -```kubectl get nodes -o wide``` +```bash +$> kubectl get nodes -o wide +``` 2. Check control plane components and verify etcd, scheduler, and controller-manager are healthy. -```$> kubectl get componentstatuses``` +```bash +$> kubectl get componentstatuses +``` 3. Check pod health across namespaces. No pods should be stuck in CrashLoopBackOff or Pending state. -```$> kubectl get pods --all-namespaces``` +```bash +$> kubectl get pods --all-namespaces +``` 4. Check persistent volumes. All PVs should be in Bound state. -```$> kubectl get pv``` +```bash +$> kubectl get pv +``` 5. Check cluster events -```$> kubectl get events --sort-by=.metadata.creationTimestamp``` +```bash +$> kubectl get events --sort-by=.metadata.creationTimestamp +``` **Validation** * All nodes show Ready status. @@ -163,52 +172,7 @@ This section documents step-by-step procedures for core Kubernetes operational t **Rollback** * Not applicable; this SOP is for verification only. -### 2. Namespace Creation & Resource Quota Adjustment - -**Purpose** - To create new Kubernetes namespaces or modify quotas for existing workloads while maintaining fair resource allocation. - -**Prerequisites** -* kubectl access with appropriate RBAC permissions -* Approved change request - -**Procedure** -1. Create a new namespace: -```$> kubectl create namespace ``` -2. Apply a resource quota - ``` - cat \<\ - namespace: \ - spec: - hard: - cpu: "10" - memory: 32Gi - pods: "50" - EOF - ``` -**Update quota if needed** - -```$> kubectl edit resourcequota -n ``` - -**Validation** - -```$> kubectl describe namespace ``` -```$> kubectl describe resourcequota -n ``` - -Check that the quota matches approved values. - -**Rollback** -* To delete namespace: - -```$> kubectl delete namespace ``` - -* To revert quota changes, restore from the previous YAML definition. - -### 3. Node Maintenance & Drain {#5.3-node-maintenance-&-drain} +### 2. Node Maintenance & Drain {#5.3-node-maintenance-&-drain} **Purpose** To safely perform maintenance on a Kubernetes node (e.g., kernel patching, hardware replacement) without impacting running workloads. @@ -216,15 +180,22 @@ Check that the quota matches approved values. **Prerequisites** * kubectl access with cluster-admin privileges -* Maintenance window approved in change management system - +* Maintenance window approved in change management system +* Ensure sufficient cluster capacity to reschedule workloads from the node being drained: + * Add a new node or confirm that existing nodes have enough spare resources. + * If workloads include Kafka pods, verify pod anti-affinity rules are satisfied — checking capacity alone may not be sufficient to guarantee successful rescheduling. + **Procedure** 1. Mark node as unschedulable: -```$> kubectl cordon ``` +```bash +$> kubectl cordon +``` 2. Drain workloads from the node: -```$> kubectl drain --ignore-daemonsets --delete-emptydir-data``` +```bash +$> kubectl drain --ignore-daemonsets --delete-emptydir-data +``` 3. Perform maintenance (OS updates, hardware replacement, etc.). 4. 
Bring node back into scheduling:
+```bash
+$> kubectl uncordon <node-name>
+```

**Validation**
@@ -238,7 +209,7 @@ Check that the quota matches approved values.
**Rollback**
* If maintenance fails, uncordon the node to resume workload scheduling.

-### 4. Restart Kafka Brokers (CFK)
+### 3. Restart Kafka Brokers (CFK)

**Purpose**
To restart Kafka broker pods running under Confluent for Kubernetes without causing downtime.
@@ -248,22 +219,57 @@ To restart Kafka broker pods running under Confluent for Kubernetes without caus
* Confirm rolling restart strategy in Kafka configuration

**Procedure**
+
+*Option A – Restart all brokers sequentially using a StatefulSet rollout (preferred for a full restart):*
+
+1. Identify the StatefulSet managing the Kafka brokers:
+```bash
+$> kubectl get statefulsets -n <namespace>
+```
+2. Trigger a rolling restart:
+```bash
+$> kubectl rollout restart statefulset <statefulset-name> -n <namespace>
+```
+3. Monitor rollout status:
+```bash
+$> kubectl rollout status statefulset <statefulset-name> -n <namespace>
+```
+
+**Validation**
+
+* Verify all brokers are in Running state:
+```bash
+$> kubectl get pods -n <namespace> -l app=kafka
+```
+* Kafka client tests confirm no data loss or downtime.
+
+**Rollback**
+* If restart causes issues, redeploy from the last known good configuration or scale from a backup node pool.
+
+*Option B – Restart individual brokers manually (one at a time):*
1. Identify broker pods:
-```$> kubectl get pods -n <namespace> -l app=kafka```
+```bash
+$> kubectl get pods -n <namespace> -l app=kafka
+```
2. Restart a broker (one at a time):
-```$> kubectl delete pod <broker-pod> -n <namespace>```
+```bash
+$> kubectl delete pod <broker-pod> -n <namespace>
+```
3. Wait for the pod to be recreated and reach Running state before restarting the next broker.

**Validation**

* Verify all brokers are in Running state.
-```kubectl get pods```
+```bash
+$> kubectl get pods -n <namespace>
+```
* Kafka client tests confirm no data loss or downtime.
+
+
**Rollback**
* If restart causes issues, redeploy from the last known good configuration or scale from a backup node pool.

-### 5. Download Logs
+### 4. Download Logs

**Purpose**
To collect relevant logs for troubleshooting incidents within agreed SLA timelines.
@@ -276,17 +282,29 @@ To collect relevant logs for troubleshooting incidents within agreed SLA timelin
**Procedure**

1. Get logs for a specific pod (complete logs):
-```$> kubectl logs <pod-name> -n <namespace>```
+```bash
+$> kubectl logs <pod-name> -n <namespace>
+```
2. Get logs for the last 2 hours for a specific pod:
-```$> kubectl lgos <pod-name> -n <namespace> --since=2h```
+```bash
+$> kubectl logs <pod-name> -n <namespace> --since=2h
+```
3. Get logs for the last 1 hour for a specific pod:
-```$> kubectl lgos <pod-name> -n <namespace> --since=1h```
+```bash
+$> kubectl logs <pod-name> -n <namespace> --since=1h
+```
4. For a previous container instance (if it restarted):
-```$> kubectl logs <pod-name> -n <namespace> --previous```
+```bash
+$> kubectl logs <pod-name> -n <namespace> --previous
+```
5. Export logs to a file:
-```$> kubectl logs <pod-name> -n <namespace> > /tmp/<pod-name>.log```
+```bash
+$> kubectl logs <pod-name> -n <namespace> > /tmp/<pod-name>.log
+```
6. Compress and send logs to the incident tracker or client:
-```$> tar -czvf logs.tar.gz /tmp/*.log```
+```bash
+$> tar -czvf logs.tar.gz /tmp/*.log
+```
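If logs are needed from several components at once, a small helper loop can collect them in a single pass. The following is a minimal sketch only; the namespace, log window, and output directory are assumed values to adapt before use:

```bash
#!/usr/bin/env bash
# Sketch: collect recent logs from every pod in a namespace and bundle them.
# NAMESPACE, SINCE, and OUTDIR are assumed values - adjust to the environment.
set -euo pipefail

NAMESPACE="<namespace>"
SINCE="2h"                                  # log window matching the SLA
OUTDIR="/tmp/logs-$(date +%Y%m%d-%H%M)"

mkdir -p "${OUTDIR}"

# Dump the recent logs of each pod (all containers) into its own file.
for pod in $(kubectl get pods -n "${NAMESPACE}" -o name); do
  kubectl logs "${pod}" -n "${NAMESPACE}" --all-containers --since="${SINCE}" \
    > "${OUTDIR}/${pod#pod/}.log" || true   # continue even if a pod has no logs
done

# Bundle everything for the incident tracker.
tar -czvf "${OUTDIR}.tar.gz" -C "$(dirname "${OUTDIR}")" "$(basename "${OUTDIR}")"
```

The resulting archive can be attached to the incident record in the same way as the single-pod logs above.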
**Validation**
@@ -296,7 +314,7 @@ To collect relevant logs for troubleshooting incidents within agreed SLA timelin
**Rollback**
* Not applicable; log retrieval is read-only.

-### 6. System Status Assessment
+### 5. System Status Assessment

**Purpose**
To perform a complete health assessment of the Kubernetes cluster and Confluent Platform components during incidents or routine audits.
@@ -307,15 +325,25 @@ To perform a complete health assessment of the Kubernetes cluster and Confluent
**Procedure**
1. Check Kubernetes node health:
-```$> kubectl get nodes -o wide```
+```bash
+$> kubectl get nodes -o wide
+```
2. Check pod status for all namespaces:
-```$> kubectl get pods --all-namespaces```
+```bash
+$> kubectl get pods --all-namespaces
+```
3. Check Kafka broker status in Confluent Control Center or via CLI:
-```$> kafka-broker-api-versions --bootstrap-server <broker-host:port>```
+```bash
+$> kafka-broker-api-versions --bootstrap-server <broker-host:port>
+```
4. Check Kafka Controller Quorum (KRaft mode):
-```kubectl exec -it <kraft-pod> -n <namespace> -- kafka-metadata-quorum.sh describe --status```
+```bash
+$> kubectl exec -it <kraft-pod> -n <namespace> -- kafka-metadata-quorum.sh describe --status
+```
5. Review cluster events for anomalies:
-```$> kubectl get events --sort-by=.metadata.creationTimestamp```
+```bash
+$> kubectl get events --sort-by=.metadata.creationTimestamp
+```

**Validation**
* All nodes are Ready.
@@ -325,28 +353,16 @@ To perform a complete health assessment of the Kubernetes cluster and Confluent
**Rollback**
* Not applicable; assessment is read-only.

-### 7. JMX Metrics Retrieval (Kafka & CFK)
+### 6. JMX Metrics Retrieval (Kafka & CFK)

**Purpose**
- To gather JMX metrics from Kafka brokers for performance and health analysis.
+ To perform a health check on Kafka brokers for performance.

**Prerequisites**
-* JMX enabled in Kafka configuration
-* Access to JMX exporter endpoint or port-forward capability
+* Streamtime UI

**Procedure**
-1. Identify the broker pod to collect metrics from:
-```$> kubectl get pods -n <namespace> -l app=kafka```
-2. Port-forward JMX port (e.g., 5555) to local machine:
-```$> kubectl port-forward <broker-pod> -n <namespace> 5555:5555```
-3. Use JMX client (e.g., jconsole, jmxterm) to connect:
-```jconsole localhost:5555```
-4. Navigate through MBeans to retrieve metrics such as:
- ```sh
- kafka.server:type=BrokerTopicMetrics
- kafka.network:type=RequestMetrics
- kafka.controller:type=KafkaController
- ```
+1. Review broker health and performance metrics in the Streamtime UI.

**Validation**
* Metrics are accessible without errors.
* Data matches expectations based on workload.
@@ -355,7 +371,7 @@ To perform a complete health assessment of the Kubernetes cluster and Confluent
**Rollback**
* Not applicable; metrics retrieval is read-only.

-### 8. Enhanced / Extended Alerts Setup
+### 7. Enhanced / Extended Alerts Setup

**Purpose**
To configure and validate monitoring alerts for proactive issue detection.
@@ -366,11 +382,17 @@ To configure and validate monitoring alerts for proactive issue detection.
**Procedure**
1. Review existing alert rules in Prometheus:
-```$> kubectl get configmap prometheus-server -n <namespace> -o yaml```
+```bash
+$> kubectl get configmap prometheus-server -n <namespace> -o yaml
+```
2. Add new alert rules (e.g., high broker CPU, under-replicated partitions):
-```Update Prometheus rule files with thresholds. ```
+Update the Prometheus rule files with the required thresholds and apply the change.
3. Reload Prometheus configuration:
-```$> kubectl delete pod <prometheus-pod> -n <namespace>```
+```bash
+$> kubectl delete pod <prometheus-pod> -n <namespace>
+```
4. Test alert by simulating conditions (e.g., scale down a broker).
5. Verify alert delivery to configured channels (Slack/PagerDuty).

**Validation**
@@ -381,81 +403,385 @@ To configure and validate monitoring alerts for proactive issue detection.
**Rollback**
* Restore previous alert rule configuration from backup.

-### 9. Cluster Creation
+### 8. Credential Addition

**Purpose**
-To provision new Kubernetes clusters or namespaces for CFK workloads and ensure smooth daily operations.
+To securely add or rotate credentials in Confluent for Kubernetes (CFK), use Kubernetes Secrets or integrate with Vault.
Credential updates for Kafka, Schema Registry, Connect, or Kubernetes RBAC accounts follow a similar pattern. **Prerequisites** -* Approved cluster sizing plan -* Access to infrastructure provisioning tool (e.g., Terraform, eksctl, gcloud CLI, kubectl) -* Network and security configurations approved -**Procedure** +* Approval from change management +* Access to secret management system (Kubernetes Secrets, HashiCorp Vault, etc.) +* Knowledge of the authentication mechanism enabled (e.g., SASL/PLAIN, SASL/SCRAM, Basic Auth, mTLS) -1. Provision the cluster using the approved method (Terraform, eksctl, gcloud CLI, etc.). -2. Configure RBAC roles and service accounts for application teams. -3. Deploy CFK Operator and required CRDs. -4. Deploy Kafka, Zookeeper, Schema Registry, Connect, and other Confluent components. -5. Verify deployments by checking pod status and running test workloads. -6. Document cluster details and update BAU runbook. + +**Procedure** +**Using Kubernetes Secrects** + +*Kafka (CFK, SASL/PLAIN)* +1. Create a Kubernetes Secret with SASL/PLAIN user credentials: +```bash + kubectl create secret generic \ + --from-literal=username= \ + --from-literal=password= \ + -n +``` +2. Reference the secret in the Kafka CR under spec.kafka.authentication.type: plain. +3. Restart broker pods if changes are not picked up dynamically. + +*Kafka (CFK, SASL/SCRAM)* + +1. Create the user via the Confluent CLI or by updating the KafkaUser custom resource: +```bash +apiVersion: platform.confluent.io/v1beta1 +kind: KafkaUser +metadata: + name: + namespace: +spec: + authentication: + type: scram-sha-512 +``` +2. Apply the resource: +```bash +kubectl apply -f .yaml +``` + +*Schema Registry (CFK, Basic Auth)* + +1. Create a Kubernetes Secret for the Basic Auth credentials: +```bash +kubectl create secret generic \ + --from-literal=basic.username= \ + --from-literal=basic.password= \ + -n +``` +2. Update the Schema Registry CR to mount the secret under spec.config. +3. Restart the Schema Registry pod(s) if required. + + +*Redpanda (SASL/SCRAM)* + +1. Use rpk acl user create to add a new SASL/SCRAM user: +```bash +rpk acl user create -p --api-urls +``` + +2. Update client configurations with the new credentials. **Validation** -* All components are deployed and running. -* Cluster passes initial health checks. -**Rollback** -* Destroy the cluster using the same tool (terraform destroy, eksctl delete cluster, etc.) if provisioning fails. - +* Clients can authenticate using the new credentials. +* Relevant pods (Kafka, Schema Registry, Redpanda) show no authentication errors in logs. -### 10. Credential Addition +**Rollback** -**Purpose** - To securely add or rotate credentials for Kafka, Schema Registry, Connect, or Kubernetes RBAC accounts. +* Revert to the previous credentials from backup or Vault. +* Remove the faulty user/secret if authentication fails. -**Prerequisites** -* Approval from change management -* Access to secret management system (Kubernetes Secrets, HashiCorp Vault, etc.) +### 9. Upgrade Confluent For Kubernetes (CFK) -**Procedure** -1. Create or update Kubernetes Secret: -```kubectl create secret generic --from-literal=username= --from-literal=password= -n ``` -2. Patch the deployment or StatefulSet to mount the updated secret. -3. Restart affected pods if they don’t pick up new secrets automatically. +**Purpose** + To upgrade CFK-managed Confluent components to a newer supported version. The following paths are supported: + + Upgrade both Confluent Platform and CFK: + + Step 1. 
Upgrade CFK. + Step 2. Upgrade Confluent Platform Using Confluent for Kubernetes. -**Validation** -* Applications authenticate successfully using new credentials. -**Rollback** -* Restore previous secret from backup or Vault. +**Prerequisites** +* Compatibility check completed +* Backup of data and configurations +* Maintenance window approved -### 11. Confluent Platform (CP) Upgrades +**Procedure** +1. If upgrading CFK 2.x to 3.x to deploy and manage Confluent Platform 7.x, set the annotation for the components you want to use Log4j: +```bash +kubectl annotate \ + platform.confluent.io/use-log4j1=true \ + --namespace +``` +The ```platform.confluent.io/use-log4j1=true``` annotation is required to use Confluent Platform 7.x with CFK 3.0+. + +2. Disable resource reconciliation - To prevent Confluent Platform components from rolling restarts, temporarily disable resource reconciliation of the components in each namespace where the Confluent Platform is deployed, specifying the CR kinds and CR names (*whichever is applicable*): + +```bash +kubectl annotate connect connect \ + platform.confluent.io/block-reconcile=true \ + --namespace +``` +```bash +kubectl annotate controlcenter controlcenter \ + platform.confluent.io/block-reconcile=true \ + --namespace +``` +```bash +kubectl annotate kafkarestproxy kafkarestproxy \ + platform.confluent.io/block-reconcile=true \ + --namespace +``` +```bash +kubectl annotate kafka kafka \ + platform.confluent.io/block-reconcile=true \ + --namespace +``` +```bash +kubectl annotate ksqldb ksqldb \ + platform.confluent.io/block-reconcile=true \ + --namespace +``` +```bash +kubectl annotate schemaregistry schemaregistry \ + platform.confluent.io/block-reconcile=true \ + --namespace +``` +```bash +kubectl annotate kraftcontroller kraftcontroller \ + platform.confluent.io/block-reconcile=true \ + --namespace +``` + +3. Add the CFK Helm repo: +```bash +helm repo add confluentinc https://packages.confluent.io/helm +``` +```bash +helm repo update +``` + +4. Get the CFK chart +* From the Helm repo + + a. Get the latest CFK chart +```bash +helm pull confluentinc/confluent-for-kubernetes --untar +``` + + b. Get a specific version of the CFK chart +```bash +helm pull confluentinc/confluent-for-kubernetes --version --untar +``` + +5. Upgrade Confluent Platform custom resource definitions (CRDs) +```bash +kubectl apply -f confluent-for-kubernetes/crds/ +``` + +**If the command returns an error similar to the below:** +```bash +The CustomResourceDefinition "kafkas.platform.confluent.io" is invalid: +metadata.annotations: Too long: must have at most 262144 bytes make: *** +[install-crds] Error 1 +``` + +Run the following command: +```bash +kubectl apply --server-side=true -f +``` + +**If running kubectl apply with the --server-side=true flag returns an error similar to the below:** +```bash +Apply failed with 1 conflict: conflict with "helm" using +apiextensions.k8s.io/v1: .spec.versions Please review the fields +above--they currently have other managers. +``` + +Run kubectl apply with an additional flag, --force-conflicts: + +```bash +kubectl apply --server-side=true --force-conflicts -f +``` + +6. Upgrade CFK + +Find the default values.yaml file: +```bash +mkdir -p + +helm pull confluentinc/confluent-for-kubernetes \ + --untar \ + --untardir= \ + --namespace +``` + +The `values.yaml` file is in the `/confluent-for-kubernetes` directory. Create a copy of the `values.yaml` file to customize CFK configuration. Do not edit the default `values.yaml` file. 
+ +Save your copy to any file location; we will refer to this location as ``. Open `values.yaml` and modify parameter `namespaced: false` + +In values.yaml, set parameter `namespaced: false` + +**Install CFK using the customized configuration** +```bash +helm upgrade --install confluent-operator \ + confluentinc/confluent-for-kubernetes \ + --values \ + --namespace +``` + +7. Upgrade CFK to a specific version, such as a hotfix or a patch version (if applicable) - *If Applicable* + +In values.yaml, update the CFK image.tag to the image tag of the CFK version specified in Confluent for Kubernetes image tags: + +```bash +image: + tag: "" +``` + +Run the following command: +```bash +helm upgrade --install confluent-operator \ + confluentinc/confluent-for-kubernetes \ + --values \ + --namespace +``` + + +8. Enable resource reconciliation for each Confluent Platform components that you disabled reconciliation in the first step above. +```bash +kubectl annotate \ + platform.confluent.io/block-reconcile- \ + --namespace +``` + +9. Upgrade CFK init container + +In each Confluent Platform component CR, update the CFK init container image tag to the version of CFK you are upgrading to + +```bash +kind: +spec: + image: + init: confluentinc/confluent-init-container: +``` + +### 9. Upgrade Confluent Platform Using Confluent for Kubernetes **Purpose** - To upgrade CFK-managed Confluent components to a newer supported version. + To upgrade CFK-managed Confluent Platform components to a newer supported version. **Prerequisites** * Compatibility check completed * Backup of data and configurations * Maintenance window approved +**NOTE** + +Upgrade KRaft-based Confluent Platform 7.x deployments in the following order: + +1. KRaft +2. Kafka +3. Other Confluent components, excluding Control Center and Control Center (Legacy), in any order +4. Control Center + + +Upgrade KRaft-based Confluent Platform 8.0.0 deployments in the following order. For more information, see Upgrade Control Center from 2.0 or 2.1 to 2.2 in Confluent Platform 8.0. + +1. Control Center +2. KRaft +3. Kafka +4. Other Confluent components, excluding Control Center and Control Center (Legacy), in any order + **Procedure** 1. Review release notes for the target CFK version (check compatibility with Kubernetes and CP) 2. Backup existing CFK configuration: -```$> kubectl get confluent -n -o yaml > cfk-all-backup.yaml``` -3. Upgrade the CFK Operator Helm chart or manifest: - * For Helm: - ```$> helm upgrade cfk confluentinc/confluent-for-kubernetes --version ``` - * For YAML manifests: apply the updated manifest -4. Monitor Operator logs to ensure no failures: -```kubectl logs -n deploy/confluent-operator``` -5. Sequentially upgrade platform components (Kafka, Zookeeper, Connect, SR, etc.) using the updated CRDs. -6. Validate: - * Check cluster health via: -```$> kubectl get pods``` - * Verify Kafka broker status (logs, metrics, JMX). - * Run smoke tests (produce/consume to a test topic). +```bash +$> kubectl get confluent -n -o yaml > cfk-all-backup.yaml +``` +3. Upgrade KRaft + +a. In the KRaftController CR, update the component image tag. The tag is the Confluent Platform release you want to upgrade to. +```bash +kind: KRaftController +spec: + image: + application: confluentinc/cp-server: +``` + +b. In the same KRaftController CR, verify that the CFK init container image tag has been updated during the CFK upgrade process. 
The image tag should be the current version of CFK +```bash +spec: + image: + init: confluentinc/confluent-init-container:3.0.0 +``` + +c. Upgrade KRaft +``` +kubectl apply -f --name +``` + +4. Upgrade Kafka + +Upgrade Kafka that is deployed in the KRaft mode to the latest version in the following steps: + +a. In the Kafka CR, update the component image tag. The tag is the Confluent Platform release you want to upgrade to. +```bash +kind: Kafka +spec: + image: + application: confluentinc/cp-server: +``` + +b. In the same Kafka CR, verify that the CFK init container image tag has been updated during the CFK upgrade process. The image tag should be the current version of CFK +```bash +spec: + image: + init: confluentinc/confluent-init-container:3.0.0 +``` + +c. Upgrade Kafka +```bash +kubectl apply -f --name +``` + +5. Update metadata version of KRaft and Kafka +After verifying that the cluster behavior and performance meet your expectations, increment the metadata version for the controllers and brokers by running the kafka-features tool with the upgrade argument: + +```bash +./bin/kafka-features upgrade --bootstrap-server --metadata 4.0 +``` + +6. Upgrade other Confluent Platform components + +Upgrade Confluent Platform components as below: + +a. In the component CR, update the component image tag. The tag is the Confluent Platform release you want to upgrade to: +```bash +spec: + image: + application: : +``` + +b. If upgrading Control Center, specify the Control Center release as the Control Center image tag, the Prometheus image tag, and the Alertmanager image tag in the ControlCenter CR. Control Center is on independent versions and does not follow Confluent Platform releases. + + is the Control Center release you are installing. + +```bash +kind: ControlCenter +spec: + image: + application: confluentinc/cp-enterprise-control-center-next-gen: + init: confluentinc/confluent-init-container:3.0.0 + services: + prometheus: + image: confluentinc/cp-enterprise-prometheus: + pvc: + dataVolumeCapacity: 10Gi + alertmanager: + image: confluentinc/cp-enterprise-alertmanager: +``` +c. In the same component CR, verify that the CFK init container image tag has been updated during the CFK upgrade process. The image tag should be the current version of CFK +```bash +spec: + image: + init: confluentinc/confluent-init-container:3.0.0 +``` + +d. Upgrade the component +```bash +kubectl apply -f --name +``` **Expected Outcome**: * Operator and all Confluent Platform pods run with the new version. @@ -465,40 +791,23 @@ To provision new Kubernetes clusters or namespaces for CFK workloads and ensure * All components run the new version. * No errors in logs after upgrade. -### 12. Operator Upgrades - -**Purpose** -To upgrade the Confluent for Kubernetes (CFK) operator to the latest stable version. - -**Prerequisites** -* Review release notes for breaking changes -* Backup existing CRDs and configurations +### 12. Rollback Procedure -**Procedure** -1. Download the updated operator manifest from Confluent’s repository. -2. Apply the updated manifest: -```$> kubectl apply -f .yaml``` -3. Monitor operator pod restart and logs. - -**Validation** -* Operator pod runs the new version. -* Managed resources are unaffected. - -**Rollback** -* Reapply previous operator manifest from backup. - -### 13. Rollback Procedure **Purpose** Safely revert to a previous stable CFK version if upgrade fails. **Procedure** 1. Identify previous working version of CFK. 2. 
Roll back Operator:
 * Helm:
 ```bash
 helm rollback cfk <revision>
 ```
 * YAML: apply the last known good manifest.
3. Restore backup CRDs and custom resources from pre-upgrade YAMLs.
4. Restart affected pods if needed:
 ```bash
 $> kubectl rollout restart statefulset <statefulset-name>
 ```
5. Verify Confluent Platform cluster stability:
 * Brokers should form quorum.
 * Schema Registry, Connect, and ksqlDB should serve requests.
6. Communicate the rollback action to stakeholders.

**Expected Outcome**:
* Cluster returns to the last known working version.
* No lingering upgrade artifacts or partial deployments.

**Monitoring and Validation**

* **During upgrade/rollback**:
  * Watch `kubectl get pods -w` for rolling restart progress.
  * Monitor Kafka broker logs for rebalancing or ISR shrinkage.
  * Track Prometheus/Grafana dashboards (CPU, memory, partitions, lag).

* **Post change**:
  * Confirm health checks pass.
  * Validate Confluent Control Center dashboards (if deployed).
  * Run integration tests (producers/consumers, connector tasks).
+
+
+[//]: #
+  [Upgrade CFK]:
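For the integration-test step above, a quick produce/consume round trip against a throwaway topic is usually enough to confirm the data path. The commands below are a rough sketch only; the pod name (kafka-0), namespace (confluent), listener port (9092), and client properties path are assumptions to adapt to the actual deployment:

```bash
# Assumed values: broker pod kafka-0, namespace "confluent", internal listener on 9092,
# and a client properties file already available at /mnt/config/client.properties.
kubectl exec -it kafka-0 -n confluent -- bash -c '
  # Create a short-lived test topic.
  kafka-topics --bootstrap-server localhost:9092 \
    --command-config /mnt/config/client.properties \
    --create --topic day2-smoke-test --partitions 3 --replication-factor 3

  # Produce a single test record.
  echo "smoke-test-$(date +%s)" | kafka-console-producer \
    --bootstrap-server localhost:9092 \
    --producer.config /mnt/config/client.properties \
    --topic day2-smoke-test

  # Read it back; exits after one message.
  kafka-console-consumer --bootstrap-server localhost:9092 \
    --consumer.config /mnt/config/client.properties \
    --topic day2-smoke-test --from-beginning --max-messages 1

  # Clean up the test topic.
  kafka-topics --bootstrap-server localhost:9092 \
    --command-config /mnt/config/client.properties \
    --delete --topic day2-smoke-test
'
```

If the internal listener is unauthenticated, the --command-config, --producer.config, and --consumer.config flags can be dropped.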