[Bug]: Kafka Pods in CrashLoopBackOff State - Strimzi Kafka Version 0.39 #11078
mudasirhaji started this conversation in General
-
Sharing the full configurations and logs would be a good start.
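For reference, a minimal sketch of how that material could be collected, using the cluster name and namespace from the report below; the operator deployment name and namespace are assumptions and should be adjusted to the actual installation:

# Kafka custom resource as currently applied
kubectl get kafka s4-cluster -n smartsafety4s-qa -o yaml

# Logs and events from one of the crashing broker pods (--previous shows the last crashed container)
kubectl logs s4-cluster-kafka-0 -n smartsafety4s-qa --previous
kubectl describe pod s4-cluster-kafka-0 -n smartsafety4s-qa

# Cluster operator logs (deployment name and namespace assumed)
kubectl logs deployment/strimzi-cluster-operator -n strimzi-operator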
-
Bug Description
We are facing an issue with the Kafka pods in the s4-cluster (namespace smartsafety4s-qa). The Kafka pods are in a CrashLoopBackOff or Error state, preventing the Kafka cluster from starting up. The issue started occurring after upgrading to Strimzi Kafka version 0.39.
Current Pod Status:
kubectl get po -n smartsafety4s-qa | grep s4
s4-cluster-kafka-0 0/1 CrashLoopBackOff 10 28m
s4-cluster-kafka-1 0/1 CrashLoopBackOff 14 32m
s4-cluster-kafka-2 0/1 Error 14 31m
s4-cluster-zookeeper-0 0/1 Running 0 29m
s4-cluster-zookeeper-1 1/1 Running 0 29m
s4-cluster-zookeeper-2 1/1 Running 0 29m
Attempted Configuration Fixes:
Liveness and Readiness Probes Fix: We attempted to add livenessProbe and readinessProbe configurations in the Kafka and Zookeeper templates under the containers section, following Strimzi's recommended structure (a configuration sketch follows this list).
Removed Incorrect Fields: We removed invalid fields such as spec and brokers that had been mistakenly added to the configuration file.
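As an illustration of the probe change, here is a minimal sketch of tuning the broker probes through the Kafka custom resource itself rather than the pod template; the field names follow Strimzi's Probe schema (spec.kafka.livenessProbe / spec.kafka.readinessProbe), and the values shown are placeholders only:

# Merge-patch the Kafka resource; the operator reconciles the change and rolls the broker pods
kubectl patch kafka s4-cluster -n smartsafety4s-qa --type merge \
  -p '{"spec":{"kafka":{"livenessProbe":{"initialDelaySeconds":15,"timeoutSeconds":5},"readinessProbe":{"initialDelaySeconds":15,"timeoutSeconds":5}}}}'

The same Probe fields exist under spec.zookeeper for the Zookeeper pods.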
Kubernetes Version:
1.21
Strimzi Kafka Version:
0.39
Additional Information:
We have verified that Zookeeper pods are running correctly, and the issue seems to be isolated to Kafka pods.
Any guidance or assistance on resolving this issue with the new version of Strimzi would be appreciated.
Logs:
kafka pod logs:
2025-01-26 11:01:54,321 ERROR Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer) [main]
kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING
at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:258)
at kafka.zookeeper.ZooKeeperClient.&lt;init&gt;(ZooKeeperClient.scala:116)
at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:2266)
at kafka.zk.KafkaZkClient$.createZkClient(KafkaZkClient.scala:2358)
at kafka.server.KafkaServer.initZkClient(KafkaServer.scala:658)
at kafka.server.KafkaServer.startup(KafkaServer.scala:222)
at kafka.Kafka$.main(Kafka.scala:113)
at kafka.Kafka.main(Kafka.scala)
2025-01-26 11:01:54,322 INFO shutting down (kafka.server.KafkaServer) [main]
2025-01-26 11:01:54,328 INFO App info kafka.server for 0 unregistered (org.apache.kafka.common.utils.AppInfoParser) [main]
2025-01-26 11:01:54,329 INFO shut down completed (kafka.server.KafkaServer) [main]
2025-01-26 11:01:54,329 ERROR Exiting Kafka due to fatal exception during startup. (kafka.Kafka$) [main]
kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING
at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:258)
at kafka.zookeeper.ZooKeeperClient.&lt;init&gt;(ZooKeeperClient.scala:116)
at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:2266)
at kafka.zk.KafkaZkClient$.createZkClient(KafkaZkClient.scala:2358)
at kafka.server.KafkaServer.initZkClient(KafkaServer.scala:658)
at kafka.server.KafkaServer.startup(KafkaServer.scala:222)
at kafka.Kafka$.main(Kafka.scala:113)
Zookeeper logs:
2025-01-26 11:03:36,409 WARN Cannot open channel to 1 at election address s4-cluster-zookeeper-0.s4-cluster-zookeeper-nodes.smartsafety4s-qa.svc/10.42.0.14:3888 (org.apache.zookeeper.server.quorum.QuorumCnxManager) [QuorumConnectionThread-[myid=2]-71]
java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.Net.pollConnect(Native Method)
at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:554)
at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
at java.base/java.net.Socket.connect(Socket.java:633)
at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:304)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:384)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$QuorumConnectionReqThread.run(QuorumCnxManager.java:458)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
2025-01-26 11:03:36,410 INFO Notification: my state:LOOKING; n.sid:3, n.state:LEADING, n.leader:3, n.round:0xca, n.peerEpoch:0x8c, n.zxid:0x8c00000000, message format version:0x2, n.config version:0x8c00000000 (org.apache.zookeeper.server.quorum.FastLeaderElection) [WorkerReceiver[myid=2]]
2025-01-26 11:03:36,410 INFO Oracle indicates not to follow (org.apache.zookeeper.server.quorum.FastLeaderElection) [QuorumPeer[myid=2](secure=0.0.0.0:2181)]
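The pod listing above shows s4-cluster-zookeeper-0 at 0/1 Ready, and this quorum log reports a refused connection to its election port, so that pod may be worth checking first. A minimal sketch of the checks, with names taken from the output above:

# Why is zookeeper-0 not passing readiness? Check events and recent log output
kubectl describe pod s4-cluster-zookeeper-0 -n smartsafety4s-qa
kubectl logs s4-cluster-zookeeper-0 -n smartsafety4s-qa --tail=200

# Confirm the headless service lists all three Zookeeper pods
kubectl get endpoints s4-cluster-zookeeper-nodes -n smartsafety4s-qa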
Steps to reproduce
1. Install or upgrade Strimzi Kafka to version 0.39.
2. Apply the YAML configuration for the s4-cluster Kafka and Zookeeper (see the command sketch after these steps).
3. Observe the Kafka pods getting stuck in the CrashLoopBackOff or Error state.
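A minimal command sketch for steps 2 and 3; the manifest file name is a placeholder:

# Apply the cluster definition, then check the pod status
kubectl apply -f s4-cluster-kafka.yaml -n smartsafety4s-qa
kubectl get po -n smartsafety4s-qa | grep s4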
Expected behavior
Kafka and Zookeeper pods should transition to the Running state without errors.
Strimzi version
0.39
Kubernetes version
1.21
Installation method
Strimzi Operator
Infrastructure
Bare-Metal
Configuration files and logs
No response
Additional context
No response