[Bug]: Kafka Pods in CrashLoopBackOff State - Strimzi Kafka Version 0.39 #11078
mudasirhaji started this conversation in General
-
Sharing the full configurations and logs would be a good start.
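For reference, a minimal sketch of how that material could be collected, using the cluster name and namespace from the report below; the operator deployment name and namespace are assumptions and should be adjusted to the actual installation:

# Kafka custom resource as currently applied
kubectl get kafka s4-cluster -n smartsafety4s-qa -o yaml

# Logs and events from one of the crashing broker pods (--previous shows the last crashed container)
kubectl logs s4-cluster-kafka-0 -n smartsafety4s-qa --previous
kubectl describe pod s4-cluster-kafka-0 -n smartsafety4s-qa

# Cluster operator logs (deployment name and namespace assumed)
kubectl logs deployment/strimzi-cluster-operator -n strimzi-operator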
-
Bug Description
We are facing an issue with the Kafka pods in the s4-cluster (namespace smartsafety4s-qa). The Kafka pods are in a CrashLoopBackOff or Error state, preventing the Kafka cluster from starting up. The issue started occurring after upgrading to Strimzi Kafka version 0.39.
Current Pod Status:
kubectl get po -n smartsafety4s-qa | grep s4
s4-cluster-kafka-0 0/1 CrashLoopBackOff 10 28m
s4-cluster-kafka-1 0/1 CrashLoopBackOff 14 32m
s4-cluster-kafka-2 0/1 Error 14 31m
s4-cluster-zookeeper-0 0/1 Running 0 29m
s4-cluster-zookeeper-1 1/1 Running 0 29m
s4-cluster-zookeeper-2 1/1 Running 0 29m
Attempted Configuration Fixes:
Liveness and Readiness Probes Fix: We attempted to add livenessProbe and readinessProbe configurations in the Kafka and Zookeeper templates under the containers section, following Strimzi's recommended structure (a configuration sketch follows this list).
Removed Incorrect Fields: We removed invalid fields such as spec and brokers that had been mistakenly added to the configuration file.
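As an illustration of the probe change, here is a minimal sketch of tuning the broker probes through the Kafka custom resource itself rather than the pod template; the field names follow Strimzi's Probe schema (spec.kafka.livenessProbe / spec.kafka.readinessProbe), and the values shown are placeholders only:

# Merge-patch the Kafka resource; the operator reconciles the change and rolls the broker pods
kubectl patch kafka s4-cluster -n smartsafety4s-qa --type merge \
  -p '{"spec":{"kafka":{"livenessProbe":{"initialDelaySeconds":15,"timeoutSeconds":5},"readinessProbe":{"initialDelaySeconds":15,"timeoutSeconds":5}}}}'

The same Probe fields exist under spec.zookeeper for the Zookeeper pods.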
Kubernetes Version:
1.21
Strimzi Kafka Version:
0.39
Additional Information:
We have verified that Zookeeper pods are running correctly, and the issue seems to be isolated to Kafka pods.
Any guidance or assistance on resolving this issue with the new version of Strimzi would be appreciated.
Logs:
kafka pod logs:
2025-01-26 11:01:54,321 ERROR Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer) [main]
kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING
at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:258)
at kafka.zookeeper.ZooKeeperClient.&lt;init&gt;(ZooKeeperClient.scala:116)
at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:2266)
at kafka.zk.KafkaZkClient$.createZkClient(KafkaZkClient.scala:2358)
at kafka.server.KafkaServer.initZkClient(KafkaServer.scala:658)
at kafka.server.KafkaServer.startup(KafkaServer.scala:222)
at kafka.Kafka$.main(Kafka.scala:113)
at kafka.Kafka.main(Kafka.scala)
2025-01-26 11:01:54,322 INFO shutting down (kafka.server.KafkaServer) [main]
2025-01-26 11:01:54,328 INFO App info kafka.server for 0 unregistered (org.apache.kafka.common.utils.AppInfoParser) [main]
2025-01-26 11:01:54,329 INFO shut down completed (kafka.server.KafkaServer) [main]
2025-01-26 11:01:54,329 ERROR Exiting Kafka due to fatal exception during startup. (kafka.Kafka$) [main]
kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING
at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:258)
at kafka.zookeeper.ZooKeeperClient.&lt;init&gt;(ZooKeeperClient.scala:116)
at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:2266)
at kafka.zk.KafkaZkClient$.createZkClient(KafkaZkClient.scala:2358)
at kafka.server.KafkaServer.initZkClient(KafkaServer.scala:658)
at kafka.server.KafkaServer.startup(KafkaServer.scala:222)
at kafka.Kafka$.main(Kafka.scala:113)
Zookeeper logs:
2025-01-26 11:03:36,409 WARN Cannot open channel to 1 at election address s4-cluster-zookeeper-0.s4-cluster-zookeeper-nodes.smartsafety4s-qa.svc/10.42.0.14:3888 (org.apache.zookeeper.server.quorum.QuorumCnxManager) [QuorumConnectionThread-[myid=2]-71]
java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.Net.pollConnect(Native Method)
at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:554)
at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
at java.base/java.net.Socket.connect(Socket.java:633)
at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:304)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:384)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$QuorumConnectionReqThread.run(QuorumCnxManager.java:458)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
2025-01-26 11:03:36,410 INFO Notification: my state:LOOKING; n.sid:3, n.state:LEADING, n.leader:3, n.round:0xca, n.peerEpoch:0x8c, n.zxid:0x8c00000000, message format version:0x2, n.config version:0x8c00000000 (org.apache.zookeeper.server.quorum.FastLeaderElection) [WorkerReceiver[myid=2]]
2025-01-26 11:03:36,410 INFO Oracle indicates not to follow (org.apache.zookeeper.server.quorum.FastLeaderElection) [QuorumPeer[myid=2](secure=0.0.0.0:2181)]
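The pod listing above shows s4-cluster-zookeeper-0 at 0/1 Ready, and this quorum log reports a refused connection to its election port, so that pod may be worth checking first. A minimal sketch of the checks, with names taken from the output above:

# Why is zookeeper-0 not passing readiness? Check events and recent log output
kubectl describe pod s4-cluster-zookeeper-0 -n smartsafety4s-qa
kubectl logs s4-cluster-zookeeper-0 -n smartsafety4s-qa --tail=200

# Confirm the headless service lists all three Zookeeper pods
kubectl get endpoints s4-cluster-zookeeper-nodes -n smartsafety4s-qa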
Steps to reproduce
1. Install or upgrade Strimzi Kafka to version 0.39.
2. Apply the YAML configuration for the s4-cluster Kafka and Zookeeper (see the command sketch after these steps).
3. Observe the Kafka pods getting stuck in the CrashLoopBackOff or Error state.
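A minimal command sketch for steps 2 and 3; the manifest file name is a placeholder:

# Apply the cluster definition, then check the pod status
kubectl apply -f s4-cluster-kafka.yaml -n smartsafety4s-qa
kubectl get po -n smartsafety4s-qa | grep s4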
Expected behavior
Kafka and Zookeeper pods should transition to the Running state without errors.
Strimzi version
0.39
Kubernetes version
1.21
Installation method
Strimzi Operator
Infrastructure
Bare-Metal
Configuration files and logs
No response
Additional context
No response