Brokers became unreachable during CA rotation #11096

mbrembilla · 2025-01-30T08:26:13Z

Hello,
I’m running a Kafka cluster on Kubernetes using Strimzi, and I recently encountered a disruption during a CA rotation.
Here’s a quick summary of the issue:
• My cluster operator started to update the CA certificates for ZooKeeper.
• ZooKeeper restarted and updated its certificates without apparent errors.
• Right after ZooKeeper was updated, all my Kafka brokers became unreachable for clients, causing a service outage.
• The outage lasted until all brokers finished restarting with the new certificates.

My expectation was that Strimzi’s rolling upgrade/rotation process would prevent a full outage, but that didn’t happen. Instead, it looks like the system became unreachable to my clients during the entire transition period.

I would really appreciate help figuring out what might have gone wrong and how to avoid it in the future. Specifically:
1. Is this an expected scenario if the CA rotation steps aren’t followed in a certain order?
2. Could there be a misconfiguration or a known issue that prevents Strimzi’s rolling restart from maintaining client connectivity?
3. Are there recommended best practices or prerequisites I should verify before triggering a CA rotation?

Strimzi version: 0.39
Kafka version: 3.6.1
Kubernetes version: v1.31.4-eks-2d5f260

If you need additional details, please let me know:
• Which logs would you like to see? (Operator logs, broker logs, ZooKeeper logs, etc.)
• What exact configurations or CRD details can I provide (Kafka CR, KafkaUser CR, etc.)?
• Any information about Strimzi, Kafka or Kubernetes environment specifics that might be relevant?

I’d be happy to share any data or logs that could help diagnose the problem. Thank you in advance for your assistance; I’m looking forward to any suggestions or insights you can provide.

Regards,
Mauro

scholzj · 2025-01-30T09:26:32Z

Maintainer

Why did the brokers become unreachable for clients? What exactly does that even mean? What were the errors, etc.? Without full logs from all the components and all configurations, nobody will be able to give you any answers.

3 replies

mbrembilla Jan 30, 2025
Author

I am deeply sorry for being unclear and for not including enough data and information in my request.

The architecture, in addition to Kafka deployed with Strimzi, includes a series of Kafka Streams applications for message processing, a Kafka Rest Proxy that serves as the interface through which clients submit messages, and a Schema Registry for message validation.

When the certificate update began, all components connecting to the cluster and also the brokers started logging connection errors.

Attached, you can find some files containing excerpts of logs from certain services with the encountered errors. If this is not sufficient, please let me know how many and which logs you need to get the information you require.
kafka-kafka-4.log
outbound-stream.log
rest-proxy.log
zookeeper.log

Apologies again for the unclear ticket.

scholzj Jan 30, 2025
Maintainer

Well, the logs are clearly incomplete + the cluster operator log is missing. You really need to share everything:

• The Kafka custom resource + any related custom resources.
• The full logs from all the components -> all ZooKeeper nodes, all the Kafka nodes, the operators + ideally from some simple application using the Kafka Java client that was affected by this (the Kafka Streams or Rest Proxy logs might do if they are complete; I cannot tell from the snippets). Where applicable, the logs should cover both before and after the rolling update of the components that have already been rolled.
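
For anyone gathering the same material, here is a rough sketch of how the requested items could be collected with `kubectl`. It is only an illustration, not an official Strimzi procedure. The cluster name `kafka` is inferred from the attached `kafka-kafka-4.log`; the namespace, the replica counts, and the `strimzi-cluster-operator` Deployment name (the default in the standard installation files) are assumptions, and the pod names assume the classic `<cluster>-kafka-<n>` / `<cluster>-zookeeper-<n>` naming without node pools. The helper `diagnostic_commands` is hypothetical and only prints the commands so they can be reviewed before being run.

```python
from typing import List


def diagnostic_commands(cluster: str, namespace: str, operator_namespace: str,
                        kafka_replicas: int, zookeeper_replicas: int) -> List[str]:
    """Build kubectl commands that dump the Kafka CR and the component logs."""
    cmds = [
        # The Kafka custom resource as currently stored in the API server.
        f"kubectl get kafka {cluster} -n {namespace} -o yaml > {cluster}-kafka-cr.yaml",
        # Cluster operator log; the Deployment name is an assumption (default install).
        f"kubectl logs deployment/strimzi-cluster-operator -n {operator_namespace}"
        " > cluster-operator.log",
        # Entity operator (topic and user operators), if it is deployed.
        f"kubectl logs deployment/{cluster}-entity-operator -n {namespace}"
        " --all-containers > entity-operator.log",
    ]
    # One log file per ZooKeeper and Kafka pod, assuming non-node-pool pod names.
    for component, replicas in (("zookeeper", zookeeper_replicas),
                                ("kafka", kafka_replicas)):
        for i in range(replicas):
            pod = f"{cluster}-{component}-{i}"
            cmds.append(f"kubectl logs {pod} -n {namespace} > {pod}.log")
    return cmds


if __name__ == "__main__":
    # Assumed values: namespace "kafka", 5 brokers, 3 ZooKeeper nodes.
    for cmd in diagnostic_commands("kafka", "kafka", "kafka",
                                   kafka_replicas=5, zookeeper_replicas=3):
        print(cmd)
```

Note that `kubectl logs` only returns output from pods that still exist. Once a pod has been deleted and recreated during the rolling update, its earlier log has to come from whatever log aggregation the environment has in place.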
