
[Bug]: Kafka operator becomes dysfunctional after setting controller.quorum.fetch.timeout.ms to -1 #11084

Open
kos-team opened this issue Jan 28, 2025 · 5 comments

Comments

@kos-team

Bug Description

The Kafka operator becomes stuck after the configuration value controller.quorum.fetch.timeout.ms is set to -1. The root cause appears to be that the operator uses this configuration value to check the health of the controller nodes, and since it is set to -1, the check immediately times out.

Steps to reproduce

  1. Deploy the Kafka operator, node pool, and the Kafka CR
  2. Change the spec.kafka.config by setting controller.quorum.fetch.timeout.ms to -1

Expected behavior

The Kafka operator should reject the invalid configuration value of controller.quorum.fetch.timeout.ms and remain functional

Strimzi version

quay.io/strimzi/operator:0.45.0

Kubernetes version

v1.28.0

Installation method

YAML

Infrastructure

kind v0.21.0 go1.22.6 linux/amd64

Configuration files and logs

  1. Initial Kafka CR file
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  annotations:
    strimzi.io/kraft: enabled
    strimzi.io/node-pools: enabled
  name: test-cluster
spec:
  entityOperator:
    topicOperator: {}
    userOperator: {}
  kafka:
    config:
      default.replication.factor: 3
      min.insync.replicas: 2
      offsets.topic.replication.factor: 3
      transaction.state.log.min.isr: 2
      transaction.state.log.replication.factor: 3
    listeners:
    - name: plain
      port: 9092
      tls: false
      type: internal
    - name: tls
      port: 9093
      tls: true
      type: internal
    metadataVersion: 3.9-IV0
    version: 3.9.0
  2. Kafka CR file that sets controller.quorum.fetch.timeout.ms to -1
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  annotations:
    strimzi.io/kraft: enabled
    strimzi.io/node-pools: enabled
  name: test-cluster
spec:
  entityOperator:
    topicOperator: {}
    userOperator: {}
  kafka:
    config:
      controller.quorum.fetch.timeout.ms: -1
      default.replication.factor: 3
      min.insync.replicas: 2
      offsets.topic.replication.factor: 3
      transaction.state.log.min.isr: 2
      transaction.state.log.replication.factor: 3
    listeners:
    - name: plain
      port: 9092
      tls: false
      type: internal
    - name: tls
      port: 9093
      tls: true
      type: internal
    metadataVersion: 3.9-IV0
    version: 3.9.0

Additional context

2025-01-28 01:08:10 INFO  KafkaRoller:673 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Pod test-cluster-dual-role-2/2 needs to be restarted, dynamic update cannot be done.
2025-01-28 01:08:10 DEBUG KafkaQuorumCheck:62 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Determining the controller quorum leader id
2025-01-28 01:08:10 DEBUG KafkaRoller:970 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): KRaft active controller is 0
2025-01-28 01:08:10 DEBUG KafkaQuorumCheck:45 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Determining whether controller pod 2 can be rolled
2025-01-28 01:08:10 DEBUG KafkaQuorumCheck:93 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): The lastCaughtUpTimestamp for the controller quorum leader (node id 0) is 1738026490044
2025-01-28 01:08:10 DEBUG KafkaQuorumCheck:101 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): The lastCaughtUpTimestamp for controller 0 is 1738026490044
2025-01-28 01:08:10 DEBUG KafkaQuorumCheck:101 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): The lastCaughtUpTimestamp for controller 1 is 1738026489860
2025-01-28 01:08:10 DEBUG KafkaQuorumCheck:110 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Controller 1 has fallen behind the controller quorum leader
2025-01-28 01:08:10 DEBUG KafkaQuorumCheck:101 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): The lastCaughtUpTimestamp for controller 2 is 1738026489860
2025-01-28 01:08:10 DEBUG KafkaQuorumCheck:110 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Controller 2 has fallen behind the controller quorum leader
2025-01-28 01:08:10 DEBUG KafkaQuorumCheck:116 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Out of 3 controllers, there are 1 that have caught up with the controller quorum leader, not including controller 2
2025-01-28 01:08:10 DEBUG KafkaQuorumCheck:49 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Not restarting controller pod 2. Restart would affect the quorum health
2025-01-28 01:08:10 DEBUG KafkaAvailability:62 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Determining whether broker 2 can be rolled
2025-01-28 01:08:10 DEBUG KafkaAvailability:216 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Got 0 topic names
2025-01-28 01:08:10 DEBUG KafkaAvailability:51 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Got 0 topic names
2025-01-28 01:08:10 DEBUG KafkaAvailability:202 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Got topic descriptions for 0 topics
2025-01-28 01:08:10 DEBUG KafkaAvailability:69 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Got 0 topic descriptions
2025-01-28 01:08:10 DEBUG KafkaAvailability:161 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Getting topic configs for 0 topics
2025-01-28 01:08:10 DEBUG KafkaAvailability:170 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Got topic configs for 0 topics
2025-01-28 01:08:10 DEBUG KafkaRoller:468 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Pod test-cluster-dual-role-2/2 cannot be updated right now
2025-01-28 01:08:10 INFO  KafkaRoller:388 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Will temporarily skip verifying pod test-cluster-dual-role-2/2 is up-to-date due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Pod test-cluster-dual-role-2 cannot be updated right now., retrying after at least 8000ms
@scholzj
Member

scholzj commented Jan 28, 2025

I guess you can unset it? I'm not sure there is much Strimzi can do with a configuration like this. We could block it from being configurable, but that might prevent you from tuning this option at all.

@kos-team
Author

@scholzj Thanks for the prompt response. Yes, we were able to recover by reverting the change, and the operator recovered after about 10 minutes.

In terms of rejecting the invalid value, is there a way for the Kafka operator to reject the configuration only when it is set to a negative value, instead of blocking the option entirely?

@scholzj
Member

scholzj commented Jan 28, 2025

I think there are two separate issues.

The first one, which you see here, is that when you set it to -1 the operator has no way to determine whether a controller is up to date. Kafka does not expose any state that indicates whether a controller is in-sync, so the operator has to calculate it using this value.
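
For illustration, here is a minimal, self-contained sketch (not Strimzi's actual code; the class and method names are made up) of a timestamp-based "caught up" check like the one visible in the KafkaQuorumCheck log lines above. Since a follower's lastCaughtUpTimestamp is at best equal to the leader's, the difference is never negative, so a timeout of -1 classifies every follower as having fallen behind and no controller pod can ever be rolled:

```java
public class QuorumCheckSketch {

    // A follower counts as caught up if its lastCaughtUpTimestamp is within
    // the configured fetch timeout of the leader's lastCaughtUpTimestamp.
    static boolean isCaughtUp(long leaderLastCaughtUpMs,
                              long followerLastCaughtUpMs,
                              long fetchTimeoutMs) {
        return leaderLastCaughtUpMs - followerLastCaughtUpMs < fetchTimeoutMs;
    }

    public static void main(String[] args) {
        // Timestamps taken from the log excerpt in the issue description
        long leader = 1738026490044L;
        long follower = 1738026489860L;

        System.out.println(isCaughtUp(leader, follower, 2000)); // true  (Kafka's default of 2000 ms)
        System.out.println(isCaughtUp(leader, follower, -1));   // false (every follower looks "behind")
    }
}
```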

The second is whether -1 is a valid value for Kafka at all. TBH, I do not know. The way we validate the configuration is based on the Kafka configuration model, and that does not seem to give any indication of whether -1 is or is not valid. Although, to be honest, I struggle to imagine what 0, -1, or other negative numbers would mean in the context of this option.

In general, I'm not sure Strimzi can easily handle something like this, or whether it should. There are many ways you can break your Kafka cluster through these options. The expectation is that when you configure options like these, you know what you are doing and test your settings properly. But let's see what others think when the issue is triaged.

@kos-team
Author

Thanks for the nice breakdown; the points accurately describe the issues.
I think it comes down to two points:

  1. the value of the controller.quorum.fetch.timeout.ms configuration is used by Strimzi itself to determine whether a controller is in-sync, and
  2. upstream Kafka does not define a valid range for controller.quorum.fetch.timeout.ms.

Since Strimzi is not only propagating this configuration to Kafka but also relying on it to manage Kafka, controller.quorum.fetch.timeout.ms can be considered a configuration of Strimzi itself. So I think it makes sense for Strimzi to validate the value of controller.quorum.fetch.timeout.ms on its own, without depending on upstream Kafka's validation. In this case, any value less than or equal to 0 could be treated as invalid and rejected by Strimzi.
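
A hypothetical sketch of the kind of validation being suggested (the class, the method, and where such a check would live in the operator are assumptions, not existing Strimzi code):

```java
import java.util.Map;

public class QuorumTimeoutValidationSketch {

    static final String KEY = "controller.quorum.fetch.timeout.ms";

    // Reject a non-positive timeout before it is applied, since the operator
    // itself relies on this value for its quorum health check.
    static void validate(Map<String, Object> kafkaConfig) {
        Object raw = kafkaConfig.get(KEY);
        if (raw == null) {
            return; // not set, Kafka's default applies
        }
        long value = Long.parseLong(raw.toString());
        if (value <= 0) {
            throw new IllegalArgumentException(
                KEY + " must be a positive number of milliseconds, got " + value);
        }
    }

    public static void main(String[] args) {
        validate(Map.of(KEY, "2000")); // passes
        validate(Map.of(KEY, "-1"));   // throws IllegalArgumentException
    }
}
```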

@scholzj
Member

scholzj commented Jan 30, 2025

As I said, it will be triaged ... but I think the complexity of the implementation would be much higher than the benefit of it.
