
[Bug]: Kafka operator becomes dysfunctional after setting controller.quorum.fetch.timeout.ms to -1 #11084

Open
kos-team opened this issue Jan 28, 2025 · 5 comments

Comments

@kos-team

Bug Description

The Kafka operator becomes stuck after the configuration value controller.quorum.fetch.timeout.ms is set to -1. The root cause appears to be that the operator uses this configuration value to check the health of the controller nodes, and since it is set to -1, the check immediately times out.

Steps to reproduce

  1. Deploy the Kafka operator, node pool, and the Kafka CR
  2. Change the spec.kafka.config by setting controller.quorum.fetch.timeout.ms to -1

Expected behavior

The Kafka operator should reject the invalid configuration value of controller.quorum.fetch.timeout.ms and remain functional

Strimzi version

quay.io/strimzi/operator:0.45.0

Kubernetes version

v1.28.0

Installation method

YAML

Infrastructure

kind v0.21.0 go1.22.6 linux/amd64

Configuration files and logs

  1. Initial Kafka CR file
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  annotations:
    strimzi.io/kraft: enabled
    strimzi.io/node-pools: enabled
  name: test-cluster
spec:
  entityOperator:
    topicOperator: {}
    userOperator: {}
  kafka:
    config:
      default.replication.factor: 3
      min.insync.replicas: 2
      offsets.topic.replication.factor: 3
      transaction.state.log.min.isr: 2
      transaction.state.log.replication.factor: 3
    listeners:
    - name: plain
      port: 9092
      tls: false
      type: internal
    - name: tls
      port: 9093
      tls: true
      type: internal
    metadataVersion: 3.9-IV0
    version: 3.9.0
  2. Kafka CR file that sets controller.quorum.fetch.timeout.ms to -1
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  annotations:
    strimzi.io/kraft: enabled
    strimzi.io/node-pools: enabled
  name: test-cluster
spec:
  entityOperator:
    topicOperator: {}
    userOperator: {}
  kafka:
    config:
      controller.quorum.fetch.timeout.ms: -1
      default.replication.factor: 3
      min.insync.replicas: 2
      offsets.topic.replication.factor: 3
      transaction.state.log.min.isr: 2
      transaction.state.log.replication.factor: 3
    listeners:
    - name: plain
      port: 9092
      tls: false
      type: internal
    - name: tls
      port: 9093
      tls: true
      type: internal
    metadataVersion: 3.9-IV0
    version: 3.9.0

Additional context

2025-01-28 01:08:10 INFO  KafkaRoller:673 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Pod test-cluster-dual-role-2/2 needs to be restarted, dynamic update cannot be done.
2025-01-28 01:08:10 DEBUG KafkaQuorumCheck:62 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Determining the controller quorum leader id
2025-01-28 01:08:10 DEBUG KafkaRoller:970 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): KRaft active controller is 0
2025-01-28 01:08:10 DEBUG KafkaQuorumCheck:45 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Determining whether controller pod 2 can be rolled
2025-01-28 01:08:10 DEBUG KafkaQuorumCheck:93 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): The lastCaughtUpTimestamp for the controller quorum leader (node id 0) is 1738026490044
2025-01-28 01:08:10 DEBUG KafkaQuorumCheck:101 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): The lastCaughtUpTimestamp for controller 0 is 1738026490044
2025-01-28 01:08:10 DEBUG KafkaQuorumCheck:101 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): The lastCaughtUpTimestamp for controller 1 is 1738026489860
2025-01-28 01:08:10 DEBUG KafkaQuorumCheck:110 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Controller 1 has fallen behind the controller quorum leader
2025-01-28 01:08:10 DEBUG KafkaQuorumCheck:101 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): The lastCaughtUpTimestamp for controller 2 is 1738026489860
2025-01-28 01:08:10 DEBUG KafkaQuorumCheck:110 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Controller 2 has fallen behind the controller quorum leader
2025-01-28 01:08:10 DEBUG KafkaQuorumCheck:116 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Out of 3 controllers, there are 1 that have caught up with the controller quorum leader, not including controller 2
2025-01-28 01:08:10 DEBUG KafkaQuorumCheck:49 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Not restarting controller pod 2. Restart would affect the quorum health
2025-01-28 01:08:10 DEBUG KafkaAvailability:62 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Determining whether broker 2 can be rolled
2025-01-28 01:08:10 DEBUG KafkaAvailability:216 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Got 0 topic names
2025-01-28 01:08:10 DEBUG KafkaAvailability:51 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Got 0 topic names
2025-01-28 01:08:10 DEBUG KafkaAvailability:202 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Got topic descriptions for 0 topics
2025-01-28 01:08:10 DEBUG KafkaAvailability:69 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Got 0 topic descriptions
2025-01-28 01:08:10 DEBUG KafkaAvailability:161 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Getting topic configs for 0 topics
2025-01-28 01:08:10 DEBUG KafkaAvailability:170 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Got topic configs for 0 topics
2025-01-28 01:08:10 DEBUG KafkaRoller:468 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Pod test-cluster-dual-role-2/2 cannot be updated right now
2025-01-28 01:08:10 INFO  KafkaRoller:388 - Reconciliation #5(watch) Kafka(acto-namespace/test-cluster): Will temporarily skip verifying pod test-cluster-dual-role-2/2 is up-to-date due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Pod test-cluster-dual-role-2 cannot be updated right now., retrying after at least 8000ms
@scholzj
Member

scholzj commented Jan 28, 2025

I guess you can unset it? I'm not sure there is much Strimzi can do with a configuration like this. We could block it from being configurable, but that might prevent you from tuning this option at all.

@kos-team
Author

@scholzj Thanks for the prompt response. Yes, we were able to recover by reverting the change, and the operator recovered after about 10 minutes.

In terms of rejecting the invalid value, is there a way for the Kafka operator to reject the configuration only when it is set to a negative value, instead of blocking the option entirely?

@scholzj
Member

scholzj commented Jan 28, 2025

I think there are two separate issues.

The first one, which you see here, is that when you set it to -1 the operator has no way to determine whether a controller is up to date. Kafka does not expose any state that indicates whether a controller is in-sync, so the operator has to calculate it using this value.
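
For illustration, here is a minimal, self-contained sketch (not Strimzi's actual code; the class and method names are made up) of a timestamp-based "caught up" check like the one visible in the KafkaQuorumCheck log lines above. Since a follower's lastCaughtUpTimestamp is at best equal to the leader's, the difference is never negative, so a timeout of -1 classifies every follower as having fallen behind and no controller pod can ever be rolled:

```java
public class QuorumCheckSketch {

    // A follower counts as caught up if its lastCaughtUpTimestamp is within
    // the configured fetch timeout of the leader's lastCaughtUpTimestamp.
    static boolean isCaughtUp(long leaderLastCaughtUpMs,
                              long followerLastCaughtUpMs,
                              long fetchTimeoutMs) {
        return leaderLastCaughtUpMs - followerLastCaughtUpMs < fetchTimeoutMs;
    }

    public static void main(String[] args) {
        // Timestamps taken from the log excerpt in the issue description
        long leader = 1738026490044L;
        long follower = 1738026489860L;

        System.out.println(isCaughtUp(leader, follower, 2000)); // true  (Kafka's default of 2000 ms)
        System.out.println(isCaughtUp(leader, follower, -1));   // false (every follower looks "behind")
    }
}
```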

The second is whether -1 is a valid value for Kafka at all. TBH, I do not know. The way we validate the configuration is based on the Kafka configuration model, and that does not seem to give any indication of whether -1 is or is not valid. Although, to be honest, I struggle to imagine what 0, -1, or other negative numbers would mean in the context of this option.

In general, I'm not sure Strimzi can easily handle something like this, or whether it should. There are many ways you can break your Kafka cluster through these options. The expectation is that when you configure options like these, you know what you are doing and test your settings properly. But let's see what others think when the issue is triaged.

@kos-team
Author

Thanks for the nice breakdown; the points accurately describe the issues.
I think it comes down to two points:

  1. the value of the controller.quorum.fetch.timeout.ms configuration is used by Strimzi itself to determine whether a controller is in-sync, and
  2. upstream Kafka does not define a valid range for controller.quorum.fetch.timeout.ms.

Since Strimzi is not only propagating this configuration to Kafka but also relying on it to manage Kafka, controller.quorum.fetch.timeout.ms can be considered a configuration of Strimzi itself. So I think it makes sense for Strimzi to validate the value of controller.quorum.fetch.timeout.ms on its own, without depending on upstream Kafka's validation. In this case, any value less than or equal to 0 could be treated as invalid and rejected by Strimzi.
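
A hypothetical sketch of the kind of validation being suggested (the class, the method, and where such a check would live in the operator are assumptions, not existing Strimzi code):

```java
import java.util.Map;

public class QuorumTimeoutValidationSketch {

    static final String KEY = "controller.quorum.fetch.timeout.ms";

    // Reject a non-positive timeout before it is applied, since the operator
    // itself relies on this value for its quorum health check.
    static void validate(Map<String, Object> kafkaConfig) {
        Object raw = kafkaConfig.get(KEY);
        if (raw == null) {
            return; // not set, Kafka's default applies
        }
        long value = Long.parseLong(raw.toString());
        if (value <= 0) {
            throw new IllegalArgumentException(
                KEY + " must be a positive number of milliseconds, got " + value);
        }
    }

    public static void main(String[] args) {
        validate(Map.of(KEY, "2000")); // passes
        validate(Map.of(KEY, "-1"));   // throws IllegalArgumentException
    }
}
```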

@scholzj
Member

scholzj commented Jan 30, 2025

As I said, it will be triaged ... but I think the complexity of the implementation would be much higher than the benefit of it.
