-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the KafkaRoller issue talking to controllers #10941
base: main
Are you sure you want to change the base?
Conversation
/azp run regression |
Azure Pipelines successfully started running 1 pipeline(s). |
As discussed in the kafka call, we might need to workaround it by sending |
On second thought, I think ideally, the retry should be done in the admin client side. Opened KAFKA-18230 for the improvement. |
@showuon that's great. Thank you! |
Yes, I agree. I switched the PR to draft to implement describe cluster :) |
09fa158
to
7f2c57a
Compare
Remove version checks to simplify the changes as this change will be targeting Strimzi that only supports Kafka version 3.9+ Refactor unit tests to remove ZK based testcases and add more coverage for KRaft based clusters Signed-off-by: Gantigmaa Selenge <[email protected]>
7f2c57a
to
eea6559
Compare
I have updated the PR so that KafkaRoller uses describeCluster when retrieving active controller id. When getting quorum information, it would create an admin client with just active controller's bootstrap address. I also removed version check for Kafka version 3.9.0, since this change would be targeting a release that doesn't support older than 3.9.0. This simplifies the code much better. There are a couple of TODOs I would like to discuss with others, particularly the one for the tcpProbe for individual node since this check does not make sense anymore where it was. @scholzj @ppatierno @katheris, maybe you have thoughts on those? Of course thoughts from anyone else would also be appreciated. |
// TODO: if the active controller is returned as -1 (maybe election was still in progress), | ||
// use the admin client bootstrapped with all controller nodes to see if it can hit the active controller by luck. | ||
// If it does not hit the active controller, it will return an error and this node will be retried later anyway. | ||
// Otherwise we can just return false here, and the node will be then retried later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO to discuss:
This could cause the Warning message with NOT_LEADER_OR_FOLLOWER but both would result in retrying the node if active controller is not found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly are the situations when we expect to run into this? When there are no running brokers and only controllers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are querying the active controller from the controllers, not brokers so no running brokers should not be a problem. I think they could return -1, if there is no leader elected yet/election is still in progress? Could be also an indication of quorum being in a bad state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we cannot get the active controller (for any reasons) how much is it dangerous returning true to roll the current controller we are checking vs returning false, not rolling and waiting a better time to get the active controller? Could it be "forever" because of a bad state in the quorum?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, if we cannot get the active controller, we cannot perform the quorum check. Currently, if we cannot to perform quorum check, we do not roll the pod, it throws UnforceableProblem
. So the question, do we follow the same logic, or use the admin client bootstrapped with all controller nodes.
// having an additional purpose of checking connection. Plus, this method is now only called for a controller node | ||
// as we don't need to check the active controller when rolling brokers. That means tcp probe is now done only | ||
// for a controller node. The tcpProbe for an individual node perhaps should be done earlier in checkIfRestartOrReconfigureRequired() | ||
// as this method's purpose is to determine the reason for restart and it already does similar type of checks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO to discuss:
I think having this check here does not make sense anymore as we would not need to get the active controller when rolling broker nodes, so tcp probe only apply to controller nodes. Also note that, this check was only done, if admin client could not be created with the individual node's address. (I have temporarily commented out the unit tests for this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the TCP probe was mostly a workaround for not being able to connect to the controller directly. The question I guess is whether it should be removed or whether it should be replaced by checking the specific controller with Admin client unless the previous code already did that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So before we get to this method, in checkIfRestartOrReconfigureRequired() initialises admin client (if it hasn't been initialised) with all controller addresses. If this fails, we force restart the node. So it is not checking specific controller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the client is initialized with all controller addresses, it sounds like that does not give a reliable answer about the state of the individual node. So I think it would make sense to have somewhere some code that would connect to the node individually and see if it works? Don't we do something like that for the brokers? Does the dynamic configuration connect to individual nodes?
I think the tcp probe itself has little value. So I do not have problem removing it. But I'm wondering if it should be just removed or replaced with something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think it would make sense to have somewhere some code that would connect to the node individually and see if it works? Don't we do something like that for the brokers?
Currently we connect to individual nodes, only in this method. We could actually just keep the logic of connecting to individual nodes regardless of their role in this method actually, as the describe call would work with any node.
@@ -209,20 +207,11 @@ private boolean maybeInitBrokerAdminClient() { | |||
* Initializes controllerAdminClient if it has not been initialized yet | |||
* @return true if the creation of AC succeeded, false otherwise | |||
*/ | |||
private boolean maybeInitControllerAdminClient(String currentVersion) { | |||
private boolean maybeInitControllerAdminClient() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we want to keep the 3.8.x versions working for the time being. We in general expect Strimi 0.46 to support Kafka 3.9.0 and 4.0.0. But who knows what happens with 4.0.0 and when it will be ready. So I wonder if we should already remove the logic for supporting Kafka 3.8.0.
WDYT @ppatierno @katheris @tinaselenge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean is imagine that Kafka 4.0.0 is released only in April or May (not unimaginable in my opinion). My original plan for the 0.46.0 release has the intention to remove ZooKeeper now ... but if needed because of big delays in Kafka we could still do a KRaft-only release with 3.8 & 3.9 with some new Strimzi features. But if we remove the 3.8 logic now, that will not be possible anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, that's a good point. Having the version check complicates the dynamic configuration for controllers a little bit because we can only do it for controllers 3.9.0+. Perhaps dynamic reconfiguration for controllers should wait until things are more certain 4.0.0 and I can put back the version check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I realize that it does not make the development easier. That is also why I CCed others to it to comment.
In general
- We promise to support two last minor versions of Kafka
- I think the upgrade tests rely on it as well in some cases
So the question is ... if we just drop 3.8 support now, what is the short term impact on all of our codebase, tests, etc. And what is the long-term impact if Kafka 4.0 takes some major delays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jakub made a good point and we cannot rely on a "good early" date for Kafka 4.0.
I can also understand Tina's point making the code complicated.
Can't we postpone implementing this when the only supported versions will be 3.9 and 4.0? What's the impact if we postpone it. I guess current solution is still somehow working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we postpone this until we only support 3.9 and 4.0, this PR may not be even needed if KAFKA-18230 gets included in the 4.0 release, as this would make the Admin client handle the error internally.
I was working on a commit to revert back the 3.9 check, but it ended up requiring more refactoring and resulting in more complicated code. Because we would have to check current version of an individual node we are rolling but also check cluster's version (passed by the reconciler) to decide whether to create a separate admin client just bootstrapped with the active controller. If we create this admin client , we then also need to check the active controller's current version as it could still be running an older version, even if the cluster's version is newer or the node being rolled has a newer version. I'm not sure if I'm over complicating this in my head, but seems like we have to keep track of versions in many places.
So I'm not sure at this point, how to move forward with this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was working on a commit to revert back the 3.9 check, but it ended up requiring more refactoring and resulting in more complicated code. Because we would have to check current version of an individual node we are rolling but also check cluster's version (passed by the reconciler) to decide whether to create a separate admin client just bootstrapped with the active controller. If we create this admin client , we then also need to check the active controller's current version as it could still be running an older version, even if the cluster's version is newer or the node being rolled has a newer version. I'm not sure if I'm over complicating this in my head, but seems like we have to keep track of versions in many places.
I think it is fine to wait for KAFKA-18230. But that will be done in Kafka 4.0 only. So you would still need a check to differentiate 3.9 and 4.0. So would that mean in reality waiting for Kafka 4.1.0 release which is where we would drop support for Kafka 3.9 and all versions will support KAFKA-18230?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds right to me. What do we do about the main branch that currently needs this fix? Do we revert it like we did for 0.45?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we would revert it in that case, yeah. But maybe that is worth discussing on a wider forum such as the community call or at least making sure more people have awareness of this. CC @ppatierno @katheris
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My opinion is to revert back as we did for 0.45.0 instead of adding more complexity here about checking several Kafka versions.
Type of change
Select the type of your PR
Description
Checklist
Please go through this checklist and make sure all applicable tasks have been done