
[system test] KRaft cluster not stuck during controller quorum timeout #11031

Open
wants to merge 2 commits into main
Conversation

@see-quick (Member) commented Jan 14, 2025

Type of change

  • Enhancement / new feature

Description

This PR adds a test case which checks that the KRaft cluster does not get stuck during a controller quorum timeout.

P.S.: I am not sure whether we also want to check in the Cluster Operator logs that this warning:

2024-12-11 10:09:15 WARN  KafkaQuorumCheck:64 - Reconciliation #55(watch) Kafka(myproject/my-cluster): Error determining the controller quorum leader id
org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeMetadataQuorum

does not exist anymore after the RollingUpdate etc. (but IMO, if the RollingUpdate happens, that is proof that such a bug is not present).
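
If we do want that check, here is a rough sketch of how it could look (illustrative only, not part of this PR), using the fabric8 client directly; the namespace and the Cluster Operator pod label are assumptions:

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class ClusterOperatorLogCheck {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Find the Cluster Operator pod; the label selector is an assumption, adjust to the actual deployment.
            String coPodName = client.pods()
                .inNamespace("myproject")
                .withLabel("name", "strimzi-cluster-operator")
                .list().getItems().get(0)
                .getMetadata().getName();

            // Grab the current Cluster Operator log.
            String coLog = client.pods()
                .inNamespace("myproject")
                .withName(coPodName)
                .getLog();

            // The WARN line from the reproducer; after a successful rolling update it should be gone.
            boolean quorumTimeoutPresent = coLog.contains("Error determining the controller quorum leader id");
            System.out.println("Quorum timeout warning still present: " + quorumTimeoutPresent);
        }
    }
}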

Checklist

  • Write tests
  • Make sure all tests pass
  • Update documentation

@see-quick added this to the 0.46.0 milestone Jan 14, 2025
@see-quick self-assigned this Jan 14, 2025
@see-quick requested review from a team January 14, 2025 10:59
@ppatierno (Member)

Trying to understand the purpose of the test. Isn't it just making a configuration change and waiting for all pods to roll while the cluster is still functioning? Isn't that something already covered by most of the tests? I mean, where are we trying to drive the cluster into a timeout while checking the controller quorum, and then checking that the operator doesn't get stuck on it?

@see-quick (Member Author) commented Jan 14, 2025

> Trying to understand the purpose of the test. Isn't it just making a configuration change and waiting for all pods to roll while the cluster is still functioning? Isn't that something already covered by most of the tests? I mean, where are we trying to drive the cluster into a timeout while checking the controller quorum, and then checking that the operator doesn't get stuck on it?

I don't know if we do exactly this with dedicated controllers. But as mentioned here [1], I think the reproducer was:

1. Deploy a KRaft cluster with dedicated controllers
2. Once deployed, modify the Kafka CR to change a value that requires a rolling update - e.g. auto.create.topics.enable (see the sketch below)
3. Wait for the rolling update
4. The CO will get stuck with this:
2024-12-11 10:09:15 WARN  KafkaQuorumCheck:64 - Reconciliation #55(watch) Kafka(myproject/my-cluster): Error determining the controller quorum leader id
org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeMetadataQuorum
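
For illustration, here is a minimal sketch of step 2, the config change that forces the rolling update (assuming the Strimzi api module's Crds helper and the cluster/namespace names from the log above; the actual system test uses its own resource helpers):

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.strimzi.api.kafka.Crds;

public class TriggerControllerRoll {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Step 2 of the reproducer: change a broker-level option so the operator has to roll
            // all nodes, including the dedicated controllers.
            Crds.kafkaOperation(client)
                .inNamespace("myproject")
                .withName("my-cluster")
                .edit(kafka -> {
                    // Assumes .spec.kafka.config is already set in the Kafka CR.
                    kafka.getSpec().getKafka().getConfig().put("auto.create.topics.enable", false);
                    return kafka;
                });
            // Steps 3-4: wait for the rolling update and watch whether the CO gets stuck
            // (the system test would use its own rolling-update utilities for that).
        }
    }
}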

I mean, we can also try to fetch the Cluster Operator logs, but based on our past experience I'm not sure how reliable that would be. Maybe @scholzj has more info about this?

[1] - #10940

Signed-off-by: see-quick <[email protected]>
@ppatierno (Member)

> I don't know if we do exactly this with dedicated controllers.

Are you saying that all our STs are using mixed nodes?

Looking at #10940, the problem there was a missing network policy, so if the problem were still in place I would expect a lot of tests failing today because the CO would not be able to talk directly to the controllers.

@im-konge (Member)

> Are you saying that all our STs are using mixed nodes?

No, we are using separate roles in the STs.

> I would expect a lot of tests failing today because the CO would not be able to talk directly to the controllers.

I think we don't have an env with enforced network policies, as the tests were passing for Tina's PR anyway.

@scholzj (Member) commented Jan 14, 2025

@see-quick @ppatierno I think having a test for this would be good. And I think the test is mostly good - but you are framing it in a weird way. We do not really want to test for the past bugs that we fixed. We want to test for the features we have. So what we want to test is that the controller-only nodes are rolled properly. So I guess you should, for example (see the outline below):

  • Name it something like testKafkaRollingUpdatesWithDedicatedControllers or something
  • Update the log message
    LOGGER.info("Modifying Kafka CR to enable auto.create.topics.enable=false, expecting rolling update without CO getting stuck.");
    
    to something like
    LOGGER.info("Modifying Kafka CR to enable auto.create.topics.enable=false, expecting rolling update of all nodes including controllers.");
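
Put together, a hypothetical outline of what I mean (the method name and log message are just the suggestions above; the body is a placeholder):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical outline only; the real test lives in the system-test framework and uses its helpers.
class RollingUpdateST {
    private static final Logger LOGGER = LoggerFactory.getLogger(RollingUpdateST.class);

    void testKafkaRollingUpdatesWithDedicatedControllers() {
        LOGGER.info("Modifying Kafka CR to enable auto.create.topics.enable=false, expecting rolling update of all nodes including controllers.");
        // 1. modify the Kafka CR
        // 2. wait for all node pools, including the dedicated controllers, to roll
        // 3. verify the cluster is healthy and the controller quorum has a leader
    }
}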
    

@ppatierno (Member)

> No, we are using separate roles in the STs.

Which is good news :-)

> I think we don't have an env with enforced network policies, as the tests were passing for Tina's PR anyway.

So AFAIU (but I could be wrong), it looks like even the test in this PR could be useless, in the sense that if it had been in place before the bug, it wouldn't have caught it because of the missing network policy enforcement.
At this point, should we have some tests with network policy enforcement instead?

@ppatierno (Member)

@scholzj I see the last sentence got cut? "The reason for that is that ...." :-)

@scholzj (Member) commented Jan 14, 2025

@ppatierno that was just leftover text that I didn't delete.

@see-quick (Member Author)

> At this point, should we have some tests with network policy enforcement instead?

I can do it if we agree that's the best approach.

@Frawless (Member)

> At this point, should we have some tests with network policy enforcement instead?

I think we already have tests for that. The problem is that our minikube env on Azure and kind on TF don't have NP enforcement enabled. If we want to improve this, we should go that way and configure it at the env level.

@im-konge (Member)

> I can do it if we agree that's the best approach.

I think you will have to enable NP enforcement on minikube with some plugin or similar for the AZPs, as I think minikube doesn't have it enabled by default. For the Jenkins and OCP pipelines, the enforcement is there, as the tests failed there IIRC.

Also, I'm not sure if checking the NPs makes sense after the fix - maybe it makes sense to check it in some regression pipelines.
Having a test that checks whether the controller nodes were correctly rolled makes sense to me -> we had similar tests for ZooKeeper, but I think we didn't have such a test for KRaft controllers.

@ppatierno (Member)

> We do not really want to test for the past bugs that we fixed. We want to test for the features we have. So what we want to test is that the controller-only nodes are rolled properly.

If we aim to test just that the controller-only nodes are rolled, then I am fine with it. If we are aiming to check #10940, then no, this test doesn't really test it without NP enforcement.

5 participants