[system test] KRaft cluster not stuck during controller quorum timeout #11031
base: main
Conversation
Signed-off-by: see-quick <[email protected]>
Trying to understand the purpose of the test. Isn't it just making a configuration change and waiting for all pods to roll while the cluster is still functioning? Isn't that something already covered by most of the tests? I mean, are we trying to drive a timeout while checking the controller quorum, and then checking that the operator doesn't get stuck on it?
I don't know if we do exactly this with dedicated controllers. But as mentioned here [1], I think the reproducer was:

2024-12-11 10:09:15 WARN KafkaQuorumCheck:64 - Reconciliation #55(watch) Kafka(myproject/my-cluster): Error determining the controller quorum leader id
org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeMetadataQuorum

I mean, we could also try to fetch the Cluster Operator logs, but I'm not sure how reliable that would be compared to our past experience. Maybe @scholzj has more info about this?

[1] - #10940
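For context, that warning corresponds to an Admin API call timing out. Below is a minimal, standalone sketch (not the operator's actual code) of the kind of call that fails when the operator cannot reach the controllers; it assumes a kafka-clients dependency and uses a placeholder bootstrap address:

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.QuorumInfo;

public class QuorumLeaderCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address; in the operator this points at the cluster's internal listener.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "my-cluster-kafka-bootstrap:9092");

        try (Admin admin = Admin.create(props)) {
            // If network policies block the connection, this future times out with
            // "Timed out waiting for a node assignment. Call: describeMetadataQuorum".
            QuorumInfo quorum = admin.describeMetadataQuorum().quorumInfo().get(30, TimeUnit.SECONDS);
            System.out.println("Controller quorum leader id: " + quorum.leaderId());
        }
    }
}
```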
Signed-off-by: see-quick <[email protected]>
Are you saying that all our STs are using mixed nodes? Looking at #10940, the problem there was a missing network policy, so if the problem were still in place I would expect a lot of tests failing today because the CO would not be able to talk directly to the controllers.
No, we are using separate roles in the STs.
I think we don't have an env with enforced network policies, as the tests were passing for Tina's PR anyway.
@see-quick @ppatierno I think having a test for this would be good. And I think the test is mostly good - but you are framing it in a weird way. We do not really want to test for the past bugs that we fixed. We want to test the features we have. So what we want to test is that the controller-only nodes are rolled properly. So I guess you should, for example:
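As a rough illustration of "check that the controller-only nodes are rolled", here is a sketch using the fabric8 client directly rather than the Strimzi test-framework helpers; the namespace, cluster name, and the `controller` pool name are assumptions about the test deployment:

```java
import java.util.Map;
import java.util.stream.Collectors;

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class ControllerRollCheck {

    // Snapshot the UIDs of the controller-only pods, keyed by pod name.
    static Map<String, String> controllerPodUids(KubernetesClient client, String namespace) {
        return client.pods().inNamespace(namespace)
                .withLabel("strimzi.io/cluster", "my-cluster")
                .withLabel("strimzi.io/pool-name", "controller")   // assumed node pool name
                .list().getItems().stream()
                .collect(Collectors.toMap(p -> p.getMetadata().getName(), p -> p.getMetadata().getUid()));
    }

    public static void main(String[] args) throws InterruptedException {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            Map<String, String> before = controllerPodUids(client, "myproject");

            // ... apply the configuration change to the Kafka / KafkaNodePool resource here ...

            // Poll until every controller pod has a new UID, i.e. it was recreated during the roll.
            for (int attempt = 0; attempt < 120; attempt++) {
                Map<String, String> now = controllerPodUids(client, "myproject");
                boolean allRolled = !now.isEmpty()
                        && now.entrySet().stream().noneMatch(e -> e.getValue().equals(before.get(e.getKey())));
                if (allRolled) {
                    return;
                }
                Thread.sleep(5_000);
            }
            throw new AssertionError("Controller-only pods did not roll in time");
        }
    }
}
```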
Which is good news :-)
So AFAIU (but I could be wrong), it looks like even the test in this PR could be useless, in the sense that if it had been in place before the bug, it wouldn't have caught it because of the missing network policy enforcement.
@scholzj I see the last sentence got cut? "The reason for that is that ...." :-)
@ppatierno that was just text I did not delete.
I can do it if we agree that's the best approach.
I think we already have tests for that. The problem is that our Minikube env on Azure and Kind on TF don't have NP enforcement enabled. If we want to improve it, we should go this way and configure it at the env level.
I think you will have to enable the enforcement of NPs on Minikube with some plugin or similar for the AZPs, as I think Minikube doesn't have it enabled by default. For the Jenkins and OCP pipelines there is this enforcement, as the tests failed there IIRC. Also, I'm not sure whether checking the NPs makes sense after the fix, or whether it makes sense to check it in some regression pipelines instead.
If we aim to test just that the controller-only nodes are rolled, then I am fine. If we are aiming to check #10940, then no, this test doesn't really test it without NP enforcement.
Type of change
Description
This PR adds a test case which checks that the KRaft cluster will not get stuck during a controller quorum timeout.
P.S.: I am not sure if we also want to check in the Cluster Operator logs that the quorum timeout warning does not appear anymore after the RollingUpdate etc. (but IMO, if the RollingUpdate happens, that is proof that such a bug is not present).
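If we did want that log check, a minimal sketch with the fabric8 client could look like the following; the namespace and operator pod name are placeholders, and the matched message is the warning quoted earlier in the conversation:

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class CoLogCheck {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Placeholder namespace and Cluster Operator pod name.
            String log = client.pods()
                    .inNamespace("myproject")
                    .withName("strimzi-cluster-operator-xxxxx")
                    .getLog();

            // Fail if the operator still cannot determine the controller quorum leader.
            if (log.contains("Error determining the controller quorum leader id")) {
                throw new AssertionError("Cluster Operator still cannot reach the controller quorum");
            }
        }
    }
}
```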
Checklist