
[system test] KRaft cluster not stuck during controller quorum timeout #11031

Open
wants to merge 2 commits into main
Conversation

@see-quick (Member) commented Jan 14, 2025

Type of change

  • Enhancement / new feature

Description

This PR adds a test case which checks that the KRaft cluster does not get stuck during a controller quorum timeout.

P.S.: I am not sure whether we also want to check in the Cluster Operator logs that this warning:

2024-12-11 10:09:15 WARN  KafkaQuorumCheck:64 - Reconciliation #55(watch) Kafka(myproject/my-cluster): Error determining the controller quorum leader id
org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeMetadataQuorum

does not exist anymore after the RollingUpdate etc. (but IMO, if the RollingUpdate happens, that is proof that such a bug is not present).
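
If we do want that check, here is a rough sketch of how it could look (illustrative only, not part of this PR), using the fabric8 client directly; the namespace and the Cluster Operator pod label are assumptions:

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class ClusterOperatorLogCheck {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Find the Cluster Operator pod; the label selector is an assumption, adjust to the actual deployment.
            String coPodName = client.pods()
                .inNamespace("myproject")
                .withLabel("name", "strimzi-cluster-operator")
                .list().getItems().get(0)
                .getMetadata().getName();

            // Grab the current Cluster Operator log.
            String coLog = client.pods()
                .inNamespace("myproject")
                .withName(coPodName)
                .getLog();

            // The WARN line from the reproducer; after a successful rolling update it should be gone.
            boolean quorumTimeoutPresent = coLog.contains("Error determining the controller quorum leader id");
            System.out.println("Quorum timeout warning still present: " + quorumTimeoutPresent);
        }
    }
}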

Checklist

  • Write tests
  • Make sure all tests pass
  • Update documentation

@see-quick added this to the 0.46.0 milestone Jan 14, 2025
@see-quick self-assigned this Jan 14, 2025
@see-quick requested review from a team January 14, 2025 10:59
@ppatierno (Member)

Trying to understand the purpose of the test. Isn't it just making a configuration change and waiting for all pods to roll while the cluster is still functioning? Isn't that something already covered by most of the tests? I mean, where are we trying to drive the cluster into a timeout while checking the controller quorum, and then checking that the operator doesn't get stuck on it?

@see-quick (Member Author) commented Jan 14, 2025

> Trying to understand the purpose of the test. Isn't it just making a configuration change and waiting for all pods to roll while the cluster is still functioning? Isn't that something already covered by most of the tests? I mean, where are we trying to drive the cluster into a timeout while checking the controller quorum, and then checking that the operator doesn't get stuck on it?

I don't know if we do exactly this with dedicated controllers. But as mentioned here [1], I think the reproducer was:

1. Deploy a KRaft cluster with dedicated controllers
2. Once deployed, modify the Kafka CR to change a value that requires a rolling update - e.g. auto.create.topics.enable (see the sketch below)
3. Wait for the rolling update
4. The CO will get stuck with this:
2024-12-11 10:09:15 WARN  KafkaQuorumCheck:64 - Reconciliation #55(watch) Kafka(myproject/my-cluster): Error determining the controller quorum leader id
org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeMetadataQuorum
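
For illustration, here is a minimal sketch of step 2, the config change that forces the rolling update (assuming the Strimzi api module's Crds helper and the cluster/namespace names from the log above; the actual system test uses its own resource helpers):

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.strimzi.api.kafka.Crds;

public class TriggerControllerRoll {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Step 2 of the reproducer: change a broker-level option so the operator has to roll
            // all nodes, including the dedicated controllers.
            Crds.kafkaOperation(client)
                .inNamespace("myproject")
                .withName("my-cluster")
                .edit(kafka -> {
                    // Assumes .spec.kafka.config is already set in the Kafka CR.
                    kafka.getSpec().getKafka().getConfig().put("auto.create.topics.enable", false);
                    return kafka;
                });
            // Steps 3-4: wait for the rolling update and watch whether the CO gets stuck
            // (the system test would use its own rolling-update utilities for that).
        }
    }
}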

I mean, we can also try to fetch the Cluster Operator logs, but based on our past experience I'm not sure how reliable that would be. Maybe @scholzj has more info about this?

[1] - #10940

Signed-off-by: see-quick <[email protected]>
@ppatierno (Member)

> I don't know if we do exactly this with dedicated controllers.

Are you saying that all our STs are using mixed nodes?

Looking at #10940, the problem there was a missing network policy, so if the problem were still in place I would expect a lot of tests failing today because the CO would not be able to talk directly to the controllers.

@im-konge (Member)

> Are you saying that all our STs are using mixed nodes?

No, we are using separate roles in the STs.

> I would expect a lot of tests failing today because the CO would not be able to talk directly to the controllers.

I think we don't have an env with enforced network policies, as the tests were passing for Tina's PR anyway.

@scholzj (Member) commented Jan 14, 2025

@see-quick @ppatierno I think having a test for this would be good. And I think the test is mostly good - but you are framing it in a weird way. We do not really want to test for the past bugs that we fixed. We want to test for the features we have. So what we want to test is that the controller-only nodes are rolled properly. So I guess you should, for example (see the outline below):

  • Name it something like testKafkaRollingUpdatesWithDedicatedControllers or something
  • Update the log message
    LOGGER.info("Modifying Kafka CR to enable auto.create.topics.enable=false, expecting rolling update without CO getting stuck.");
    
    to something like
    LOGGER.info("Modifying Kafka CR to enable auto.create.topics.enable=false, expecting rolling update of all nodes including controllers.");
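
Put together, a hypothetical outline of what I mean (the method name and log message are just the suggestions above; the body is a placeholder):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical outline only; the real test lives in the system-test framework and uses its helpers.
class RollingUpdateST {
    private static final Logger LOGGER = LoggerFactory.getLogger(RollingUpdateST.class);

    void testKafkaRollingUpdatesWithDedicatedControllers() {
        LOGGER.info("Modifying Kafka CR to enable auto.create.topics.enable=false, expecting rolling update of all nodes including controllers.");
        // 1. modify the Kafka CR
        // 2. wait for all node pools, including the dedicated controllers, to roll
        // 3. verify the cluster is healthy and the controller quorum has a leader
    }
}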
    

@ppatierno (Member)

> No, we are using separate roles in the STs.

Which is good news :-)

> I think we don't have an env with enforced network policies, as the tests were passing for Tina's PR anyway.

So AFAIU (but I could be wrong), it looks like even the test in this PR could be useless, in the sense that if it had been in place before the bug, it wouldn't have caught it because of the missing network policy enforcement.
At this point, should we have some tests with network policy enforcement instead?

@ppatierno (Member)

@scholzj I see the last sentence got cut? "The reason for that is that ...." :-)

@scholzj (Member) commented Jan 14, 2025

@ppatierno that was just leftover text that I didn't delete.

@see-quick (Member Author)

> At this point, should we have some tests with network policy enforcement instead?

I can do it if we agree that's the best approach.

@Frawless (Member)

> At this point, should we have some tests with network policy enforcement instead?

I think we already have tests for that. The problem is that our minikube env on Azure and kind on TF don't have NP enforcement enabled. If we want to improve this, we should go that way and configure it at the env level.

@im-konge (Member)

> I can do it if we agree that's the best approach.

I think you will have to enable NP enforcement on minikube with some plugin or similar for the AZPs, as I think minikube doesn't have it enabled by default. For the Jenkins and OCP pipelines, the enforcement is there, as the tests failed there IIRC.

Also, I'm not sure if checking the NPs makes sense after the fix - maybe it makes sense to check it in some regression pipelines.
Having a test that checks whether the controller nodes were correctly rolled makes sense to me -> we had similar tests for ZooKeeper, but I think we didn't have such a test for KRaft controllers.

@ppatierno (Member)

> We do not really want to test for the past bugs that we fixed. We want to test for the features we have. So what we want to test is that the controller-only nodes are rolled properly.

If we aim to test just that the controller-only nodes are rolled, then I am fine with it. If we are aiming to check #10940, then no, this test doesn't really test it without NP enforcement.

5 participants