-
Notifications
You must be signed in to change notification settings - Fork 13.8k
[FLINK-38404][core] Fix JobMasterTest by making it scheduler-type aware #27091
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I don't think e2e test failure is related |
| .map(Event::getAttributes) | ||
| .map(x -> x.get("newJobStatus"))) | ||
| .containsExactly( | ||
| .containsSubsequence( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a quick look at this is. My concern is that the CREATED event might not be coming out - and this is a Flink bug. I would feel for confident if the test checks that the CREATED event is coming out and the code be arranged so that the race condition does not occur. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this test cares more about the RUNNING → FAILING → FAILED events sequence when planned failures happen, I agree it's possible to hide a real problem in the Flink non-testing code. However, I do not find an easy fix that only touches the testing code to avoid the race condition. As the JobStatusChangeEvent is only for new job status (code), the initial CREATED is not expected to be captured here.
Reporting JobStatusChange event seems pretty new. @pnowojski may provide some ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we'd like to change the Flink non-testing code, I have updated the PR to demonstrate the idea of notifying the jobStatus listeners at registration time. But I presume that's a larger code change, and just share for discussion. I'll look into the code more carefully.
|
From my observation, when this test was introduced, it was only validated on the |
|
@noorall This is indeed very helpful context. I can revert the new commit and make improvements on the first commit that fixes the test here. One quick question though: why the test fails intermittently? Is the scheduler used in tests non-deterministic? Can we make it static in test and/or make test code aware of which schedule is being used? Thanks! |
|
|
noorall
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The works overall looks good. I have a comment. PTAL. @liuml07
| .map(x -> (String) x.get("newJobStatus")) | ||
| .collect(Collectors.toList()); | ||
|
|
||
| if (schedulerType == SchedulerType.Default) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer using if (schedulerType == SchedulerType.Adaptive) as Default is a more general scheduler.
noorall
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
|
@flinkbot run azure |
|
@liuml07 The flink robot is not available now. Could you please append an empty commit to trigger a new CI run? |
This reverts commit a63cd554e25a14f69adee3501d249a4129fa619f.
|
Sounds good. I rebased from master again. |
| "Test disconnectTaskManager exception.")), | ||
| jobEvents); | ||
|
|
||
| assertThat( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I missed something—this part seems to have been omitted? @liuml07
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this is part of the new helper function assertJobStatusTransitions(schedulerType, jobEvents) as this assertion is also about status change and it's duplicate between the two tests, so just reused it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this is part of the new helper function
assertJobStatusTransitions(schedulerType, jobEvents)as this assertion is also about status change and it's duplicate between the two tests, so just reused it.
Got it, LGTM! 😊
| * DefaultScheduler does not emit CREATED state, while AdaptiveScheduler and | ||
| * AdaptiveBatchScheduler do. | ||
| */ | ||
| private static void assertJobStatusTransitions( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay in responding. It looks like we have got to a the root cause of this, in that we have 2 different types of scheduler that contain different job status's.
I see that this method checks for the schedulerType and performs different asserts. I cannot see from the test that this method is driven for each scheduler type. If the 2 callers of this method cover the 2 scheduler types - it would be good if this was more explicit in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The slotPoolServiceSchedulerFactory is created from the configuration fromConfiguration() which would honor the dynamic scheduler config when running the tests. The default scheduler is used if no config overrides. According to @noorall we have CI tests that injects different scheduler for all tests. This test is more about JobMaster per se and iterating explicitly all schedulers seems unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davidradl the maven profile 'enable-adaptive-scheduler' was introduced so that lots of existing tests can run against the AdaptiveScheduler which was added later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree with @noorall and @zhuzhurk that he scheduler used in tests are not static, but can be set when running it. Specifically it's via the flink.tests.enable-adaptive-scheduler JVM args (or run tests via Maven profile enable-adaptive-scheduler). The slotPoolServiceSchedulerFactory creates schedulers dynamically by checking this property. With this PR, in this test we make different assertions accordingly.
By default, when this test is running, the DEFAULT scheduler is used so the else clause of this assertion method is called. When the test is running with either flink.tests.enable-adaptive-scheduler=true or with the Maven profile enable-adaptive-scheduler, the ADAPTIVE scheduler is used and the if clause of this assertion method is called. One can enable it locally for debugging, and in the nightly CI, one workflow is enabling it, see here.
I guess the purpose of adding the dynamic feature through the Maven profile 'enable-adaptive-scheduler' is likely to eliminate the need for existing tests to explicitly enumerate various scheduler types, as long as they are either agnostic or aware of scheduler type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davidradl Please help to review again. Thanks.
zhuzhurk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
|
Thanks for fixing it! @liuml07 |
|
@liuml07 @noorall @davidradl @zhuzhurk Thanks for the fix and reviews. |
|
Thanks for all comments. Merged. |
https://issues.apache.org/jira/browse/FLINK-38404