TEZ-4725: Fix flaky tests in recovery, history-parser, and MRR integration suites #510

Open
maheshrajus wants to merge 1 commit into apache:master from maheshrajus:TEZ-4725
maheshrajus commented Jun 16, 2026

Contributor

TEZ-4725: Fix flaky tests in recovery, history-parser, and MRR integration suites

Root causes and fixes:

  1. TestAMRecoveryAggregationBroadcast.testMapJoinTemporalFailure (race condition)

    • Replace fixed Thread.sleep(10s) before AM kill with a deterministic
      waitForVertexSucceeded() helper that polls DAGClient.getVertexStatus()
      every 500ms (up to 60s) until the target vertices reach SUCCEEDED state.
    • Each test now waits for only the vertices it logically depends on before
      killing the AM, ensuring the recovery log assertions always see the
      expected counts.
    • Make OUT_PATH unique per test run (random suffix) to eliminate cross-test
      interference.
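
The polling approach in item 1 can be sketched as follows. This is a minimal, dependency-free illustration and not the helper from the patch: the class name `VertexWaiter`, the method name, and the `statusOf` function (a stand-in for a `DAGClient.getVertexStatus()` lookup) are all hypothetical, and vertex states are modelled as plain strings.

```java
import java.util.Collection;
import java.util.function.Function;

/** Hypothetical stand-in for the waitForVertexSucceeded() helper; not the patch's code. */
final class VertexWaiter {
    private VertexWaiter() {}

    /**
     * Polls until every named vertex reports "SUCCEEDED" or the timeout expires.
     * statusOf is an assumed stand-in for querying DAGClient.getVertexStatus().
     *
     * @return true if all vertices succeeded in time, false on timeout or interrupt
     */
    static boolean waitForVerticesSucceeded(Collection<String> vertexNames,
                                            Function<String, String> statusOf,
                                            long pollMillis,
                                            long timeoutMillis) {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            boolean allSucceeded = true;
            for (String vertex : vertexNames) {
                if (!"SUCCEEDED".equals(statusOf.apply(vertex))) {
                    allSucceeded = false;
                    break;
                }
            }
            if (allSucceeded) {
                return true;
            }
            // Overflow-safe comparison against the monotonic clock.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }
}
```

With the values quoted above, a test would call it with `pollMillis = 500` and `timeoutMillis = 60_000`, and only kill the AM once it returns true.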
  2. DAGClientRPCImpl / TezClientUtils: port out of range:-1 (YARN-808 gap)

    • YARN sets rpcPort=-1 when an AM container is allocated (state=RUNNING)
      but the AM has not yet bound its RPC listener. The existing guard only
      checked rpcPort==0 (protobuf default), so rpcPort==-1 reached
      NetUtils.createSocketAddrForHost(), which threw
      IllegalArgumentException: port out of range:-1.
    • Fix DAGClientRPCImpl.createAMProxyIfNeeded(): rpcPort == 0 → rpcPort <= 0.
    • Fix TezClientUtils.getAMProxy(FrameworkClient,...): add the same
      rpcPort <= 0 guard.
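
A minimal sketch of the guard in item 2, using only the JDK. It assumes `InetSocketAddress.createUnresolved()` as a stand-in for Hadoop's `NetUtils.createSocketAddrForHost()`; the JDK call rejects a negative port with the same kind of `IllegalArgumentException`. The class and method names are hypothetical.

```java
import java.net.InetSocketAddress;

/** Hypothetical illustration of the rpcPort <= 0 guard; not the patch's code. */
final class AmAddressGuard {
    private AmAddressGuard() {}

    /**
     * Returns an unresolved address for the AM, or null when the AM has not yet
     * published a usable RPC port.
     */
    static InetSocketAddress addressOrNull(String host, int rpcPort) {
        // 0 is the protobuf default; -1 is reported before the AM binds its listener.
        if (host == null || rpcPort <= 0) {
            return null;
        }
        return InetSocketAddress.createUnresolved(host, rpcPort);
    }
}
```

Returning null (or an equivalent "not ready yet" signal) lets the caller retry later instead of failing on an address that cannot be constructed.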
  3. TezClient.waitNonSessionTillReady(): infinite loop on slow CI

    • In non-session mode, when YARN reports state=RUNNING but rpcPort is still
      0 or -1, getAMStatus() catches the resulting ServiceException/IOException
      and returns INITIALIZING. Without a deadline the loop spun forever.
    • Add a deadline based on TEZ_SESSION_CLIENT_TIMEOUT_SECS (default 120s),
      which already represents "how long the client will wait for the AM to
      become contactable". On expiry a TezException is thrown with the
      application ID.
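
The deadline in item 3 might look roughly like the sketch below. It is an illustration under stated assumptions: `ReadyWaiter`, `AmNotReadyException` (an unchecked stand-in for `TezException`), and the `amStatus` supplier (a stand-in for the client's AM status query) are hypothetical, and the timeout is passed in directly rather than read from `TEZ_SESSION_CLIENT_TIMEOUT_SECS`.

```java
import java.time.Duration;
import java.util.function.Supplier;

/** Hypothetical illustration of a bounded wait; not the patch's code. */
final class ReadyWaiter {
    private ReadyWaiter() {}

    /** Unchecked stand-in for TezException so the sketch has no Tez dependency. */
    static final class AmNotReadyException extends RuntimeException {
        AmNotReadyException(String message) {
            super(message);
        }
    }

    /**
     * Waits while the AM reports "INITIALIZING"; throws once the deadline passes.
     * amStatus is an assumed stand-in for the client's AM status lookup.
     */
    static void waitTillReady(String appId, Supplier<String> amStatus,
                              Duration timeout, Duration pollInterval) {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while ("INITIALIZING".equals(amStatus.get())) {
            if (System.nanoTime() - deadline >= 0) {
                // Include the application ID so the failure can be traced in YARN.
                throw new AmNotReadyException("Application " + appId
                    + " was not contactable within " + timeout.toMillis() + " ms");
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                throw new AmNotReadyException(
                    "Interrupted while waiting for application " + appId);
            }
        }
    }
}
```

With the default quoted above, the caller would pass `Duration.ofSeconds(120)` as the timeout.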
  4. TestHistoryParser.testParserWithSuccessfulJob (empty zip entry)

    • ATS processes history events asynchronously; the DAG can finish before
      all vertex/task events are flushed to ATS. ATSImportTool.download() did
      not close ZipOutputStream in its finally block (only FileOutputStream was
      closed), leaving a truncated zip with empty entries on the disk.
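
Item 4 names the unclosed ZipOutputStream as the cause. One common way to guarantee closure is try-with-resources, sketched below against an in-memory buffer. This is an assumption about the shape of a fix, not a description of what the patch changes in `ATSImportTool.download()`; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical illustration of closing a ZipOutputStream reliably; not the patch's code. */
final class ZipCloseSketch {
    private ZipCloseSketch() {}

    /** Writes a single-entry zip; try-with-resources closes the zip stream on every path. */
    static byte[] writeZip(String entryName, String content) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(sink)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(content.getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Read the bytes only after close(), which finishes the archive.
        return sink.toByteArray();
    }

    /** Returns the named entry's content as UTF-8 text, or null if the entry is absent. */
    static String readEntry(byte[] zipBytes, String entryName) {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.getName().equals(entryName)) {
                    return new String(zis.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Closing the ZipOutputStream also closes the stream it wraps, so a separate close of the underlying file stream in a finally block is not needed in this form.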
  5. TestMRRJobsDAGApi.testMultipleMRRSleepJobViaSession (DAG SUCCEEDED ≠ AM READY)

    • In session mode, DAG reaching SUCCEEDED does not mean the AM session has
      transitioned back to READY; there is a short cleanup window. On slow CI
      the subsequent getAppMasterStatus() call returned RUNNING instead of READY.
    • Add tezSession.waitTillReady() before each assertEquals(READY, ...) so
      the test waits for the actual state transition.
  6. TestMRRJobsDAGApi: all @Test(timeout=60000) increased to 120000

    • waitNonSessionTillReady() now has an internal deadline of
      TEZ_SESSION_CLIENT_TIMEOUT_SECS (120s), so a 60s test timeout could
      fire before the client's own deadline is reached.

@tez-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 8m 50s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ master Compile Tests _
+1 💚 mvninstall 5m 11s master passed
+1 💚 compile 4m 2s master passed
+1 💚 checkstyle 0m 31s master passed
+1 💚 javadoc 0m 26s master passed
+0 🆗 spotbugs 0m 54s tez-tests in master has 6 extant spotbugs warnings.
_ Patch Compile Tests _
+1 💚 mvninstall 4m 1s the patch passed
+1 💚 codespell 1m 49s No new issues.
+1 💚 compile 4m 0s the patch passed
+1 💚 javac 4m 0s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 26s the patch passed
+1 💚 javadoc 0m 24s the patch passed
+1 💚 spotbugs 1m 3s the patch passed
_ Other Tests _
+1 💚 unit 70m 52s root in the patch passed.
+1 💚 asflicense 0m 28s The patch does not generate ASF License warnings.
104m 40s
Subsystem Report/Notes
Docker ClientAPI=1.54 ServerAPI=1.54 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-510/1/artifact/out/Dockerfile
Optional Tests dupname compile unit asflicense javac javadoc spotbugs checkstyle codespell detsecrets
uname Linux 79e8c3d814b0 5.15.0-181-generic #191-Ubuntu SMP Fri May 22 19:09:02 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality tez-personality.sh
git revision master / 92dedd5
Default Java Eclipse Adoptium-21.0.11+10-LTS
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-510/1/testReport/
Max. process+thread count 1383 (vs. ulimit of 5500)
modules C: tez-tests U: tez-tests
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-510/1/console
versions git=2.43.0 maven=3.9.15 spotbugs=4.9.3 codespell=2.4.1
Powered by Apache Yetus 0.15.1 https://yetus.apache.org

This message was automatically generated.

@tez-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 9m 4s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ master Compile Tests _
+0 🆗 mvndep 0m 34s Maven dependency ordering for branch
+1 💚 mvninstall 4m 50s master passed
+1 💚 compile 4m 0s master passed
+1 💚 checkstyle 0m 59s master passed
+1 💚 javadoc 1m 8s master passed
+0 🆗 spotbugs 1m 23s tez-api in master has 92 extant spotbugs warnings.
+0 🆗 spotbugs 0m 51s tez-tests in master has 6 extant spotbugs warnings.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 9s Maven dependency ordering for patch
+1 💚 mvninstall 3m 58s the patch passed
+1 💚 codespell 1m 46s No new issues.
+1 💚 compile 3m 57s the patch passed
+1 💚 javac 3m 57s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 54s the patch passed
+1 💚 javadoc 1m 7s the patch passed
+1 💚 spotbugs 2m 37s the patch passed
_ Other Tests _
-1 ❌ unit 78m 34s /patch-unit-root.txt root in the patch failed.
+1 💚 asflicense 0m 57s The patch does not generate ASF License warnings.
118m 52s
Reason Tests
Failed junit tests tez.history.TestHistoryParser
tez.mapreduce.TestMRRJobsDAGApi
Subsystem Report/Notes
Docker ClientAPI=1.55 ServerAPI=1.55 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-510/2/artifact/out/Dockerfile
Optional Tests dupname compile unit asflicense javac javadoc spotbugs checkstyle codespell detsecrets
uname Linux d085e89e2589 5.15.0-181-generic #191-Ubuntu SMP Fri May 22 19:09:02 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality tez-personality.sh
git revision master / 72a42de
Default Java Eclipse Adoptium-21.0.11+10-LTS
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-510/2/testReport/
Max. process+thread count 1471 (vs. ulimit of 5500)
modules C: tez-api tez-tests U: .
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-510/2/console
versions git=2.43.0 maven=3.9.15 spotbugs=4.9.3 codespell=2.4.1
Powered by Apache Yetus 0.15.1 https://yetus.apache.org

This message was automatically generated.

@tez-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 28s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 1s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 2 new or modified test files.
_ master Compile Tests _
+0 🆗 mvndep 0m 39s Maven dependency ordering for branch
+1 💚 mvninstall 4m 49s master passed
+1 💚 compile 3m 56s master passed
+1 💚 checkstyle 1m 26s master passed
+1 💚 javadoc 1m 33s master passed
+0 🆗 spotbugs 1m 24s tez-api in master has 92 extant spotbugs warnings.
+0 🆗 spotbugs 0m 50s tez-tests in master has 6 extant spotbugs warnings.
+0 🆗 spotbugs 0m 47s tez-plugins/tez-history-parser in master has 21 extant spotbugs warnings.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 10s Maven dependency ordering for patch
+1 💚 mvninstall 4m 2s the patch passed
-1 ❌ codespell 1m 46s /results-codespell.txt The patch generated 1 new + 7 unchanged - 0 fixed = 8 total (was 7)
+1 💚 compile 3m 58s the patch passed
+1 💚 javac 3m 58s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 1m 19s the patch passed
+1 💚 javadoc 1m 32s the patch passed
+1 💚 spotbugs 3m 38s the patch passed
_ Other Tests _
+1 💚 unit 71m 25s root in the patch passed.
+1 💚 asflicense 1m 21s The patch does not generate ASF License warnings.
107m 31s
Subsystem Report/Notes
Docker ClientAPI=1.55 ServerAPI=1.55 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-510/3/artifact/out/Dockerfile
Optional Tests dupname compile unit asflicense javac javadoc spotbugs checkstyle codespell detsecrets
uname Linux a1e0dd6085be 5.15.0-181-generic #191-Ubuntu SMP Fri May 22 19:09:02 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality tez-personality.sh
git revision master / f924848
Default Java Eclipse Adoptium-21.0.11+10-LTS
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-510/3/testReport/
Max. process+thread count 1431 (vs. ulimit of 5500)
modules C: tez-api tez-tests tez-plugins/tez-history-parser U: .
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-510/3/console
versions git=2.43.0 maven=3.9.15 spotbugs=4.9.3 codespell=2.4.1
Powered by Apache Yetus 0.15.1 https://yetus.apache.org

This message was automatically generated.

@tez-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 29s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 2 new or modified test files.
_ master Compile Tests _
+0 🆗 mvndep 0m 36s Maven dependency ordering for branch
+1 💚 mvninstall 4m 47s master passed
+1 💚 compile 3m 57s master passed
+1 💚 checkstyle 1m 22s master passed
+1 💚 javadoc 1m 33s master passed
+0 🆗 spotbugs 1m 23s tez-api in master has 92 extant spotbugs warnings.
+0 🆗 spotbugs 0m 52s tez-tests in master has 6 extant spotbugs warnings.
+0 🆗 spotbugs 0m 46s tez-plugins/tez-history-parser in master has 21 extant spotbugs warnings.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 10s Maven dependency ordering for patch
+1 💚 mvninstall 4m 2s the patch passed
+1 💚 codespell 1m 48s No new issues.
+1 💚 compile 3m 56s the patch passed
+1 💚 javac 3m 56s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 1m 19s the patch passed
+1 💚 javadoc 1m 32s the patch passed
+1 💚 spotbugs 3m 39s the patch passed
_ Other Tests _
+1 💚 unit 70m 38s root in the patch passed.
+1 💚 asflicense 1m 21s The patch does not generate ASF License warnings.
106m 34s
Subsystem Report/Notes
Docker ClientAPI=1.55 ServerAPI=1.55 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-510/4/artifact/out/Dockerfile
Optional Tests dupname compile unit asflicense javac javadoc spotbugs checkstyle codespell detsecrets
uname Linux 1deca8a52811 5.15.0-181-generic #191-Ubuntu SMP Fri May 22 19:09:02 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality tez-personality.sh
git revision master / f924848
Default Java Eclipse Adoptium-21.0.11+10-LTS
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-510/4/testReport/
Max. process+thread count 1466 (vs. ulimit of 5500)
modules C: tez-api tez-tests tez-plugins/tez-history-parser U: .
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-510/4/console
versions git=2.43.0 maven=3.9.15 spotbugs=4.9.3 codespell=2.4.1
Powered by Apache Yetus 0.15.1 https://yetus.apache.org

This message was automatically generated.

maheshrajus changed the title from "TEZ-4725: [WIP] Fix flaky tests in TestAMRecoveryAggregationBroadcast" to "TEZ-4725: Fix flaky tests in recovery, history-parser, and MRR integration suites" on Jun 24, 2026
@maheshrajus

Contributor Author

@abstractdog @ayushtkn
Could you please review this PR at your convenience?
Thank you!

@tez-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 28s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 2 new or modified test files.
_ master Compile Tests _
+0 🆗 mvndep 0m 36s Maven dependency ordering for branch
+1 💚 mvninstall 4m 48s master passed
+1 💚 compile 3m 56s master passed
+1 💚 checkstyle 1m 24s master passed
+1 💚 javadoc 1m 33s master passed
+0 🆗 spotbugs 1m 24s tez-api in master has 92 extant spotbugs warnings.
+0 🆗 spotbugs 0m 51s tez-tests in master has 6 extant spotbugs warnings.
+0 🆗 spotbugs 0m 47s tez-plugins/tez-history-parser in master has 21 extant spotbugs warnings.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 10s Maven dependency ordering for patch
+1 💚 mvninstall 4m 3s the patch passed
+1 💚 codespell 1m 45s No new issues.
+1 💚 compile 3m 58s the patch passed
+1 💚 javac 3m 58s the patch passed
+1 💚 blanks 0m 1s The patch has no blanks issues.
+1 💚 checkstyle 1m 17s the patch passed
+1 💚 javadoc 1m 31s the patch passed
+1 💚 spotbugs 3m 37s the patch passed
_ Other Tests _
+1 💚 unit 69m 27s root in the patch passed.
+1 💚 asflicense 1m 21s The patch does not generate ASF License warnings.
105m 22s
Subsystem Report/Notes
Docker ClientAPI=1.55 ServerAPI=1.55 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-510/5/artifact/out/Dockerfile
Optional Tests dupname compile unit asflicense javac javadoc spotbugs checkstyle codespell detsecrets
uname Linux 67c899afb819 5.15.0-181-generic #191-Ubuntu SMP Fri May 22 19:09:02 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality tez-personality.sh
git revision master / 61cc636
Default Java Eclipse Adoptium-21.0.11+10-LTS
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-510/5/testReport/
Max. process+thread count 1356 (vs. ulimit of 5500)
modules C: tez-api tez-tests tez-plugins/tez-history-parser U: .
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-510/5/console
versions git=2.43.0 maven=3.9.15 spotbugs=4.9.3 codespell=2.4.1
Powered by Apache Yetus 0.15.1 https://yetus.apache.org

This message was automatically generated.
