
Conversation


@Mrhs121 Mrhs121 commented Nov 13, 2025

Purpose of this pull request

close #10059

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

Contributor

@davidzollo davidzollo left a comment


Good job.
CI is not successful; you can refer to https://github.com/apache/seatunnel/pull/10060/checks?check_run_id=55259688662.

CI takes a few hours per run ^_^

Can you add an E2E test for enabling 2PC?

@davidzollo
Contributor

When will the flushing=false state be reset?

Author

Mrhs121 commented Nov 14, 2025

When will the flushing=false state be reset?

Thanks for pointing this out. The flushing flag should be reset to false immediately after the flush operation completes. I'll include the fix for resetting the flushing state along with the new E2E test in the next commit.

FYI, I noticed the CI failure was due to the job exceeding the 10-minute timeout limit.
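A minimal sketch of that reset, assuming a volatile flushing flag on a stream-load class (the class and method names here are illustrative stand-ins, not the connector's exact API). The key point is that the reset lives in a finally block, so a failed flush can never leave the flag stuck at true:

```java
// Hypothetical sketch: reset the "flushing" flag in a finally block so it
// cannot remain true forever when the flush fails with an exception.
public class FlushingFlagSketch {
    private volatile boolean flushing = false;

    // Simulates the flush path: set the flag, do the work, always reset.
    public void flush(boolean failMidFlush) {
        flushing = true;
        try {
            if (failMidFlush) {
                throw new RuntimeException("stream load request failed");
            }
            // ... send buffered rows to Doris and wait for the response ...
        } finally {
            // Reset immediately after the flush completes, even on failure,
            // so callers polling the flag never spin forever.
            flushing = false;
        }
    }

    public boolean isFlushing() {
        return flushing;
    }
}
```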

Contributor

davidzollo commented Nov 14, 2025

When will the flushing=false state be reset?

Thanks for pointing this out. The flushing flag should be reset to false immediately after the flush operation completes. I'll include the fix for resetting the flushing state along with the new E2E test in the next commit.

FYI, I noticed the CI failure was due to the job exceeding the 10-minute timeout limit.

Good. You can also help fix CI ^_^
Usually reviewers will review a new PR once CI passes.

By the way, I think we could have a more in-depth conversation to help you get familiar with SeaTunnel. Feel free to contact me on LinkedIn (David Zollo) or WeChat (taskflow). When adding me, please let me know your GitHub ID.

Author

Mrhs121 commented Nov 14, 2025

I have provided a pure test case #10069 to reproduce #10059.

@github-actions github-actions bot added the e2e label Nov 15, 2025
}
} catch (Exception e) {
throw new RuntimeException(e);
} finally {
Author

@Mrhs121 Mrhs121 Nov 15, 2025


In Spark 2.4, the DataWriter interface does not extend Closeable, so when the test case runs on the Spark 2.4 engine and the job fails, the close() method of the DorisSinkWriter is never invoked. As a result, the threads inside the DorisSinkWriter remain alive and prevent the SeaTunnel job from terminating. Therefore, I release the resources here.
I'm not sure if this is a good fix.
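The idea can be sketched as follows, with a simplified stand-in interface rather than the actual SeaTunnel Spark 2.4 classes: because the engine never calls close() for us on failure, abort() releases the writer's resources in a finally block, so cleanup runs even when the abort logic itself throws.

```java
// Hypothetical sketch: abort() guarantees close() runs even when
// abortPrepare() throws, so no leftover thread keeps the job alive.
// The interface is a simplified stand-in, not the real SeaTunnel type.
public class AbortCleanupSketch {
    interface SinkWriter {
        void abortPrepare() throws Exception; // may itself fail
        void close() throws Exception;        // stops background threads
    }

    static void abort(SinkWriter writer) throws Exception {
        try {
            writer.abortPrepare();
        } finally {
            // Spark 2.4's DataWriter is not Closeable, so the engine never
            // calls close() on failure; release resources here ourselves.
            writer.close();
        }
    }
}
```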

Author


@davidzollo I made some new changes, please help review it when you have time, thank you.

Member


It is not allowed to perform close in abortPrepare, and other connectors do not have such an implementation.
How about implementing Closeable in Spark 2.4?


@zhangshenghang
Member

I have provided a pure test case #10069 to reproduce #10059

We can merge #10069 into the current PR to verify that this issue will not occur.


public RespContent stopLoad() throws IOException {
    loading = false;
    flushing = true;
Member


Why add a separate "flushing" instead of just using "loading"?

Author

@Mrhs121 Mrhs121 Nov 18, 2025


FYI, you can take a look at the description of the root cause in #10059 (comment) first.

As shown in the following code, the error message in the HTTP response is only obtained while in the loading state, and loading is reset to false during flush. Therefore, if the flush action is executed before the HTTP response is returned, errorMessage will always be null, an infinite loop occurs, and the SeaTunnel task never stops.

public String getLoadFailedMsg() {
    if (!loading) {
        return null;
    }
    if (this.getPendingLoadFuture() != null && this.getPendingLoadFuture().isDone()) {
        String errorMessage;
        try {
            errorMessage = handlePreCommitResponse(pendingLoadFuture.get()).getMessage();
        } catch (Exception e) {
            errorMessage = ExceptionUtils.getMessage(e);
        }
        recordStream.setErrorMessageByStreamLoad(errorMessage);
        return errorMessage;
    } else {
        return null;
    }
}

Another way to fix it is to move the reset of loading to false to after endInput; that is, loading is only considered finished once the stream is truly closed.
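That second option can be sketched like this, with an illustrative class (not the connector's real one): flush() no longer clears the loading flag, so getLoadFailedMsg() can still surface a stream-load error during shutdown, and only endInput() marks the load as finished.

```java
// Hypothetical sketch of the second fix: keep loading == true across
// flushes; only endInput() ends the load. The pendingError field stands in
// for checking the completed HTTP response future.
public class LoadingLifecycleSketch {
    private volatile boolean loading = true;
    private volatile String pendingError;

    public void flush(String errorFromResponse) {
        // Keep loading == true here: the load is not over until endInput().
        pendingError = errorFromResponse;
    }

    public void endInput() {
        // The stream is truly closed now, so the load is finished.
        loading = false;
    }

    public String getLoadFailedMsg() {
        if (!loading) {
            return null; // mirrors the guard in the snippet above
        }
        return pendingError;
    }
}
```

With this ordering, an error returned by Doris during a flush is still visible to the polling RecordBuffer instead of being hidden behind the loading guard.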

Member


I approve of your second plan.

Author


I approve of your second plan.

Done

Author


I approve of your second plan.

@zhangshenghang I made some new changes, please help review it when you have time, thank you.

Author

Mrhs121 commented Nov 21, 2025

@zhangshenghang I made some new changes, please help review it when you have time, thank you.

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a critical issue where DorisStreamLoad's loading state mismanagement caused RecordBuffer to enter an infinite loop during shutdown, particularly when Doris returns parsing errors (e.g., ANALYSIS_ERROR). The fix ensures proper cleanup and state management across multiple components.

Key Changes

  • Moved the loading flag update to a finally block in DorisStreamLoad to ensure consistent state even when exceptions occur
  • Added try-catch-finally blocks in DorisSinkWriter.close() and SparkDataWriter.abort() to guarantee resource cleanup
  • Moved sinkWriter.close() from commit() to close() in Spark 3.3 DataWriter for proper lifecycle management
  • Added E2E tests to verify graceful failure handling when Doris returns cast errors
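The third change above (moving cleanup out of commit()) can be sketched with simplified stand-in types (not the actual SeaTunnel/Spark interfaces): commit() only commits, and all resource cleanup lives in close(), which the engine invokes on both the success and failure paths.

```java
// Hypothetical sketch of the Spark 3.3 writer lifecycle change: commit()
// does no cleanup; close() is the single cleanup point on every code path.
public class DataWriterLifecycleSketch implements AutoCloseable {
    private boolean committed = false;
    private boolean closed = false;

    public void commit() {
        // Only produce the commit result; no resource cleanup here.
        committed = true;
    }

    @Override
    public void close() {
        // Single place for cleanup, reached on success and on failure.
        closed = true;
    }

    public boolean isCommitted() { return committed; }
    public boolean isClosed() { return closed; }
}
```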

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file

  • DorisStreamLoad.java: Moved loading = false to finally block to prevent infinite loops in RecordBuffer when exceptions occur during stopLoad()
  • DorisSinkWriter.java: Added try-catch-finally to ensure scheduledExecutorService and dorisStreamLoad are closed even if flush() fails
  • SeaTunnelSparkDataWriter.java: Moved sinkWriter.close() and WriterCloseEvent from commit() to close() method for proper resource lifecycle
  • SparkDataWriter.java: Enhanced abort() with try-catch-finally to ensure sinkWriter.close() is called even when abort operations fail
  • doris_source_and_sink_with_cast_error.conf: Test configuration for cast error scenario with 2PC disabled
  • doris_source_and_sink_with_cast_error_2pc_true.conf: Test configuration for cast error scenario with 2PC enabled
  • DorisIT.java: Added testDorisCastError() to verify graceful failure when Doris returns type cast errors, and createTypeCastErrorSinkTableForTest() to create incompatible schema


flush();
}
} catch (Exception e) {
log.error("Flush data failed when close doris writer.", e);

Copilot AI Dec 5, 2025


Grammar error in log message. Should be "when closing" instead of "when close".

Suggested change
log.error("Flush data failed when close doris writer.", e);
log.error("Flush data failed when closing doris writer.", e);

Comment on lines -95 to -96
sinkWriter.close();
context.getEventListener().onEvent(new WriterCloseEvent());
Member


@Hisoka-X Will there be a problem? Why was it closed here before?

Author


It looks like this snippet was copied from the Spark 2 template and the author missed the subtle difference: the DataWriter interface in Spark 2 doesn't implement Closeable, so it must be closed manually here. (¬_¬)

Author

Mrhs121 commented Dec 10, 2025

@zhangshenghang I made some new changes, please help review it when you have time, thank you.



Development

Successfully merging this pull request may close these issues.

[Bug] [connector-doris] DorisStreamLoad loading state mismanagement causes RecordBuffer infinite loop during shutdown
