
Conversation


@cjac cjac commented Nov 15, 2025

This commit addresses an issue where CDAP pipelines were incorrectly marked as FAILED when ephemeral Dataproc cluster deprovisioning attempted to cancel a job that had already completed.

The following changes are included:

1.  **RemoteExecutionTwillController:** Added a `RuntimeJobStatus` check before attempting to force kill a remote process in the `complete()` method's exception handler. This prevents sending a kill command to jobs already in a terminal state (see the sketch after this list).

2.  **AbstractDataprocProvisioner:** Modified `deleteClusterWithStatus` to specifically detect and handle the error the Dataproc API returns when a CancelJob request is made on a job in the DONE state. This error is now logged as a warning and no longer causes the pipeline to be marked as FAILED.

3.  **Unit Tests:** Added unit tests for both `RemoteExecutionTwillController` and `DataprocProvisioner` to verify the new logic and prevent regressions.

4.  **CONTRIBUTING.rst:** Updated the issues link to the current JIRA URL.

These changes ensure that the pipeline status accurately reflects the execution result even when there are timing issues during cluster deprovisioning.

Fixes: b/460875216
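
As a rough illustration of the first change, here is a minimal sketch of the guard in the `complete()` exception handler. It reuses only identifiers visible in this PR's diff (`remoteProcessController`, `programRunId`, `RuntimeJobStatus.RUNNING`); the branch shape and log messages are assumptions, not the merged code.

```java
// Sketch only: guard before force kill in RemoteExecutionTwillController's
// complete() exception handler. CDAP imports are omitted.
try {
  // ... wait for the remote job to reach a terminal state ...
} catch (Exception e) {
  // Re-read the status instead of killing unconditionally: the job may have
  // finished in the window between the last poll and this handler.
  // (getStatus() may itself throw; error handling is elided in this sketch.)
  RuntimeJobStatus currentStatus = remoteProcessController.getStatus();
  if (currentStatus == RuntimeJobStatus.RUNNING) {
    LOG.debug("Force termination of remote process for program run {}", programRunId);
    remoteProcessController.kill(RuntimeJobStatus.RUNNING);
  } else {
    // Already terminal: a CancelJob now would be rejected by Dataproc and
    // previously caused the pipeline to be marked FAILED.
    LOG.debug("Remote process for program run {} already in state {}; skipping kill",
        programRunId, currentStatus);
  }
}
```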

@cjac cjac force-pushed the fix/cdf-deprovision-race branch from f5a62b0 to 225ca01 on November 20, 2025 at 17:51.
```java
try {
  LOG.debug("Force termination of remote process for program run {}", programRunId);
  remoteProcessController.kill(RuntimeJobStatus.RUNNING);
  RuntimeJobStatus currentStatus = remoteProcessController.getStatus();
```
Contributor

The prior logic:

  • While the status is RUNNING, keep checking every second.
  • If it stays RUNNING for more than 5 seconds, throw an IllegalStateException.
  • So the moment the 5-second check expires, control goes straight to the catch block for force termination, without any gap.

I agree that within those few milliseconds the Dataproc job status could become DONE.

But with the new extra check, the gap where the error can occur still exists, so this intermittent wrongful killing of the pipeline would still happen.

My point is that there is not much of a time gap between the existing `getStatus() == RUNNING` observation and `remoteProcessController.kill()`, and the same is true of the extra check (a sketch of this window follows below).
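
For concreteness, here is a hedged reconstruction of the wait-then-kill sequence this comment describes, with the race window marked. The timer mechanics, method shape, and class name are assumptions drawn from the comment, not the actual CDAP source; `RemoteProcessController` and `RuntimeJobStatus` are the CDAP types from the diff, with imports omitted.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical reconstruction of the prior logic described above.
class WaitThenKillSketch {
  private final RemoteProcessController remoteProcessController;

  WaitThenKillSketch(RemoteProcessController controller) {
    this.remoteProcessController = controller;
  }

  void awaitTermination() throws Exception {
    long deadline = System.currentTimeMillis() + 5_000;  // 5-second budget
    try {
      while (remoteProcessController.getStatus() == RuntimeJobStatus.RUNNING) {
        if (System.currentTimeMillis() > deadline) {
          throw new IllegalStateException("Remote process still running after 5s");
        }
        TimeUnit.SECONDS.sleep(1);  // re-check once per second
      }
    } catch (IllegalStateException e) {
      // RACE WINDOW: the loop's last observation saw RUNNING, but the Dataproc
      // job can reach DONE before this kill lands. An extra status check here
      // narrows the window; it cannot close it.
      remoteProcessController.kill(RuntimeJobStatus.RUNNING);
    }
  }
}
```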

```java
    ((DataprocRuntimeJobDetail) jobDetail).getJobId(), statusDetails));
LOG.error("Dataproc Job {}", jobDetail.getStatus(), e);
// Check if the failure is due to attempting to cancel a job already DONE
if (jobDetail.getStatus() == RuntimeJobStatus.FAILED
    && statusDetails.contains("is not supported in the current state: DONE")) {
```
Contributor

@sahusanket sahusanket commented Nov 25, 2025
This Dataproc exception seems to be covered under FAILED_PRECONDITION, and we are already handling that in DataprocRuntimeJobManager.java#L923, so this check might not work.

We currently assume failure for all FAILED_PRECONDITION conditions; maybe we could add a more specific check there (a sketch follows below).
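
One possible shape for that narrower check, assuming the Dataproc client surfaces the CancelJob rejection as a gax `ApiException` with code `FAILED_PRECONDITION`. The message substring matched below mirrors the one checked in this PR and is an assumption about the API's wording; the class and method names are illustrative, not the actual DataprocRuntimeJobManager code.

```java
import com.google.api.gax.rpc.ApiException;
import com.google.api.gax.rpc.StatusCode;
import com.google.cloud.dataproc.v1.JobControllerClient;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical helper sketching the narrower FAILED_PRECONDITION check.
class DoneAwareJobCanceller {
  private static final Logger LOG = LoggerFactory.getLogger(DoneAwareJobCanceller.class);

  /** Cancels a Dataproc job, treating "already DONE" rejections as benign. */
  static void cancelJobSafely(JobControllerClient client, String projectId,
                              String region, String jobId) {
    try {
      client.cancelJob(projectId, region, jobId);
    } catch (ApiException e) {
      boolean failedPrecondition =
          e.getStatusCode().getCode() == StatusCode.Code.FAILED_PRECONDITION;
      boolean alreadyDone = e.getMessage() != null
          && e.getMessage().contains("is not supported in the current state: DONE");
      if (failedPrecondition && alreadyDone) {
        // The job finished before the cancel arrived: log and move on instead
        // of letting the provisioner mark the pipeline FAILED.
        LOG.warn("CancelJob on already-DONE Dataproc job {}; ignoring", jobId, e);
      } else {
        throw e;  // any other FAILED_PRECONDITION is still a real error
      }
    }
  }
}
```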

@sahusanket sahusanket changed the title from "Fix(b/460875216): Handle CancelJob on DONE Dataproc jobs gracefully" to "Fix(CDAP-21219): Handle CancelJob on DONE Dataproc jobs gracefully" on Nov 25, 2025.