
Conversation


@cjac cjac commented Nov 15, 2025

This commit addresses an issue where CDAP pipelines were incorrectly marked as FAILED when ephemeral Dataproc cluster deprovisioning attempted to cancel a job that had already completed.

The following changes are included:

1.  **RemoteExecutionTwillController:** Added a `RuntimeJobStatus` check before attempting to force kill a remote process in the `complete()` method's exception handler. This prevents sending a kill command to jobs already in a terminal state (see the sketch after this list).

2.  **AbstractDataprocProvisioner:** Modified `deleteClusterWithStatus` to specifically detect and handle the error the Dataproc API returns when a CancelJob request is made on a job in the DONE state. This error is now logged as a warning and no longer causes the pipeline to be marked as FAILED.

3.  **Unit Tests:** Added unit tests for both `RemoteExecutionTwillController` and `DataprocProvisioner` to verify the new logic and prevent regressions.

4.  **CONTRIBUTING.rst:** Updated the issues link to the current JIRA URL.

These changes ensure that the pipeline status accurately reflects the execution result even when there are timing issues during cluster deprovisioning.

Fixes: b/460875216
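
As a rough illustration of the first change, here is a minimal sketch of the guard in the `complete()` exception handler. It reuses only identifiers visible in this PR's diff (`remoteProcessController`, `programRunId`, `RuntimeJobStatus.RUNNING`); the branch shape and log messages are assumptions, not the merged code.

```java
// Sketch only: guard before force kill in RemoteExecutionTwillController's
// complete() exception handler. CDAP imports are omitted.
try {
  // ... wait for the remote job to reach a terminal state ...
} catch (Exception e) {
  // Re-read the status instead of killing unconditionally: the job may have
  // finished in the window between the last poll and this handler.
  // (getStatus() may itself throw; error handling is elided in this sketch.)
  RuntimeJobStatus currentStatus = remoteProcessController.getStatus();
  if (currentStatus == RuntimeJobStatus.RUNNING) {
    LOG.debug("Force termination of remote process for program run {}", programRunId);
    remoteProcessController.kill(RuntimeJobStatus.RUNNING);
  } else {
    // Already terminal: a CancelJob now would be rejected by Dataproc and
    // previously caused the pipeline to be marked FAILED.
    LOG.debug("Remote process for program run {} already in state {}; skipping kill",
        programRunId, currentStatus);
  }
}
```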

@cjac cjac force-pushed the fix/cdf-deprovision-race branch from f5a62b0 to 225ca01 on November 20, 2025 at 17:51.
```java
try {
  LOG.debug("Force termination of remote process for program run {}", programRunId);
  remoteProcessController.kill(RuntimeJobStatus.RUNNING);
  RuntimeJobStatus currentStatus = remoteProcessController.getStatus();
```
Contributor

The prior logic:

  • While the status is RUNNING, keep checking every second.
  • If it stays RUNNING for more than 5 seconds, throw an IllegalStateException.
  • So the moment the 5-second check expires, control goes straight to the catch block for force termination, without any gap.

I agree that within those few milliseconds the Dataproc job status could become DONE.

But with the new extra check, the gap where the error can occur still exists, so this intermittent wrongful killing of the pipeline would still happen.

My point is that there is not much of a time gap between the existing `getStatus() == RUNNING` observation and `remoteProcessController.kill()`, and the same is true of the extra check (a sketch of this window follows below).
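
For concreteness, here is a hedged reconstruction of the wait-then-kill sequence this comment describes, with the race window marked. The timer mechanics, method shape, and class name are assumptions drawn from the comment, not the actual CDAP source; `RemoteProcessController` and `RuntimeJobStatus` are the CDAP types from the diff, with imports omitted.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical reconstruction of the prior logic described above.
class WaitThenKillSketch {
  private final RemoteProcessController remoteProcessController;

  WaitThenKillSketch(RemoteProcessController controller) {
    this.remoteProcessController = controller;
  }

  void awaitTermination() throws Exception {
    long deadline = System.currentTimeMillis() + 5_000;  // 5-second budget
    try {
      while (remoteProcessController.getStatus() == RuntimeJobStatus.RUNNING) {
        if (System.currentTimeMillis() > deadline) {
          throw new IllegalStateException("Remote process still running after 5s");
        }
        TimeUnit.SECONDS.sleep(1);  // re-check once per second
      }
    } catch (IllegalStateException e) {
      // RACE WINDOW: the loop's last observation saw RUNNING, but the Dataproc
      // job can reach DONE before this kill lands. An extra status check here
      // narrows the window; it cannot close it.
      remoteProcessController.kill(RuntimeJobStatus.RUNNING);
    }
  }
}
```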

```java
    ((DataprocRuntimeJobDetail) jobDetail).getJobId(), statusDetails));
LOG.error("Dataproc Job {}", jobDetail.getStatus(), e);
// Check if the failure is due to attempting to cancel a job already DONE
if (jobDetail.getStatus() == RuntimeJobStatus.FAILED
    && statusDetails.contains("is not supported in the current state: DONE")) {
```
Contributor

@sahusanket sahusanket commented Nov 25, 2025
This Dataproc exception seems to be covered under FAILED_PRECONDITION, and we are already handling that in DataprocRuntimeJobManager.java#L923, so this check might not work.

We currently assume failure for all FAILED_PRECONDITION conditions; maybe we could add a more specific check there (a sketch follows below).
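
One possible shape for that narrower check, assuming the Dataproc client surfaces the CancelJob rejection as a gax `ApiException` with code `FAILED_PRECONDITION`. The message substring matched below mirrors the one checked in this PR and is an assumption about the API's wording; the class and method names are illustrative, not the actual DataprocRuntimeJobManager code.

```java
import com.google.api.gax.rpc.ApiException;
import com.google.api.gax.rpc.StatusCode;
import com.google.cloud.dataproc.v1.JobControllerClient;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical helper sketching the narrower FAILED_PRECONDITION check.
class DoneAwareJobCanceller {
  private static final Logger LOG = LoggerFactory.getLogger(DoneAwareJobCanceller.class);

  /** Cancels a Dataproc job, treating "already DONE" rejections as benign. */
  static void cancelJobSafely(JobControllerClient client, String projectId,
                              String region, String jobId) {
    try {
      client.cancelJob(projectId, region, jobId);
    } catch (ApiException e) {
      boolean failedPrecondition =
          e.getStatusCode().getCode() == StatusCode.Code.FAILED_PRECONDITION;
      boolean alreadyDone = e.getMessage() != null
          && e.getMessage().contains("is not supported in the current state: DONE");
      if (failedPrecondition && alreadyDone) {
        // The job finished before the cancel arrived: log and move on instead
        // of letting the provisioner mark the pipeline FAILED.
        LOG.warn("CancelJob on already-DONE Dataproc job {}; ignoring", jobId, e);
      } else {
        throw e;  // any other FAILED_PRECONDITION is still a real error
      }
    }
  }
}
```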

@sahusanket sahusanket changed the title from "Fix(b/460875216): Handle CancelJob on DONE Dataproc jobs gracefully" to "Fix(CDAP-21219): Handle CancelJob on DONE Dataproc jobs gracefully" on Nov 25, 2025.