Fix(CDAP-21219): Handle CancelJob on DONE Dataproc jobs gracefully #16066
Conversation
This commit addresses an issue where CDAP pipelines were incorrectly
marked as FAILED when ephemeral Dataproc cluster deprovisioning
attempted to cancel a job that had already completed.
The following changes are included:
1. **RemoteExecutionTwillController:** Added a RuntimeJobStatus check
before attempting to force kill a remote process in the `complete()`
method's exception handler. This prevents sending a kill command
to jobs already in a terminal state (see the sketch after this description).
2. **AbstractDataprocProvisioner:** Modified `deleteClusterWithStatus`
to specifically detect and handle the error returned by the Dataproc
API when a CancelJob request is made on a job in the DONE state.
This error is now logged as a warning and does not cause the
pipeline to be marked as FAILED.
3. **Unit Tests:** Added new unit tests for both
`RemoteExecutionTwillController` and `DataprocProvisioner` to
verify the new logic and prevent regressions.
4. **CONTRIBUTING.rst:** Updated the issues link to the current JIRA URL.
These changes ensure that the pipeline status accurately reflects the
execution result even if there are timing issues during cluster
deprovisioning.
Fixes: b/460875216
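
For reference, a minimal sketch of the check described in item 1. The `getStatus()` and `kill(RuntimeJobStatus.RUNNING)` calls mirror what the diff shows, but the method name `forceKillIfStillRunning` and the `isTerminated()` helper are assumptions for illustration, not the actual implementation in this PR.

```java
// Sketch only, not the real CDAP code. The terminal-state helper
// (isTerminated()) and the method name are assumptions for illustration.
private void forceKillIfStillRunning() throws Exception {
  RuntimeJobStatus currentStatus = remoteProcessController.getStatus();
  if (currentStatus != null && currentStatus.isTerminated()) {
    // The job already reached a terminal state; sending a kill would fail
    // on the Dataproc side, so just log and return.
    LOG.debug("Skipping force termination for program run {}: job status is {}",
        programRunId, currentStatus);
    return;
  }
  LOG.debug("Force termination of remote process for program run {}", programRunId);
  remoteProcessController.kill(RuntimeJobStatus.RUNNING);
}
```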
try {
  LOG.debug("Force termination of remote process for program run {}", programRunId);
  remoteProcessController.kill(RuntimeJobStatus.RUNNING);
  RuntimeJobStatus currentStatus = remoteProcessController.getStatus();
The prior logic:
- While the status is RUNNING, keep checking every second.
- If this continues for more than 5 seconds, throw an IllegalStateException.
So the moment `getStatus` is called after the 5-second check, control falls into the catch block for force termination without any gap.
I agree that in those few milliseconds the Dataproc job status could become DONE.
But with the new extra check, the gap where the error can occur still exists, so this intermittent wrongful killing of the pipeline could still happen.
My point is that there is not much time between the existing `getStatus() == RUNNING` check and `remoteProcessController.kill()`, and the same is true of the extra check.
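
To make the timing argument concrete, here is a rough sketch of the kind of wait loop being described (poll every second, give up after about 5 seconds). The method name, constants, and structure are assumptions, not the actual CDAP code; it only illustrates where the race window sits.

```java
// Rough illustration of the pre-existing wait/timeout behavior described
// above; names and constants are assumptions, not the real implementation.
private void waitForTermination() throws Exception {
  long deadline = System.currentTimeMillis() + 5_000L;
  while (remoteProcessController.getStatus() == RuntimeJobStatus.RUNNING) {
    if (System.currentTimeMillis() > deadline) {
      // The caller's catch block reacts to this by force-killing. The race
      // window is here: the job can reach DONE between the last RUNNING
      // observation above and the kill() that the catch block then issues.
      throw new IllegalStateException("Remote process still running after 5s");
    }
    TimeUnit.SECONDS.sleep(1);
  }
}
```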
    ((DataprocRuntimeJobDetail) jobDetail).getJobId(), statusDetails));
LOG.error("Dataproc Job {}", jobDetail.getStatus(), e);
// Check if the failure is due to attempting to cancel a job already DONE
if (jobDetail.getStatus() == RuntimeJobStatus.FAILED
    && statusDetails.contains("is not supported in the current state: DONE")) {
This Dataproc exception appears to be covered under FAILED_PRECONDITION,
and we are already handling that in DataprocRuntimeJobManager.java#L923.
So, this check might not take effect.
We currently assume failure for every FAILED_PRECONDITION condition; maybe we can add a more specific check there instead.
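
One possible shape for that more specific check, assuming the Dataproc Java client surfaces the error as an `ApiException` with code `FAILED_PRECONDITION`; the exact cancel call, message text, and surrounding structure are illustrative assumptions, not the code in this PR or in DataprocRuntimeJobManager.

```java
// Illustrative only: a narrower handling of FAILED_PRECONDITION in the
// CancelJob path, as suggested above. Variable names and the message text
// are assumptions based on this PR's discussion.
try {
  jobControllerClient.cancelJob(projectId, region, jobId);
} catch (ApiException e) {
  boolean alreadyDone =
      e.getStatusCode().getCode() == StatusCode.Code.FAILED_PRECONDITION
          && e.getMessage() != null
          && e.getMessage().contains("is not supported in the current state: DONE");
  if (alreadyDone) {
    // The job finished before the cancel reached Dataproc; treat this as
    // benign instead of marking the pipeline run FAILED.
    LOG.warn("CancelJob skipped: Dataproc job is already DONE", e);
  } else {
    throw e;
  }
}
```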