[SPARK-52346][SQL] Declarative Pipeline DataflowGraph execution and event logging #51050
Open — SCHJonathan wants to merge 13 commits into apache:master from SCHJonathan:graph-execution
Conversation
(branch updated from 68d6d84 to 8470cb6)
sryza reviewed May 29, 2025
Review comments were left on:
- sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/graph/TriggeredGraphExecution.scala
- sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/graph/FlowExecution.scala
- sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/graph/GraphExecution.scala
What changes were proposed in this pull request?

As described in the Declarative Pipelines SPIP, after user code is parsed and its datasets and dataflows are represented in a DataflowGraph (from PR #51003), this PR adds the functionality to execute the corresponding workloads based on that DataflowGraph, in the following steps.

Step 1: Initialize the raw DataflowGraph

In PipelineExecution::runPipeline(), we first initialize the dataflow graph by topologically sorting its dependencies and inferring the expected metadata (e.g., schema) for each dataset (DataflowGraph::resolve()). We also run pre-flight validations to catch early errors such as circular dependencies or creating a streaming table from a batch data source (DataflowGraph::validate()).
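The resolve-and-validate pass amounts to a topological sort with cycle detection over the flow dependencies. Below is a minimal standalone sketch of that idea; the names (sortFlows, the plain dependency Map) are illustrative, not Spark's actual API:

```scala
// Hedged sketch: depth-first topological sort with cycle detection,
// analogous to what DataflowGraph::resolve()/validate() must do.
object GraphSort {
  // `deps` maps each flow to the set of upstream datasets it reads from.
  // Returns the flows in dependency order, or an error for a cycle.
  def sortFlows(deps: Map[String, Set[String]]): Either[String, List[String]] = {
    val visiting = scala.collection.mutable.Set[String]()       // current DFS path
    val done = scala.collection.mutable.LinkedHashSet[String]() // finished, in order

    def visit(node: String): Either[String, Unit] =
      if (done.contains(node)) Right(())
      else if (visiting.contains(node)) Left(s"circular dependency involving '$node'")
      else {
        visiting += node
        val result = deps.getOrElse(node, Set.empty[String]).toList.sorted
          .foldLeft[Either[String, Unit]](Right(())) { (acc, up) =>
            acc.flatMap(_ => visit(up))
          }
        visiting -= node
        result.map { _ => done += node; () }
      }

    deps.keys.toList.sorted
      .foldLeft[Either[String, Unit]](Right(())) { (acc, n) => acc.flatMap(_ => visit(n)) }
      .map(_ => done.toList)
  }
}
```

A cycle surfaces as a Left, which corresponds to the circular-dependency pre-flight error mentioned above.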
Step 2: Materialize the datasets defined in the DataflowGraph as empty tables

Once the graph is topologically sorted and validated and every dataset/flow has correct metadata populated, we publish the corresponding datasets to the metastore (which could be Hive, Unity Catalog, or others) in DatasetManager::materializeDatasets(). For example, for each Materialized View and Streaming Table, an empty table is registered in the metastore with the correct metadata (e.g., table schema and table properties).
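The materialization step can be pictured as registering metadata-only entries before any data is written. A toy sketch follows, using an in-memory Map as a stand-in for the metastore; TableSpec and materialize are hypothetical names, not the real DatasetManager API:

```scala
// Hedged sketch: register each dataset's metadata ("empty table") before
// any data flows, analogous to DatasetManager::materializeDatasets().
// The in-memory Map stands in for Hive/Unity Catalog.
object DatasetMaterializer {
  final case class TableSpec(
      name: String,
      schema: Map[String, String],                 // column name -> type, e.g. "id" -> "BIGINT"
      properties: Map[String, String] = Map.empty)

  // Registers every spec as metadata only (no rows yet), failing fast on
  // duplicate names and unresolved schemas, as a real metastore would.
  def materialize(specs: Seq[TableSpec]): Map[String, TableSpec] =
    specs.foldLeft(Map.empty[String, TableSpec]) { (store, spec) =>
      require(!store.contains(spec.name), s"table ${spec.name} already exists")
      require(spec.schema.nonEmpty, s"table ${spec.name} has no resolved schema")
      store + (spec.name -> spec)
    }
}
```

The key design point mirrored here is ordering: schemas must already be resolved (Step 1) before registration, and registration happens before any data is populated (Step 3).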
Step 3: Populate data into the registered empty tables

After the datasets have been registered in the metastore, TriggeredGraphExecution transforms each dataflow defined in the DataflowGraph into an actual execution plan that runs the workload and populates the empty table (DataflowGraph::flow is transformed into a FlowExecution through FlowPlanner). FlowExecution currently has two types:

- BatchFlowExecution
- StreamingFlowExecution

Each FlowExecution is executed in topological order based on the sorted DataflowGraph, and execution is parallelized as much as possible. An execution can fail for different kinds of errors: some are transient, while others are fatal and require user action (e.g., an incompatible table schema evolution). Proper retries are in place for transient errors, and event logs are emitted to indicate the current state of the execution. These event logs are eventually surfaced in the CLI console.
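The transient-vs-fatal retry behavior described above can be sketched as a small retry loop; runWithRetries and the FlowError ADT here are illustrative, not Spark's actual classes:

```scala
// Hedged sketch of the retry policy: transient failures are re-queued up to
// a maximum retry count; fatal failures stop immediately.
object RetryPolicy {
  sealed trait FlowError
  case object Transient extends FlowError // e.g. a recoverable infra hiccup
  case object Fatal extends FlowError     // e.g. incompatible schema evolution

  // Runs `attempt` until it succeeds, fails fatally, or transient retries
  // are exhausted. Returns (attempts made, succeeded?).
  def runWithRetries(maxRetries: Int)(attempt: Int => Either[FlowError, Unit]): (Int, Boolean) = {
    def loop(tries: Int): (Int, Boolean) =
      attempt(tries) match {
        case Right(())   => (tries, true)   // flow completed
        case Left(Fatal) => (tries, false)  // logged at ERROR level, no retry
        case Left(Transient) if tries <= maxRetries =>
          loop(tries + 1)                   // logged at WARN level, re-queued
        case Left(Transient) => (tries, false) // retries exhausted: ERROR level
      }
    loop(1)
  }
}
```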
More details on the event logs:

- RunProgress::STARTED: logged to indicate that the user-configured pipeline has started running.
- FlowProgress::QUEUED: initially, every dataflow is queued for execution, except those explicitly excluded by the user (via the --selective-refresh parameter when starting a pipeline from the CLI).
- FlowProgress::EXCLUDED: logged if a dataflow is excluded from execution.
- FlowProgress::STARTING: logged when a dataflow is popped from the job queue and is ready to be planned.
- FlowProgress::PLANNING: logged when a dataflow starts being planned and transformed into a FlowExecution.
- FlowProgress::RUNNING: logged when a FlowExecution has started and the corresponding workload is running.
- FlowProgress::COMPLETED: logged when a FlowExecution has completed successfully.
- FlowProgress::FAILED: logged when a FlowExecution has failed due to a transient or fatal error. For a transient error, the event is logged at WARN level and the execution is re-queued for retry. For a fatal error, or when transient errors have exhausted all retry attempts, the event is logged at ERROR level and the entire execution is ready to stop.
- FlowProgress::SKIPPED: logged when an upstream FlowExecution has failed fatally; all downstream FlowExecutions are skipped.
- FlowProgress::STOPPED: logged when the user explicitly stops the pipeline while it is still running; all actively running FlowExecutions are interrupted and stopped.

When all FlowExecutions have completed, or one of them has failed fatally, the pipeline reaches a terminal state and a corresponding event is emitted:

- RunProgress::COMPLETED: logged when all FlowExecutions have completed successfully.
- RunProgress::FAILED: logged when one (or more) FlowExecutions failed fatally and the pipeline execution is aborted.
- RunProgress::CANCELED: logged when the user explicitly stops the pipeline; all running flows are interrupted and stopped (a FlowProgress::STOPPED
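The event states above can be summarized as a pair of ADTs. This mirrors only the state names listed in this description, not the actual event classes in the PR:

```scala
// Hedged sketch of the event taxonomy described above.
object PipelineEvents {
  sealed trait FlowProgress
  object FlowProgress {
    case object QUEUED extends FlowProgress
    case object EXCLUDED extends FlowProgress
    case object STARTING extends FlowProgress
    case object PLANNING extends FlowProgress
    case object RUNNING extends FlowProgress
    case object COMPLETED extends FlowProgress
    case object FAILED extends FlowProgress
    case object SKIPPED extends FlowProgress
    case object STOPPED extends FlowProgress
  }

  sealed trait RunProgress
  object RunProgress {
    case object STARTED extends RunProgress
    case object COMPLETED extends RunProgress
    case object FAILED extends RunProgress
    case object CANCELED extends RunProgress
  }

  // States from which a flow makes no further progress. Note: FAILED is
  // terminal only for fatal errors; a transiently failed flow re-enters
  // QUEUED per the retry behavior described above.
  def isTerminal(s: FlowProgress): Boolean = s match {
    case FlowProgress.COMPLETED | FlowProgress.FAILED | FlowProgress.SKIPPED |
         FlowProgress.STOPPED | FlowProgress.EXCLUDED => true
    case _ => false
  }
}
```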
event is emitted for each).

Why are the changes needed?
This PR implements the core functionality for executing a Declarative Pipeline.
Does this PR introduce any user-facing change?
Yes, see the first section.
How was this patch tested?
New unit test suite:

- TriggeredGraphExecutionSuite: tests end-to-end executions of the pipeline under different scenarios (happy path, failure path, etc.) and validates that the proper data is written and the proper event logs are emitted.

Augmented existing test suites:

- ConstructPipelineEventSuite and PipelineEventSuite: validate the new FlowProgress event logs introduced here.

Was this patch authored or co-authored using generative AI tooling?
No