[SPARK-52346][SQL] Declarative Pipeline DataflowGraph execution and event logging #51050

Open · wants to merge 13 commits into master

Conversation

@SCHJonathan commented May 29, 2025

What changes were proposed in this pull request?

As described in the Declarative Pipelines SPIP, after we parse the user's code and represent datasets and dataflows in a DataflowGraph (from PR #51003), this PR adds the functionality to execute the corresponding workloads based on the DataflowGraph, in the following steps.

Step 1: Initialize the raw DataflowGraph
In PipelineExecution::runPipeline(), we first initialize the dataflow graph by topologically sorting the dependencies and figuring out the expected metadata (e.g., schema) for each dataset (DataflowGraph::resolve()). We also run pre-flight validations to catch early errors such as circular dependencies, creating a streaming table from a batch data source, etc. (DataflowGraph::validate()).
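
To make the resolution step concrete, below is a minimal, self-contained sketch of the topological sort and cycle detection that DataflowGraph::resolve() and DataflowGraph::validate() conceptually perform. The Graph type and topologicalSort helper here are hypothetical illustrations, not the PR's actual API:

```scala
import scala.collection.mutable

object GraphResolutionSketch {
  // Hypothetical representation: each dataset name maps to the upstream
  // datasets it reads from.
  type Graph = Map[String, Set[String]]

  /** Kahn's algorithm: dependency order on success, cyclic nodes on failure. */
  def topologicalSort(graph: Graph): Either[Set[String], Seq[String]] = {
    val nodes = graph.keySet ++ graph.values.flatten
    // Reverse adjacency: downstream(u) = datasets that read from u.
    val downstream = nodes.map(n => n -> graph.collect { case (d, ups) if ups(n) => d }).toMap
    val inDegree = mutable.Map.from(nodes.map(n => n -> graph.getOrElse(n, Set.empty[String]).size))
    val queue = mutable.Queue.from(nodes.filter(inDegree(_) == 0))
    val sorted = mutable.ArrayBuffer.empty[String]
    while (queue.nonEmpty) {
      val n = queue.dequeue()
      sorted += n
      downstream(n).foreach { d =>
        inDegree(d) -= 1
        if (inDegree(d) == 0) queue.enqueue(d)
      }
    }
    // Any node that never reaches in-degree 0 participates in a cycle.
    if (sorted.size == nodes.size) Right(sorted.toSeq) else Left(nodes -- sorted)
  }
}
```

For example, Map("mv" -> Set("src"), "src" -> Set.empty[String]) sorts to Seq("src", "mv"), while Map("a" -> Set("b"), "b" -> Set("a")) is reported back as a cycle, mirroring the circular-dependency validation mentioned above.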

Step 2: Materialize datasets defined in the DataflowGraph as empty tables
After the graph is topologically sorted and validated, and every dataset / flow has the correct metadata populated, we publish the corresponding datasets to the metastore (which could be Hive, UC, or others) in DatasetManager::materializeDatasets(). For example, for each Materialized View and Streaming Table, it registers an empty table in the metastore with the correct metadata (e.g., table schema, table properties, etc.).
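
As a rough illustration of what registering an empty table with the resolved metadata looks like, the snippet below uses Spark's public catalog API; the table name, schema, and format are made-up placeholders, and the PR's DatasetManager may go through a different code path (e.g., catalog plugins):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("materialize-sketch").getOrCreate()

// Hypothetical resolved metadata for one dataset in the graph (from step 1).
val tableName = "events_mv"
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("payload", StringType)))

// Register an empty table with the expected schema; data is populated later
// by the corresponding FlowExecution (step 3).
spark.catalog.createTable(tableName, "parquet", schema, Map.empty[String, String])
```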

Step 3: Populate data into the registered empty tables
After datasets have been registered in the metastore, inside TriggeredGraphExecution we transform each dataflow defined in the DataflowGraph into an actual execution plan that runs the workload and populates data into the empty table (we transform DataflowGraph::flow into a FlowExecution through the FlowPlanner).

FlowExecution currently comes in two types (sketched after this list):

  1. A dataflow representing a batch execution is converted into a BatchFlowExecution.
  2. A dataflow representing a streaming execution is converted into a StreamingFlowExecution.
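
The two flavors roughly map onto Spark's two public write paths. The helpers below are illustrative stand-ins under that assumption, not the PR's BatchFlowExecution / StreamingFlowExecution implementations:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

// Batch flavor: run the flow's plan once and write the result to the target table.
def runBatchFlow(df: DataFrame, target: String): Unit =
  df.write.mode("overwrite").saveAsTable(target)

// Streaming flavor: in a triggered run, process all currently available data and
// then stop (Trigger.AvailableNow), checkpointing progress along the way.
def runStreamingFlow(df: DataFrame, target: String, checkpoint: String): Unit =
  df.writeStream
    .option("checkpointLocation", checkpoint)
    .trigger(Trigger.AvailableNow())
    .toTable(target)
    .awaitTermination()
```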

Each FlowExecution is executed in topological order based on the sorted DataflowGraph, and we parallelize the execution as much as possible. Each execution can fail due to different types of errors: some are transient, while others are fatal and require user action (e.g., incompatible table schema evolution). Proper retries are in place, and event logs are emitted to indicate the current state of the execution. These event logs are eventually surfaced in the CLI console.
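
Here is a minimal sketch of that transient-vs-fatal retry behavior, assuming a hypothetical FatalFlowException marker for non-retryable errors; the PR's actual error classification and retry policy live in TriggeredGraphExecution and may differ:

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical marker for errors that require user action and must not be retried.
final class FatalFlowException(msg: String) extends RuntimeException(msg)

/** Run a flow, retrying transient failures up to maxAttempts times. */
def executeWithRetries[T](maxAttempts: Int)(flow: => T): Try[T] = {
  def attempt(n: Int): Try[T] = Try(flow) match {
    case s @ Success(_)                     => s
    case f @ Failure(_: FatalFlowException) => f // fatal: fail immediately
    case Failure(e) if n < maxAttempts =>
      // Transient: log at WARN level and re-queue for another attempt.
      println(s"WARN  FlowProgress::FAILED (transient, attempt $n): ${e.getMessage}")
      attempt(n + 1)
    case f @ Failure(e) =>
      // Retries exhausted: log at ERROR level; the run is about to be stopped.
      println(s"ERROR FlowProgress::FAILED (retries exhausted): ${e.getMessage}")
      f
  }
  attempt(1)
}
```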

More details on the event logs (the flow-level states are also summarized in an ADT sketch after this list):

  1. RunProgress::STARTED: logged to indicate that we started running the user-configured pipeline.
  2. FlowProgress::QUEUED: initially, all dataflows are queued for execution, except those explicitly marked by users as excluded from execution (using the --selective-refresh parameter when starting a pipeline from the CLI).
  3. FlowProgress::EXCLUDED: logged if a dataflow is excluded from execution
  4. FlowProgress::STARTING: logged when a dataflow is popped from the job queue and is ready to be planned
  5. FlowProgress::PLANNING: logged when a dataflow starts being planned and transformed into a FlowExecution.
  6. FlowProgress::RUNNING: logged when a FlowExecution has started and the corresponding workload is running.
  7. FlowProgress::COMPLETED: logged when a FlowExecution has completed successfully.
  8. FlowProgress::FAILED: logged when a FlowExecution has failed due to a transient or fatal error. For a transient error, the event is logged at WARN level and the execution is re-queued for retry. For a fatal error, or a transient error that has exhausted all retry attempts, the event is logged at ERROR level and we prepare to stop the entire execution.
  9. FlowProgress::SKIPPED: logged when an upstream FlowExecution has failed fatally; all downstream FlowExecutions are skipped.
  10. FlowProgress::STOPPED: logged when users explicitly stop the pipeline while it is still running. All actively running FlowExecutions are interrupted and stopped, and this event is emitted.
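
As a reading aid, the flow-level states above can be pictured as a small ADT. This is illustrative only; the PR's real event classes carry richer payloads (origin, timestamps, error details):

```scala
// Illustrative only: the flow-level lifecycle states described above.
sealed trait FlowProgress
object FlowProgress {
  case object Queued    extends FlowProgress
  case object Excluded  extends FlowProgress
  case object Starting  extends FlowProgress
  case object Planning  extends FlowProgress
  case object Running   extends FlowProgress
  case object Completed extends FlowProgress
  case object Skipped   extends FlowProgress
  case object Stopped   extends FlowProgress
  // transient => WARN + re-queue; fatal or retries exhausted => ERROR + stop.
  final case class Failed(transient: Boolean) extends FlowProgress
}
```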

When all FlowExecutions have completed, or one of them has failed fatally, the pipeline reaches a terminal state and a corresponding event log is emitted (a derivation sketch follows this list):

  1. RunProgress::COMPLETED: logged when all the FlowExecutions have completed successfully.
  2. RunProgress::FAILED: logged when one (or more) of the FlowExecutions failed fatally and the pipeline execution is aborted.
  3. RunProgress::CANCELED: logged when the user explicitly stops the pipeline; all running flows are interrupted and stopped (a FlowProgress::STOPPED event log is emitted).
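
Reusing the FlowProgress sketch above, the terminal run state can be derived from the flows' final states roughly as follows (a hypothetical helper, not code from this PR):

```scala
def terminalRunState(finalStates: Seq[FlowProgress], canceledByUser: Boolean): String =
  if (canceledByUser) "RunProgress::CANCELED" // user stopped the run
  else if (finalStates.exists(_.isInstanceOf[FlowProgress.Failed]))
    "RunProgress::FAILED"                     // at least one fatal failure aborts the run
  else "RunProgress::COMPLETED"               // every flow completed successfully
```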

Why are the changes needed?

This PR implements the core functionality for executing a Declarative Pipeline.

Does this PR introduce any user-facing change?

Yes, see the first section.

How was this patch tested?

New unit test suite:

  • TriggeredGraphExecutionSuite: tests end-to-end executions of the pipeline under different scenarios (happy path, failure path, etc.) and validates that the proper data has been written and the proper event logs are emitted.

Augmented existing test suites:

  • ConstructPipelineEventSuite and PipelineEventSuite: extended to validate the new FlowProgress event logs we're introducing.

Was this patch authored or co-authored using generative AI tooling?

No

@sryza sryza self-assigned this May 29, 2025
@SCHJonathan SCHJonathan requested a review from sryza May 30, 2025 17:22