[SPARK-52283][SQL] Declarative Pipelines DataflowGraph creation and resolution #51003

Open — wants to merge 12 commits into master

Conversation

@aakash-db (Contributor) commented May 23, 2025

What changes were proposed in this pull request?

This PR introduces the DataflowGraph, a container for Declarative Pipelines datasets and flows, as described in the Declarative Pipelines SPIP. It also adds functionality for

  • Constructing a graph by registering a set of graph elements in succession (GraphRegistrationContext)
  • "Resolving" a graph, which means resolving each of the flows within a graph. Resolving a flow means:
    • Validating that its plan can be successfully analyzed
    • Determining the schema of the data it will produce
    • Determining what upstream datasets within the graph it depends on
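
A minimal, self-contained sketch of that resolution pass, with all names invented for illustration (Flow, ResolvedFlow, and resolve are not this PR's actual API; schema determination is omitted for brevity):

  import scala.util.Try

  final case class Flow(name: String, inputs: Seq[String])
  final case class ResolvedFlow(name: String, upstream: Seq[String])

  // Resolve every flow in the graph: each input must name a dataset that
  // exists in the graph, mirroring the "validate the plan / record upstream
  // dependencies" steps above.
  def resolve(datasets: Set[String], flows: Seq[Flow]): Try[Seq[ResolvedFlow]] = Try {
    flows.map { f =>
      val missing = f.inputs.filterNot(datasets.contains)
      require(missing.isEmpty, s"Flow ${f.name} references unknown datasets: $missing")
      ResolvedFlow(f.name, f.inputs)
    }
  }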

It also introduces various secondary changes:

  • Changes to SparkBuild to support declarative pipelines.
  • Updates to the pom.xml for the module.
  • New error conditions.

Why are the changes needed?

In order to implement Declarative Pipelines.

Does this PR introduce any user-facing change?

No changes to existing behavior.

How was this patch tested?

New test suites:

  • ConnectValidPipelineSuite – test cases where the graph can be successfully resolved
  • ConnectInvalidPipelineSuite – test cases where the graph fails to be resolved

Was this patch authored or co-authored using generative AI tooling?

No

@sryza changed the title from "[SPARK-52283][CONNECT] SDP DataflowGraph creation and resolution" to "[SPARK-52283][CONNECT] Declarative Pipelines DataflowGraph creation and resolution" on May 23, 2025
@sryza self-requested a review on May 23, 2025
@sryza self-assigned this on May 23, 2025

@jonmio left a comment: flushing some comments

@@ -2025,6 +2031,18 @@
],
"sqlState" : "42613"
},
"INCOMPATIBLE_BATCH_VIEW_READ": {
"message": [
"View <datasetIdentifier> is not a streaming view and must be referenced using read. This check can be disabled by setting Spark conf pipelines.incompatibleViewCheck.enabled = false."

@jonmio: What is the purpose of this conf, and do we really need it?
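
For reference, the conf key comes straight from the error text above; presumably it would be toggled like any other Spark conf. A hypothetical usage sketch, assuming an active SparkSession bound to spark:

  // Hypothetical: disabling the batch-view read check named in the error message.
  spark.conf.set("pipelines.incompatibleViewCheck.enabled", "false")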

* @param upstreamNodes Upstream nodes for the node
* @return
*/
def processNode(node: GraphElement, upstreamNodes: Seq[GraphElement]): Seq[GraphElement] = {

@jonmio: Nit: document the return value. I'm especially curious why this is a Seq, and when processNode would return more than one element.


@aakash-db (Author): Right now, it's mostly for flexibility, in case one node maps to several in the future.
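
A toy illustration of the flexibility being described, with all names invented (this is not the PR's GraphElement hierarchy):

  sealed trait GraphElement
  final case class Table(name: String) extends GraphElement
  final case class Flow(name: String, destination: String) extends GraphElement

  def processNode(node: GraphElement, upstreamNodes: Seq[GraphElement]): Seq[GraphElement] =
    node match {
      // One-to-many: a node could expand into itself plus a derived element.
      case t: Table => Seq(t, Flow(s"${t.name}_flow", t.name))
      // Identity: today, most nodes map to exactly one element.
      case other => Seq(other)
    }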

@apache deleted a comment from aakash-db on May 27, 2025
@aakash-db requested a review from sryza on May 27, 2025
@aakash-db changed the title from "[SPARK-52283][CONNECT] Declarative Pipelines DataflowGraph creation and resolution" to "[SPARK-52283] Declarative Pipelines DataflowGraph creation and resolution" on May 27, 2025
@aakash-db changed the title from "[SPARK-52283] Declarative Pipelines DataflowGraph creation and resolution" to "[SPARK-52283][SQL] Declarative Pipelines DataflowGraph creation and resolution" on May 27, 2025
@sryza requested a review from cloud-fan on May 27, 2025
val materializedFlowIdentifiers: Set[TableIdentifier] = materializedFlows.map(_.identifier).toSet

/** Returns a [[Table]] given its identifier */
lazy val table: Map[TableIdentifier, Table] =

Contributor: TableIdentifier only supports a 3-level namespace. Shall we use Seq[String] to better support DS v2, which can have an arbitrary number of namespace levels?

Contributor: Seq[String] is a bit hard to use here. We can switch to the DS v2 API after we create an encapsulation class to match TableIdentifier.
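
A rough sketch of what such an encapsulation class might look like; GraphIdentifier is an invented name, and this is not code from the PR:

  import org.apache.spark.sql.catalyst.TableIdentifier

  // A thin wrapper over a multi-part name that can round-trip to TableIdentifier
  // (which supports at most catalog.database.table) while that bridge is needed.
  final case class GraphIdentifier(nameParts: Seq[String]) {
    require(nameParts.nonEmpty, "identifier must have at least one part")

    def toTableIdentifier: TableIdentifier = nameParts match {
      case Seq(tbl)          => TableIdentifier(tbl)
      case Seq(db, tbl)      => TableIdentifier(tbl, Some(db))
      case Seq(cat, db, tbl) => TableIdentifier(tbl, Some(db), Some(cat))
      case _ => throw new IllegalArgumentException(
        s"Cannot represent ${nameParts.mkString(".")} as a TableIdentifier")
    }
  }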

@aakash-db requested a review from jonmio on May 28, 2025
f.inputs.toSeq
.map(availableResolvedInputs(_))
.filter {
// Input is a flow implies that the upstream table is a View.

Contributor: I find this comment hard to understand. We are resolving a flow, and the input of a flow can be other flows? Why does that mean the upstream table is a view?


@aakash-db (Author) replied May 29, 2025: Because for views, we don't have a ViewInput object; we instead map the view's downstream directly to the view's upstream. Thus, if a flow's upstream is a flow, the node that flow writes to is a view. If a flow's upstream were a table, there would be a TableInput object here instead.
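
A toy sketch of that pass-through rule, with all types invented for illustration: tables surface as TableInput nodes, while views contribute no input node of their own, so a reader of a view sees the flow that writes the view.

  sealed trait Input
  final case class TableInput(name: String) extends Input
  final case class Flow(name: String, destination: String) extends Input

  def describe(input: Input): String = input match {
    case TableInput(n) => s"upstream dataset $n is a table"
    case Flow(_, dest) => s"upstream dataset $dest is a view (resolved through its flow)"
  }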

@sryza (Contributor) commented May 30, 2025:

@aakash-db I'm still seeing some test failures:

@cloud-fan (Contributor): Yeah, the docker test is generally flaky, and we can ignore it.

// 1. If table is present in context.fullRefreshTables
// 2. If table has any virtual inputs (flows or tables)
// 3. If the table pre-existing metadata is different from current metadata
val virtualTableInput = VirtualTableInput(

Contributor: We need comments to explain why all table inputs are virtual.

/**
* Returns a [[TableInput]], if one is available, that can be read from by downstream flows.
*/
def tableInput(identifier: TableIdentifier): Option[TableInput] = table.get(identifier)

Contributor: If tables are both inputs and outputs, shall we just have a single name-to-table map, instead of an output map plus a tableInput method?

Contributor: Actually, there is already a table map here. Why do we need the output map, which is identical to the table map but just gives a different error message for duplicates?

Contributor: Hmm, yeah, it looks like we do not need a separate tableInput, and we should take it out. I suspect this was a relic of an older version of the code where a subclass of DataflowGraph overrode tableInput.

output is a little forward-looking: we will eventually allow DataflowGraphs to have sinks (described in the SPIP but not yet implemented here), and output will include both sinks and tables. However, we can leave it out for now and add it back when we add sinks.
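
A rough sketch of that forward-looking shape, with Sink and the surrounding types invented for illustration (the SPIP describes sinks, but this PR does not implement them):

  // Hypothetical: once sinks exist, output unifies both kinds of destinations,
  // while only tables remain readable as inputs to downstream flows.
  sealed trait Output { def name: String }
  final case class Table(name: String) extends Output
  final case class Sink(name: String, format: String) extends Output

  final case class DataflowGraph(outputs: Seq[Output]) {
    val output: Map[String, Output] = outputs.map(o => o.name -> o).toMap

    def tableInput(name: String): Option[Table] =
      output.get(name).collect { case t: Table => t }
  }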
