[SPARK-52283][SQL] Declarative Pipelines DataflowGraph creation and resolution #51003
base: master

Conversation
flushing some comments
@@ -2025,6 +2031,18 @@
      ],
      "sqlState" : "42613"
    },
    "INCOMPATIBLE_BATCH_VIEW_READ": {
      "message": [
        "View <datasetIdentifier> is not a streaming view and must be referenced using read. This check can be disabled by setting Spark conf pipelines.incompatibleViewCheck.enabled = false."
What is the purpose of this conf and do we really need it?
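For context on the escape hatch being questioned, here is a minimal self-contained sketch of the usual shape of a conf-gated validation. The object and method names are hypothetical illustrations, not the actual Spark implementation:

```scala
// Illustrative sketch only: a validation that users can opt out of via a conf
// flag, as the error message above suggests. Names here are hypothetical.
object IncompatibleViewCheckSketch {
  def violatesBatchViewRead(
      conf: Map[String, String],
      isStreamingView: Boolean,
      referencedWithRead: Boolean): Boolean = {
    // The check is on by default; setting the conf to "false" disables it.
    val checkEnabled =
      conf.getOrElse("pipelines.incompatibleViewCheck.enabled", "true").toBoolean
    // A violation: a batch (non-streaming) view referenced without read().
    checkEnabled && !isStreamingView && !referencedWithRead
  }
}
```

With the conf unset, reading a batch view without `read` is flagged; setting `pipelines.incompatibleViewCheck.enabled = false` suppresses the check entirely, which is presumably what the reviewer is asking about the need for.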
 * @param upstreamNodes Upstream nodes for the node
 * @return
 */
def processNode(node: GraphElement, upstreamNodes: Seq[GraphElement]): Seq[GraphElement] = {
Nit: Document return. I'm especially curious why this is a Seq and when processNode would return more than one element
Right now, it's mostly just for flexibility - in case one node maps to several in the future.
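A toy sketch of the flexibility being described, i.e. why a per-node processor returning `Seq` leaves room for one element to expand into several. `GraphElement`, `Flow`, and `Table` here are simplified stand-ins, not the actual pipeline classes:

```scala
// Toy stand-ins for the graph element hierarchy (not Spark's actual classes).
sealed trait GraphElement { def name: String }
final case class Flow(name: String) extends GraphElement
final case class Table(name: String) extends GraphElement

def processNode(node: GraphElement, upstreamNodes: Seq[GraphElement]): Seq[GraphElement] =
  node match {
    // Today most nodes would map 1:1...
    case f: Flow => Seq(f)
    // ...but the Seq return type allows 1:N expansion later, e.g. a table
    // that also materializes an implicit backing flow during processing.
    case t: Table => Seq(Flow(s"${t.name}_flow"), t)
  }
```

Here `processNode(Table("t"), Nil)` yields two elements, illustrating the hypothetical "one node maps to several" case.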
val materializedFlowIdentifiers: Set[TableIdentifier] = materializedFlows.map(_.identifier).toSet

/** Returns a [[Table]] given its identifier */
lazy val table: Map[TableIdentifier, Table] =
`TableIdentifier` only supports a 3-level namespace. Shall we use `Seq[String]` to better support DS v2, which can have an arbitrary level of namespace?
`Seq[String]` is a bit hard to use here. We can switch to the DS v2 API after we create an encapsulation class to match `TableIdentifier`.
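A possible shape for the encapsulation class mentioned above: a `Seq[String]`-backed identifier that still offers `TableIdentifier`-style accessors for the common 3-level case. This is a hypothetical sketch, not code from the PR:

```scala
// Hypothetical multi-part identifier: arbitrary namespace depth (DS v2 style),
// with TableIdentifier-like accessors for the trailing catalog/database/table.
final case class MultipartIdentifier(nameParts: Seq[String]) {
  require(nameParts.nonEmpty, "identifier must have at least one part")

  def table: String = nameParts.last
  def database: Option[String] =
    if (nameParts.length >= 2) Some(nameParts(nameParts.length - 2)) else None
  def catalog: Option[String] =
    if (nameParts.length >= 3) Some(nameParts(nameParts.length - 3)) else None

  // Backtick-quote each part, as Spark does when rendering identifiers.
  def quotedString: String = nameParts.map(p => s"`$p`").mkString(".")
}
```

For example, `MultipartIdentifier(Seq("cat", "db", "t")).quotedString` renders as `` `cat`.`db`.`t` ``, while deeper namespaces simply carry extra leading parts.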
f.inputs.toSeq
  .map(availableResolvedInputs(_))
  .filter {
    // Input is a flow implies that the upstream table is a View.
I find it hard to understand this comment. We are resolving a flow, and the input of a flow can be other flows? Why does that mean the upstream table is a view?
Because for views, we don't have a `ViewInput` object; we instead map the view's downstream directly to the view's upstream. Thus, if a flow's upstream is a flow, the node it is writing to is a view. If a flow's upstream were a table, there would be a `TableInput` object here.
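The view-flattening behavior described in that reply can be sketched with a toy resolver: a reader of a view is rewired directly to the view's own inputs, so views never appear as input objects. Types and names here are illustrative stand-ins, and the sketch assumes view definitions are acyclic:

```scala
// Toy sketch: views have no input object of their own. Resolving a reader's
// inputs substitutes each view with that view's upstream inputs, recursively,
// until only tables remain. Assumes no cyclic view definitions.
def resolveInputs(
    directInputs: Seq[String],
    viewDefinitions: Map[String, Seq[String]]): Seq[String] =
  directInputs.flatMap { input =>
    viewDefinitions.get(input) match {
      // Input is a view: replace it with the view's own upstream inputs.
      case Some(viewUpstream) => resolveInputs(viewUpstream, viewDefinitions)
      // Input is a table: keep it as-is.
      case None => Seq(input)
    }
  }
```

So `resolveInputs(Seq("v"), Map("v" -> Seq("t1", "t2")))` yields `Seq("t1", "t2")`: the view `v` disappears and its readers see its upstream tables directly, matching the "map the view's downstream directly to the view's upstream" description.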
I'm still seeing some test failures.

Yeah, the docker test is generally flaky and we can ignore it.
// 1. If table is present in context.fullRefreshTables
// 2. If table has any virtual inputs (flows or tables)
// 3. If the table's pre-existing metadata is different from current metadata
val virtualTableInput = VirtualTableInput(
We need comments to explain why all table inputs are virtual
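The three conditions enumerated in the snippet's comment can be read as a single predicate. The following is an illustrative encoding under assumed types (`TableState` and `mustBeVirtual` are hypothetical, not from the PR):

```scala
// Hypothetical encoding of the three "virtual input" conditions above.
final case class TableState(
    identifier: String,
    hasVirtualInputs: Boolean,        // condition 2: any flow/table input is virtual
    preExistingMetadata: Option[String], // None when the table doesn't exist yet
    currentMetadata: String)

def mustBeVirtual(table: TableState, fullRefreshTables: Set[String]): Boolean =
  fullRefreshTables.contains(table.identifier) || // 1. table is fully refreshed
  table.hasVirtualInputs ||                       // 2. upstream inputs are virtual
  // 3. pre-existing metadata differs from current (a missing table counts as
  //    differing, since Option.contains is false for None).
  !table.preExistingMetadata.contains(table.currentMetadata)
```

Any one condition being true forces the table input to be virtual; the reviewer's point stands that the "why" behind each condition deserves a comment in the real code.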
/**
 * Returns a [[TableInput]], if one is available, that can be read from by downstream flows.
 */
def tableInput(identifier: TableIdentifier): Option[TableInput] = table.get(identifier)
If tables are both input and output, shall we just have a single name-to-table map, instead of an `output` map and a `tableInput` method?
Actually, there is already a `table` map here. Why do we need the `output` map, which is identical to the `table` map but just gives a different error message for duplication?
Hmm, yeah, it looks like we do not need a separate `tableInput`, and we should take this out. I suspect this was a relic of an older version of the code where there was a subclass of `DataflowGraph` that overrode `tableInput`.

`output` is a little bit forward-looking. We will eventually allow `DataflowGraph`s to have sinks (described in the SPIP but not yet implemented here), and `output` will include both sinks and tables. However, we can leave it out for now and add it back when we add sinks.
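The simplification being agreed on above — one identifier-to-table map serving both duplicate detection and input lookup — could look roughly like this toy sketch (`Table` and `Graph` are simplified stand-ins for the pipeline graph classes):

```scala
// Toy stand-in for a table in the graph (not Spark's actual class).
final case class Table(identifier: String)

final case class Graph(tables: Seq[Table]) {
  /** Single source of truth; fails fast on duplicate identifiers. */
  lazy val table: Map[String, Table] = {
    val grouped = tables.groupBy(_.identifier)
    val dups = grouped.collect { case (id, ts) if ts.size > 1 => id }
    require(dups.isEmpty, s"duplicate table identifiers: ${dups.mkString(", ")}")
    grouped.map { case (id, ts) => id -> ts.head }
  }

  /** Input lookup is just a read on the same map — no separate output map. */
  def tableInput(identifier: String): Option[Table] = table.get(identifier)
}
```

A sink-aware `output` map, when sinks land, would then be a distinct structure rather than a duplicate of `table`.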
What changes were proposed in this pull request?

This PR introduces the `DataflowGraph`, a container for Declarative Pipelines datasets and flows, as described in the Declarative Pipelines SPIP. It also adds functionality for registering graph elements (`GraphRegistrationContext`).

It also introduces various secondary changes:
- Updates `SparkBuild` to support declarative pipelines.
- Adds a `pom.xml` for the module.

Why are the changes needed?

In order to implement Declarative Pipelines.

Does this PR introduce any user-facing change?

No changes to existing behavior.

How was this patch tested?

New test suites:
- `ConnectValidPipelineSuite` – test cases where the graph can be successfully resolved
- `ConnectInvalidPipelineSuite` – test cases where the graph fails to be resolved

Was this patch authored or co-authored using generative AI tooling?

No