-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-53687][SQL][SS][SDP] Introduce WATERMARK clause in SQL statement #52428
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
cc. @sryza @cloud-fan @viirya Also, do we have test purposed TVF definition for "streaming source"? I couldn't add E2E test for STREAM TVF + WATERMARK and it'd be ideal if we could add one. |
cc. @anishshri-db as well |
"Doesn't support month or year interval: <interval>" | ||
] | ||
}, | ||
"_LEGACY_ERROR_TEMP_3263" : { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: Need to assign one error class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: remove this
|FROM | ||
| stream_left WATERMARK leftTime DELAY OF INTERVAL 0 second | ||
|FULL OUTER JOIN | ||
| stream_right WATERMARK rightTime DELAY OF INTERVAL 0 second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we require the watermark to be the same between two join sides? what's the corresponding dataframe query?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that's just for simplification. You can have different definition of watermark including delay threshold. e.g. you can define watermark delay to 10 minutes in stream_left while defining delay to 5 minutes in stream_right.
The corresponding DataFrame query is following:
stream_left <- streaming DF
stream_right <- streaming DF
val leftDf = stream_left.withWatermark("leftTime, "0 second")
val rightDf = stream_right.withWatermark("rightTime", "0 second")
val query = leftDf.join(
rightDf,
expr("leftKey = rightKey AND leftTime BETWEEN rightTime - INTERVAL 5 SECONDS"),
"full_outer")
query.writeStream.blabla.start()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the impression that, with DataFrame API, we can set watermark for all streaming operators, not only the streaming scans. Is this true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can set watermark "before" the first stateful operator has appeared. But conceptually, it is advised to define watermark closer to stream scan. The concept of watermark is to mark that the engine has seen (processed) all the data before the timestamp ts, which is closely related to the input data from stream scan.
optionsClause? sample? tableAlias #tableName | ||
| LEFT_PAREN query RIGHT_PAREN sample? tableAlias #aliasedQuery | ||
| LEFT_PAREN relation RIGHT_PAREN sample? tableAlias #aliasedRelation | ||
optionsClause? sample? watermarkClause? tableAlias #tableName |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
watermarkClause
is already defined in streamRelationPrimary
. Do we still need it here? Is it also applied for non-stream relation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
relation here can be a temp view which could be technically streaming without STREAM keyword
: funcName=functionName LEFT_PAREN | ||
(functionTableArgument (COMMA functionTableArgument)*)? | ||
RIGHT_PAREN tableAlias | ||
RIGHT_PAREN watermarkClause? tableAlias |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the doc https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-qry-select-watermark, it looks like table_valued_function
has no watermark_clause
support, but we want to have it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ouch I think it's missed in that doc.
} | ||
|
||
// TODO: This seems to need to find the new expression by index. We can't rely on exprId | ||
// due to UnresolvedAlias. Are there better ways to do? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a new logical plan like UnresolvedEventTimeWatermark
, which will be rewritten into EventTimeWatermark
+ Project
during analysis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed. PTAL!
case e: Expression => UnresolvedAlias(e) | ||
} | ||
|
||
val isAttributeReference = namedExpression.isInstanceOf[AttributeReference] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan
Please help verify what I'm doing is correct.
What EventTimeWatermark node does is simply to attach metadata marker against the column (produced by child.output) referred by the expression. The node itself does not do any projection and just updates the accumulators and does the passthrough. If the eventTimeColExpr has to be evaluated, we have to inject Project to evaluate the expression, which we expect the result to be explicitly named.
Hence I skip injecting Project only if the expression is an AttributeReference and the column is available in the child.output, otherwise I simply add Project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it doesn't matter. We can also always add the Project, as optimizer can remove it.
override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning( | ||
_.containsPattern(TreePattern.UNRESOLVED_EVENT_TIME_WATERMARK), ruleId) { | ||
|
||
case u: UnresolvedEventTimeWatermark if u.childrenResolved => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we wait for u.eventTimeColExpr.resolved
?
} | ||
|
||
val isAttributeReference = namedExpression.isInstanceOf[AttributeReference] | ||
val exprInChildOutput = u.child.output.exists(_.name == namedExpression.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's weird to search by name for resolved attributes, we can just do u.child.outputSet.contains(namedExpression)
|
||
final override val nodePatterns: Seq[TreePattern] = Seq(UNRESOLVED_EVENT_TIME_WATERMARK) | ||
|
||
private val delay = IntervalUtils.fromIntervalString(delayInterval.toString) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we do it in the parser? This logical plan should take CalendarInterval
directly.
// This is not allowed by default - WatermarkPropagator will throw an exception. We keep the | ||
// logic here because we also maintain the compatibility flag. (See | ||
// SQLConf.STATEFUL_OPERATOR_ALLOW_MULTIPLE for details.) | ||
// Should be defined as lazy so that attributes can be resolved before calling this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we still need this hack?
What changes were proposed in this pull request?
This PR proposes to introduce WATERMARK clause in SQL statement. WATERMARK clause is to define the watermark against the relation, especially streaming relation where STREAM keyword is added to the relation (or table valued function).
Please refer to the SQL reference doc for WATERMARK clause about its definition.
https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-qry-select-watermark
The PR also contains new tests which show how to use it.
Why are the changes needed?
This is needed to unblock the stateful workload in Streaming Table & Flow being described as SQL statement.
Does this PR introduce any user-facing change?
Yes, users can define the watermark for stateful workloads with SQL statement.
How was this patch tested?
New UTs.
Was this patch authored or co-authored using generative AI tooling?
No.