Skip to content

Conversation

HeartSaVioR
Copy link
Contributor

What changes were proposed in this pull request?

This PR proposes to introduce WATERMARK clause in SQL statement. WATERMARK clause is to define the watermark against the relation, especially streaming relation where STREAM keyword is added to the relation (or table valued function).

Please refer to the SQL reference doc for WATERMARK clause about its definition.
https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-qry-select-watermark

The PR also contains new tests which show how to use it.

Why are the changes needed?

This is needed to unblock the stateful workload in Streaming Table & Flow being described as SQL statement.

Does this PR introduce any user-facing change?

Yes, users can define the watermark for stateful workloads with SQL statement.

How was this patch tested?

New UTs.

Was this patch authored or co-authored using generative AI tooling?

No.

@HeartSaVioR
Copy link
Contributor Author

cc. @sryza @cloud-fan @viirya
Could you please take a look? Thanks!

Also, do we have test purposed TVF definition for "streaming source"? I couldn't add E2E test for STREAM TVF + WATERMARK and it'd be ideal if we could add one.

@HeartSaVioR HeartSaVioR changed the title [SPARK-53687][SQL][SS] Introduce WATERMARK clause in SQL statement [SPARK-53687][SQL][SS][SDP] Introduce WATERMARK clause in SQL statement Sep 24, 2025
@HeartSaVioR
Copy link
Contributor Author

cc. @anishshri-db as well

"Doesn't support month or year interval: <interval>"
]
},
"_LEGACY_ERROR_TEMP_3263" : {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Need to assign one error class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: remove this

@github-actions github-actions bot added the DOCS label Sep 24, 2025
|FROM
| stream_left WATERMARK leftTime DELAY OF INTERVAL 0 second
|FULL OUTER JOIN
| stream_right WATERMARK rightTime DELAY OF INTERVAL 0 second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we require the watermark to be the same between two join sides? what's the corresponding dataframe query?

Copy link
Contributor Author

@HeartSaVioR HeartSaVioR Sep 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's just for simplification. You can have different definition of watermark including delay threshold. e.g. you can define watermark delay to 10 minutes in stream_left while defining delay to 5 minutes in stream_right.

The corresponding DataFrame query is following:

stream_left <- streaming DF
stream_right <- streaming DF

val leftDf = stream_left.withWatermark("leftTime, "0 second")
val rightDf = stream_right.withWatermark("rightTime", "0 second")

val query = leftDf.join(
  rightDf,
  expr("leftKey = rightKey AND leftTime BETWEEN rightTime - INTERVAL 5 SECONDS"),
  "full_outer")

query.writeStream.blabla.start()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the impression that, with DataFrame API, we can set watermark for all streaming operators, not only the streaming scans. Is this true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can set watermark "before" the first stateful operator has appeared. But conceptually, it is advised to define watermark closer to stream scan. The concept of watermark is to mark that the engine has seen (processed) all the data before the timestamp ts, which is closely related to the input data from stream scan.

optionsClause? sample? tableAlias #tableName
| LEFT_PAREN query RIGHT_PAREN sample? tableAlias #aliasedQuery
| LEFT_PAREN relation RIGHT_PAREN sample? tableAlias #aliasedRelation
optionsClause? sample? watermarkClause? tableAlias #tableName
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

watermarkClause is already defined in streamRelationPrimary. Do we still need it here? Is it also applied for non-stream relation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

relation here can be a temp view which could be technically streaming without STREAM keyword

: funcName=functionName LEFT_PAREN
(functionTableArgument (COMMA functionTableArgument)*)?
RIGHT_PAREN tableAlias
RIGHT_PAREN watermarkClause? tableAlias
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the doc https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-qry-select-watermark, it looks like table_valued_function has no watermark_clause support, but we want to have it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch I think it's missed in that doc.

}

// TODO: This seems to need to find the new expression by index. We can't rely on exprId
// due to UnresolvedAlias. Are there better ways to do?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a new logical plan like UnresolvedEventTimeWatermark, which will be rewritten into EventTimeWatermark + Project during analysis.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed. PTAL!

case e: Expression => UnresolvedAlias(e)
}

val isAttributeReference = namedExpression.isInstanceOf[AttributeReference]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan
Please help verify what I'm doing is correct.

What EventTimeWatermark node does is simply to attach metadata marker against the column (produced by child.output) referred by the expression. The node itself does not do any projection and just updates the accumulators and does the passthrough. If the eventTimeColExpr has to be evaluated, we have to inject Project to evaluate the expression, which we expect the result to be explicitly named.

Hence I skip injecting Project only if the expression is an AttributeReference and the column is available in the child.output, otherwise I simply add Project.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't matter. We can also always add the Project, as optimizer can remove it.

override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning(
_.containsPattern(TreePattern.UNRESOLVED_EVENT_TIME_WATERMARK), ruleId) {

case u: UnresolvedEventTimeWatermark if u.childrenResolved =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we wait for u.eventTimeColExpr.resolved?

}

val isAttributeReference = namedExpression.isInstanceOf[AttributeReference]
val exprInChildOutput = u.child.output.exists(_.name == namedExpression.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's weird to search by name for resolved attributes, we can just do u.child.outputSet.contains(namedExpression)


final override val nodePatterns: Seq[TreePattern] = Seq(UNRESOLVED_EVENT_TIME_WATERMARK)

private val delay = IntervalUtils.fromIntervalString(delayInterval.toString)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do it in the parser? This logical plan should take CalendarInterval directly.

// This is not allowed by default - WatermarkPropagator will throw an exception. We keep the
// logic here because we also maintain the compatibility flag. (See
// SQLConf.STATEFUL_OPERATOR_ALLOW_MULTIPLE for details.)
// Should be defined as lazy so that attributes can be resolved before calling this.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still need this hack?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants