[SPARK-46625][SQL] Push WithCTE into CTEInChildren when materialised by ResolveIdentifierClause#55706
[SPARK-46625][SQL] Push WithCTE into CTEInChildren when materialised by ResolveIdentifierClause#55706stevomitric wants to merge 3 commits intoapache:masterfrom
Conversation
cloud-fan
left a comment
There was a problem hiding this comment.
Problem. WITH t AS (...) INSERT INTO/OVERWRITE TABLE IDENTIFIER(:p) SELECT * FROM t produces a structurally invalid analyzed plan: WithCTE(InsertIntoStatement(...), cteDefs). InsertIntoStatement is a CTEInChildren, and CTESubstitution.withCTEDefs (CTESubstitution.scala:466-471) explicitly maintains the invariant that no WithCTE may wrap a CTEInChildren — but the invariant is decided at substitution time, before ResolveIdentifierClause materializes the leaf PlanWithUnresolvedIdentifier into a CTEInChildren. Downstream consumers that re-analyze the query subtree see orphan CTERelationRefs and trip InlineCTE.buildCTEMap with NoSuchElementException.
Root cause. withIdentClause (AstBuilder.scala:97-107) lifts PlanWithUnresolvedIdentifier above the entire write/CTAS command, so CTESubstitution sees a leaf at that position and can't tell the leaf will resolve into a CTEInChildren. The PR addresses this by re-applying CTESubstitution.withCTEDefs's dispatch after materialization (a post-hoc collapse), but the structural cause is the parser placement — see the inline comment on ResolveIdentifierClause.scala for a redirect to fix this at the source.
| // `InsertIntoStatement`) inside an outer `WithCTE`, push the CTE defs into the command's | ||
| // children - restoring the invariant from `CTESubstitution.withCTEDefs`. | ||
| resolved.resolveOperatorsUpWithPruning(_.containsPattern(CTE)) { | ||
| case WithCTE(c: CTEInChildren, cteDefs) => c.withCTEDefs(cteDefs) |
There was a problem hiding this comment.
This patches the symptom rather than the root cause: withIdentClause lifts PlanWithUnresolvedIdentifier above the whole write/CTAS command, and we then have to undo the placement after materialization. Please change the parser to push the placeholder into the identifier slot of the produced plan (e.g. InsertIntoStatement.table, CreateTableAsSelect.name) instead of wrapping the entire command, and add a parallel handler in this rule — e.g.:
case i @ InsertIntoStatement(p: PlanWithUnresolvedIdentifier, _, _, _, _, _, _, _, _)
if p.identifierExpr.resolved =>
i.copy(table = executor.execute(p.planBuilder.apply(
IdentifierResolution.evalIdentifierExpr(p.identifierExpr), p.children)))Then CTESubstitution sees the actual CTEInChildren from the start and places WithCTE correctly — no post-hoc collapse needed, and the invariant is preserved by construction. The same shape applies to all withIdentClause call sites whose builder produces a CTEInChildren (INSERT, CTAS, RTAS, CACHE TABLE AS — AstBuilder.scala:911-1002, 5640, 5724, 6502). Downstream matchers like case InsertIntoStatement(LogicalRelationWithTable(_), ...) aren't affected because they run after this rule, by which point table is back to a normal resolved relation.
| } | ||
| } | ||
|
|
||
| test("Analyzed plan does not leave WithCTE wrapping a CTEInChildren " + |
There was a problem hiding this comment.
Once the fix moves into the parser per the comment on ResolveIdentifierClause.scala, please broaden the structural test to cover the other affected commands too — at minimum a WITH t AS (...) CREATE TABLE IDENTIFIER(:p) AS SELECT * FROM t variant, since CTAS goes through the same shape. Also worth knowing: the INSERT INTO/OVERWRITE smoke tests above pass even without the fix (eager command execution doesn't hit the re-analysis path that produces the NoSuchElementException); the structural assertion here is what actually anchors the regression.
What changes were proposed in this pull request?
After
ResolveIdentifierClausematerialises aPlanWithUnresolvedIdentifierinto aCTEInChildren(e.g.InsertIntoStatement), collapse any surroundingWithCTE(c: CTEInChildren, defs)intoc.withCTEDefs(defs)- restoring the placement invariantCTESubstitution.withCTEDefsalready enforces at substitution time.Why are the changes needed?
produces an analysed plan with
WithCTEwrapping theInsertIntoStatement- a structurally invalid shape. Plug-in datasources that re-analyse the InsertIntoStatement's query subtree and throwNoSuchElementException: key not found.Bug is zero-day issue since SPARK-46625
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New tests.
Was this patch authored or co-authored using generative AI tooling?
Yes.