[SPARK-56738][SQL] Normalize cteIds of orphan CTERelationRef in NormalizeCTEIds#55708
Closed
yaooqinn wants to merge 1 commit intoapache:masterfrom
Closed
[SPARK-56738][SQL] Normalize cteIds of orphan CTERelationRef in NormalizeCTEIds#55708yaooqinn wants to merge 1 commit intoapache:masterfrom
yaooqinn wants to merge 1 commit intoapache:masterfrom
Conversation
…lizeCTEIds
### What changes were proposed in this pull request?
Extend `NormalizeCTEIds` so that it normalizes `CTERelationRef.cteId` for
"orphan" references whose matching `CTERelationDef` is not present anywhere
in the input plan (i.e., the enclosing `WithCTE` lives outside the sub-tree
being normalized).
The change is intentionally minimal:
- `apply` first walks the plan to collect the ids of every reachable
`CTERelationDef`. References whose `cteId` is in that set are paired with
their def and remain handled by the existing `WithCTE` branch (no behaviour
change). References whose `cteId` is NOT in that set are treated as orphan
and pre-assigned a deterministic id from the same shared counter.
- A new `case ref: CTERelationRef` in `applyInternal` rewrites those orphan
refs to the assigned id; refs paired with a def are rewritten as before
inside `canonicalizeCTE` and never re-rewritten.
### Why are the changes needed?
`CacheManager.{cacheQuery, lookupCachedData}` run `QueryExecution.normalize`
(whose first rule is `NormalizeCTEIds`) so that two parses of the same SQL
canonicalize identically and cached entries can be reused. Today
`NormalizeCTEIds` only rewrites cteIds it discovers inside a `WithCTE` node.
When a caller probes `CacheManager` with a sub-tree that contains a
`CTERelationRef` but the corresponding `CTERelationDef` is not in the
sub-tree, the per-parse cteId leaks into the canonical form and
`sameResult` returns false on a structurally-identical re-parse, breaking
cache lookups for that sub-tree. This patch closes that gap so
`NormalizeCTEIds` produces a parse-stable canonical form for both wrapped
and orphan CTE references.
This fix is scoped to the `NormalizeCTEIds` contract; it does not by itself
remove all sources of canonical instability across parses (e.g. operand
ordering of commutative expressions or canonicalization of subquery
broadcast/dynamic-pruning wrappers), which are tracked separately.
### Does this PR introduce _any_ user-facing change?
No. `NormalizeCTEIds` is an internal plan-normalization rule. Plans without
orphan `CTERelationRef` produce byte-identical output to before; plans with
orphan refs only see their `cteId` rewritten to a deterministic value.
### How was this patch tested?
New `NormalizeCTEIdsSuite` with two unit tests:
- existing behaviour: `WithCTE`-wrapped plans canonicalize identically
across parses (regression guard for the unchanged path).
- new behaviour: orphan `CTERelationRef` sub-trees with structurally
identical bodies but different per-parse `cteId` values canonicalize
identically and are considered `sameResult` after `NormalizeCTEIds`.
New `CacheManagerSuite` test that parses a `WITH inner_cte AS ..., outer_cte
AS (... (SELECT m FROM inner_cte))` SQL twice, extracts `outer_cte`'s body
(an orphan `CTERelationRef` sub-tree) and verifies `QueryExecution.normalize`
produces the same canonical form / `sameResult` across parses.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: GitHub Copilot CLI, model: Claude Opus 4.7 1M context (claude-opus-4.7-1m-internal)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
Extend
NormalizeCTEIdsso that it normalizesCTERelationRef.cteIdfor "orphan" references whose matchingCTERelationDefis not present anywhere in the input plan (i.e., the enclosingWithCTElives outside the sub-tree being normalized).The change is intentionally minimal:
applyfirst walks the plan to collect the ids of every reachableCTERelationDef. References whosecteIdis in that set are paired with their def and remain handled by the existingWithCTEbranch (no behaviour change). References whosecteIdis NOT in that set are treated as orphan and pre-assigned a deterministic id from the same shared counter.case ref: CTERelationRefinapplyInternalrewrites those orphan refs to the assigned id; refs paired with a def are rewritten as before insidecanonicalizeCTEand never re-rewritten.Why are the changes needed?
CacheManager.{cacheQuery, lookupCachedData}runQueryExecution.normalize(whose first rule isNormalizeCTEIds) so that two parses of the same SQL canonicalize identically and cached entries can be reused. TodayNormalizeCTEIdsonly rewrites cteIds it discovers inside aWithCTEnode. When a caller probesCacheManagerwith a sub-tree that contains aCTERelationRefbut the correspondingCTERelationDefis not in the sub-tree, the per-parse cteId leaks into the canonical form andsameResultreturns false on a structurally-identical re-parse, breaking cache lookups for that sub-tree. This patch closes that gap soNormalizeCTEIdsproduces a parse-stable canonical form for both wrapped and orphan CTE references.This fix is scoped to the
NormalizeCTEIdscontract; it does not by itself remove all sources of canonical instability across parses (e.g. operand ordering of commutative expressions or canonicalization of subquery broadcast/dynamic-pruning wrappers), which are tracked separately.Does this PR introduce any user-facing change?
No.
NormalizeCTEIdsis an internal plan-normalization rule. Plans without orphanCTERelationRefproduce byte-identical output to before; plans with orphan refs only see theircteIdrewritten to a deterministic value.How was this patch tested?
New
NormalizeCTEIdsSuitewith two unit tests:WithCTE-wrapped plans canonicalize identically across parses (regression guard for the unchanged path).CTERelationRefsub-trees with structurally identical bodies but different per-parsecteIdvalues canonicalize identically and are consideredsameResultafterNormalizeCTEIds.New
CacheManagerSuitetest that parses aWITH inner_cte AS ..., outer_cte AS (... (SELECT m FROM inner_cte))SQL twice, extractsouter_cte's body (an orphanCTERelationRefsub-tree) and verifiesQueryExecution.normalizeproduces the same canonical form /sameResultacross parses.Was this patch authored or co-authored using generative AI tooling?
Generated-by: GitHub Copilot CLI, model: Claude Opus 4.7 1M context (claude-opus-4.7-1m-internal)