Skip to content

[SPARK-56738][SQL] Normalize cteIds of orphan CTERelationRef in NormalizeCTEIds#55708

Closed
yaooqinn wants to merge 1 commit intoapache:masterfrom
yaooqinn:SPARK-56738
Closed

[SPARK-56738][SQL] Normalize cteIds of orphan CTERelationRef in NormalizeCTEIds#55708
yaooqinn wants to merge 1 commit intoapache:masterfrom
yaooqinn:SPARK-56738

Conversation

@yaooqinn
Copy link
Copy Markdown
Member

@yaooqinn yaooqinn commented May 6, 2026

What changes were proposed in this pull request?

Extend NormalizeCTEIds so that it normalizes CTERelationRef.cteId for "orphan" references whose matching CTERelationDef is not present anywhere in the input plan (i.e., the enclosing WithCTE lives outside the sub-tree being normalized).

The change is intentionally minimal:

  • apply first walks the plan to collect the ids of every reachable CTERelationDef. References whose cteId is in that set are paired with their def and remain handled by the existing WithCTE branch (no behaviour change). References whose cteId is NOT in that set are treated as orphan and pre-assigned a deterministic id from the same shared counter.
  • A new case ref: CTERelationRef in applyInternal rewrites those orphan refs to the assigned id; refs paired with a def are rewritten as before inside canonicalizeCTE and never re-rewritten.

Why are the changes needed?

CacheManager.{cacheQuery, lookupCachedData} run QueryExecution.normalize (whose first rule is NormalizeCTEIds) so that two parses of the same SQL canonicalize identically and cached entries can be reused. Today NormalizeCTEIds only rewrites cteIds it discovers inside a WithCTE node. When a caller probes CacheManager with a sub-tree that contains a CTERelationRef but the corresponding CTERelationDef is not in the sub-tree, the per-parse cteId leaks into the canonical form and sameResult returns false on a structurally-identical re-parse, breaking cache lookups for that sub-tree. This patch closes that gap so NormalizeCTEIds produces a parse-stable canonical form for both wrapped and orphan CTE references.

This fix is scoped to the NormalizeCTEIds contract; it does not by itself remove all sources of canonical instability across parses (e.g. operand ordering of commutative expressions or canonicalization of subquery broadcast/dynamic-pruning wrappers), which are tracked separately.

Does this PR introduce any user-facing change?

No. NormalizeCTEIds is an internal plan-normalization rule. Plans without orphan CTERelationRef produce byte-identical output to before; plans with orphan refs only see their cteId rewritten to a deterministic value.

How was this patch tested?

New NormalizeCTEIdsSuite with two unit tests:

  • existing behaviour: WithCTE-wrapped plans canonicalize identically across parses (regression guard for the unchanged path).
  • new behaviour: orphan CTERelationRef sub-trees with structurally identical bodies but different per-parse cteId values canonicalize identically and are considered sameResult after NormalizeCTEIds.

New CacheManagerSuite test that parses a WITH inner_cte AS ..., outer_cte AS (... (SELECT m FROM inner_cte)) SQL twice, extracts outer_cte's body (an orphan CTERelationRef sub-tree) and verifies QueryExecution.normalize produces the same canonical form / sameResult across parses.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot CLI, model: Claude Opus 4.7 1M context (claude-opus-4.7-1m-internal)

…lizeCTEIds

### What changes were proposed in this pull request?

Extend `NormalizeCTEIds` so that it normalizes `CTERelationRef.cteId` for
"orphan" references whose matching `CTERelationDef` is not present anywhere
in the input plan (i.e., the enclosing `WithCTE` lives outside the sub-tree
being normalized).

The change is intentionally minimal:
- `apply` first walks the plan to collect the ids of every reachable
  `CTERelationDef`. References whose `cteId` is in that set are paired with
  their def and remain handled by the existing `WithCTE` branch (no behaviour
  change). References whose `cteId` is NOT in that set are treated as orphan
  and pre-assigned a deterministic id from the same shared counter.
- A new `case ref: CTERelationRef` in `applyInternal` rewrites those orphan
  refs to the assigned id; refs paired with a def are rewritten as before
  inside `canonicalizeCTE` and never re-rewritten.

### Why are the changes needed?

`CacheManager.{cacheQuery, lookupCachedData}` run `QueryExecution.normalize`
(whose first rule is `NormalizeCTEIds`) so that two parses of the same SQL
canonicalize identically and cached entries can be reused. Today
`NormalizeCTEIds` only rewrites cteIds it discovers inside a `WithCTE` node.
When a caller probes `CacheManager` with a sub-tree that contains a
`CTERelationRef` but the corresponding `CTERelationDef` is not in the
sub-tree, the per-parse cteId leaks into the canonical form and
`sameResult` returns false on a structurally-identical re-parse, breaking
cache lookups for that sub-tree. This patch closes that gap so
`NormalizeCTEIds` produces a parse-stable canonical form for both wrapped
and orphan CTE references.

This fix is scoped to the `NormalizeCTEIds` contract; it does not by itself
remove all sources of canonical instability across parses (e.g. operand
ordering of commutative expressions or canonicalization of subquery
broadcast/dynamic-pruning wrappers), which are tracked separately.

### Does this PR introduce _any_ user-facing change?

No. `NormalizeCTEIds` is an internal plan-normalization rule. Plans without
orphan `CTERelationRef` produce byte-identical output to before; plans with
orphan refs only see their `cteId` rewritten to a deterministic value.

### How was this patch tested?

New `NormalizeCTEIdsSuite` with two unit tests:
- existing behaviour: `WithCTE`-wrapped plans canonicalize identically
  across parses (regression guard for the unchanged path).
- new behaviour: orphan `CTERelationRef` sub-trees with structurally
  identical bodies but different per-parse `cteId` values canonicalize
  identically and are considered `sameResult` after `NormalizeCTEIds`.

New `CacheManagerSuite` test that parses a `WITH inner_cte AS ..., outer_cte
AS (... (SELECT m FROM inner_cte))` SQL twice, extracts `outer_cte`'s body
(an orphan `CTERelationRef` sub-tree) and verifies `QueryExecution.normalize`
produces the same canonical form / `sameResult` across parses.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot CLI, model: Claude Opus 4.7 1M context (claude-opus-4.7-1m-internal)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@yaooqinn yaooqinn closed this May 6, 2026
@yaooqinn yaooqinn deleted the SPARK-56738 branch May 6, 2026 12:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant