[SPARK-52895][SQL] Don't add duplicate elements in `resolveExprsWithAggregate` #51567

mihailotim-db · 2025-07-18T18:56:17Z

What changes were proposed in this pull request?

Don't add duplicate elements in resolveExprsWithAggregate.

Why are the changes needed?

This is needed to resolve plan mismatches between the fixed-point and single-pass analyzers. At the moment, the fixed-point analyzer adds duplicate columns to the aggregate list when HAVING/ORDER BY references the same missing column more than once. However, if the query contains lateral column aliases (LCAs), the fixed-point analyzer deduplicates these columns, because LCA resolution uses a set and runs after ORDER BY/HAVING resolution. In the single-pass analyzer the order is reversed: LCA resolution runs first, and ORDER BY/HAVING resolution runs afterwards and adds the duplicates. This PR makes the behavior consistent across all cases by never adding duplicates.
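The effect of collecting the missing expressions in an insertion-ordered set rather than a buffer can be sketched outside Spark. This is a minimal illustration, not the analyzer code: `Col` is a hypothetical stand-in for Catalyst's `NamedExpression`, the method names are made up for the example, and `scala.collection.mutable.LinkedHashSet` is assumed as the set type.

```scala
import scala.collection.mutable

object DedupSketch {
  // Hypothetical stand-in for a named expression; not a Spark class.
  final case class Col(name: String)

  // Buffer-based collection: every missing reference is appended,
  // so a column referenced twice ends up in the result twice.
  def collectWithBuffer(missing: Seq[Col]): Seq[Col] = {
    val extra = mutable.ArrayBuffer.empty[Col]
    missing.foreach(extra += _)
    extra.toSeq
  }

  // Set-based collection: repeats are dropped, and because the set is
  // insertion-ordered, the surviving elements keep first-seen order.
  def collectWithLinkedSet(missing: Seq[Col]): Seq[Col] = {
    val extra = mutable.LinkedHashSet.empty[Col]
    missing.foreach(extra += _)
    extra.toSeq
  }
}
```

With the input `Seq(Col("b"), Col("a"), Col("b"))`, the buffer version keeps all three elements, while the set version keeps `b` and `a` in that order.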

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new test cases to golden files.

Was this patch authored or co-authored using generative AI tooling?

No

cloud-fan · 2025-07-21T02:41:30Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -2919,21 +2919,21 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
    def resolveExprsWithAggregate(
        exprs: Seq[Expression],
        agg: Aggregate): (Seq[NamedExpression], Seq[Expression]) = {
-      val extraAggExprs = ArrayBuffer.empty[NamedExpression]
+      val extraAggExprs = new LinkedHashSet[NamedExpression]


Shall we use `ExpressionSet`, which deduplicates expressions by their semantics instead of object equality?

mihailotim-db · 2025-07-21

This is a good point, but we can't use `ExpressionSet` because the ordering of the aggregate list must stay deterministic. I also found a separate bug, #51557; if we use a `LinkedHashMap` keyed by the canonicalized expression, we can solve both issues.
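The idea in this reply, deduplicating by a canonical key while preserving first-seen order, can be sketched as follows. This is an illustration under stated assumptions rather than the merged change: `Expr` and its `canonical` method are hypothetical, and the whitespace and case normalization only stands in for whatever Catalyst's canonicalization actually does.

```scala
import scala.collection.mutable

object CanonicalDedupSketch {
  // Hypothetical expression type; not a Spark class.
  final case class Expr(text: String) {
    // Assumed canonical form for the example: ignore whitespace and case.
    def canonical: String = text.filterNot(_.isWhitespace).toLowerCase
  }

  // Keep the first expression seen for each canonical key.
  // LinkedHashMap iterates in insertion order, so the output order is
  // deterministic and follows the order of first occurrence.
  def dedupe(exprs: Seq[Expr]): Seq[Expr] = {
    val seen = mutable.LinkedHashMap.empty[String, Expr]
    exprs.foreach(e => seen.getOrElseUpdate(e.canonical, e))
    seen.values.toSeq
  }
}
```

Here `Expr("a + b")` and `Expr("A+B")` share a canonical key, so only the first one survives, whereas a set based on plain equality would keep both.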

cloud-fan · 2025-07-21T10:33:54Z

thanks, merging to master!

…ggregate`

### What changes were proposed in this pull request?
Don't add duplicate elements in `resolveExprsWithAggregate`.

### Why are the changes needed?
This is needed in order to resolve plan mismatches between fixed-point and single-pass analyzer. At the moment fixed-point duplicates columns if there are duplicate columns missing in HAVING/ORDER BY. However, if there are LCAs, fixed-point will deduplicate these columns because LCA resolution uses a set (and LCA resolution runs after ORDER BY/HAVING resolution in fixed-point). In single-pass LCA resolution is done first and only after comes ORDER BY/HAVING resolution which adds duplicates. This PR makes behavior consistent across all cases by never adding duplicates.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added new test cases to golden files.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#51567 from mihailotim-db/mihailotim-db/deduplicate_agg_exprs.

Authored-by: Mihailo Timotic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

github-actions bot added the SQL label Jul 18, 2025

mihailotim-db force-pushed the mihailotim-db/deduplicate_agg_exprs branch 2 times, most recently from 7a473ce to c34456e Compare July 20, 2025 07:28

mihailotim-db changed the title ~~fix~~ [SPARK-52895][SQL] Don't add duplicate elements in resolveExprsWithAggregate Jul 20, 2025

mihailotim-db force-pushed the mihailotim-db/deduplicate_agg_exprs branch from c34456e to 3a3b774 Compare July 20, 2025 07:47

cloud-fan reviewed Jul 21, 2025

mihailotim-db force-pushed the mihailotim-db/deduplicate_agg_exprs branch from 3a3b774 to d423f26 Compare July 21, 2025 05:45

fix

4c24e34

mihailotim-db force-pushed the mihailotim-db/deduplicate_agg_exprs branch from d423f26 to 4c24e34 Compare July 21, 2025 05:50

cloud-fan approved these changes Jul 21, 2025

cloud-fan closed this in c67a774 Jul 21, 2025

mihailotim-db mentioned this pull request Jul 21, 2025

[SPARK-52896][SQL] Match attribute ExprId in OuterReference with ExprId of exposed outer attribute #51557

Closed
