[SPARK-52895][SQL] Don't add duplicate elements in resolveExprsWithAggregate #51567

mihailotim-db
Contributor

@mihailotim-db mihailotim-db commented Jul 18, 2025

What changes were proposed in this pull request?

Don't add duplicate elements in resolveExprsWithAggregate.

Why are the changes needed?

This is needed to resolve plan mismatches between the fixed-point and single-pass analyzers. At the moment, fixed-point adds duplicate columns to the aggregate list if HAVING/ORDER BY references the same missing column more than once. However, if there are lateral column aliases (LCAs), fixed-point deduplicates these columns, because LCA resolution uses a set (and LCA resolution runs after ORDER BY/HAVING resolution in fixed-point). In single-pass, LCA resolution runs first and ORDER BY/HAVING resolution comes only afterwards, which adds duplicates. This PR makes the behavior consistent across all cases by never adding duplicates.
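
The difference between the two collection choices can be sketched outside Spark. This is a minimal illustration, not the analyzer code: plain strings stand in for the missing expressions, the object and method names are hypothetical, and it assumes `scala.collection.mutable.LinkedHashSet` (the PR does not say which `LinkedHashSet` is imported).

```scala
import scala.collection.mutable.{ArrayBuffer, LinkedHashSet}

// Hypothetical stand-in for the analyzer logic: strings play the role of
// the expressions that HAVING/ORDER BY needs but the aggregate list lacks.
object DedupSketch {
  // Old behavior: every occurrence is appended, so duplicates survive.
  def collectWithBuffer(missing: Seq[String]): Seq[String] = {
    val extra = ArrayBuffer.empty[String]
    missing.foreach(extra += _)
    extra.toSeq
  }

  // New behavior: a linked hash set drops repeats and keeps insertion order.
  def collectWithLinkedSet(missing: Seq[String]): Seq[String] = {
    val extra = LinkedHashSet.empty[String]
    missing.foreach(extra += _)
    extra.toSeq
  }
}
```

With the input `Seq("sum(b)", "max(c)", "sum(b)")`, the buffer version keeps all three entries, while the linked-set version keeps `sum(b)` and `max(c)` in their first-seen order.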

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new test cases to golden files.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Jul 18, 2025
@mihailotim-db mihailotim-db force-pushed the mihailotim-db/deduplicate_agg_exprs branch 2 times, most recently from 7a473ce to c34456e Compare July 20, 2025 07:28
@mihailotim-db mihailotim-db changed the title from "fix" to "[SPARK-52895][SQL] Don't add duplicate elements in resolveExprsWithAggregate" Jul 20, 2025
@mihailotim-db mihailotim-db force-pushed the mihailotim-db/deduplicate_agg_exprs branch from c34456e to 3a3b774 Compare July 20, 2025 07:47
@@ -2919,21 +2919,21 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
   def resolveExprsWithAggregate(
       exprs: Seq[Expression],
       agg: Aggregate): (Seq[NamedExpression], Seq[Expression]) = {
-    val extraAggExprs = ArrayBuffer.empty[NamedExpression]
+    val extraAggExprs = new LinkedHashSet[NamedExpression]
Contributor

Shall we use ExpressionSet, which deduplicates expressions by their semantics instead of object equality?

Contributor Author

This is a good point, but we can't actually use ExpressionSet, since we need to keep the ordering in the aggregate list deterministic. I actually found a separate bug here: #51557, and if we use a LinkedHashMap with the canonicalized expression as the key, we can solve both issues.
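
A rough sketch of the LinkedHashMap idea mentioned above, under stated assumptions: `Expr` is a toy class, and its `canonicalized` method (lowercasing and stripping whitespace) is only a stand-in for Catalyst's real canonicalization. The object and method names are hypothetical and are not taken from the patch.

```scala
import scala.collection.mutable.LinkedHashMap

object CanonicalDedupSketch {
  // Toy expression type (assumption): two expressions are treated as
  // semantically equal when their normalized SQL text matches.
  final case class Expr(sql: String) {
    def canonicalized: String = sql.toLowerCase.replaceAll("\\s+", "")
  }

  // Keep the first expression seen for each canonical form. LinkedHashMap
  // iterates in insertion order, so the result order is deterministic.
  def dedupe(exprs: Seq[Expr]): Seq[Expr] = {
    val seen = LinkedHashMap.empty[String, Expr]
    exprs.foreach(e => seen.getOrElseUpdate(e.canonicalized, e))
    seen.values.toSeq
  }
}
```

Keying on the canonical form gives semantic deduplication similar in spirit to what the reviewer asked for, while the linked map preserves the first-seen order that a plain set would not guarantee.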

@mihailotim-db mihailotim-db force-pushed the mihailotim-db/deduplicate_agg_exprs branch from 3a3b774 to d423f26 Compare July 21, 2025 05:45
@mihailotim-db mihailotim-db force-pushed the mihailotim-db/deduplicate_agg_exprs branch from d423f26 to 4c24e34 Compare July 21, 2025 05:50
@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in c67a774 Jul 21, 2025
haoyangeng-db pushed a commit to haoyangeng-db/apache-spark that referenced this pull request Jul 22, 2025
…ggregate`

Closes apache#51567 from mihailotim-db/mihailotim-db/deduplicate_agg_exprs.

Authored-by: Mihailo Timotic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>