[SPARK-52895][SQL] Don't add duplicate elements in resolveExprsWithAggregate
#51567
Conversation
Force-pushed from 7a473ce to c34456e.
Force-pushed (`resolveExprsWithAggregate`) from c34456e to 3a3b774.
```diff
@@ -2919,21 +2919,21 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
   def resolveExprsWithAggregate(
       exprs: Seq[Expression],
       agg: Aggregate): (Seq[NamedExpression], Seq[Expression]) = {
-    val extraAggExprs = ArrayBuffer.empty[NamedExpression]
+    val extraAggExprs = new LinkedHashSet[NamedExpression]
```
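A minimal sketch of why `LinkedHashSet` fits here: it drops duplicates on insertion but, unlike a plain `HashSet`, iterates in deterministic first-seen order. `Attr` below is a hypothetical stand-in for `NamedExpression`, not Spark's real API:

```scala
import java.util.LinkedHashSet
import scala.jdk.CollectionConverters._

object LinkedHashSetDemo extends App {
  // Hypothetical stand-in for NamedExpression; case-class equality
  // plays the role of equality on attribute references.
  case class Attr(name: String)

  val extraAggExprs = new LinkedHashSet[Attr]
  Seq(Attr("a"), Attr("b"), Attr("a"), Attr("c")).foreach(extraAggExprs.add)

  // Duplicates are dropped, insertion order is preserved.
  assert(extraAggExprs.asScala.toSeq == Seq(Attr("a"), Attr("b"), Attr("c")))
}
```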
Shall we use `ExpressionSet`, which deduplicates expressions by their semantics instead of object equality?
This is a good point, but we can't actually use `ExpressionSet`, since we need to keep the ordering of the aggregate list deterministic. I actually found a separate bug here: #51557, and if we use a `LinkedHashMap` with the canonicalized expression as the key, we can solve both issues.
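A sketch of the `LinkedHashMap`-keyed-by-canonicalized-expression idea described above. `Expr` and its `canonicalized` method are toy stand-ins (lower-casing as a trivial canonical form), not Catalyst's real `Expression` API:

```scala
import scala.collection.mutable

object CanonicalDedupDemo extends App {
  // Toy stand-in for Catalyst's Expression.canonicalized:
  // here, the canonical form is just the lower-cased name.
  case class Expr(name: String) {
    def canonicalized: String = name.toLowerCase
  }

  // Keyed by canonical form: semantically equal expressions collapse,
  // while first-seen order stays deterministic.
  val seen = mutable.LinkedHashMap.empty[String, Expr]
  Seq(Expr("col1"), Expr("COL1"), Expr("col2")).foreach { e =>
    seen.getOrElseUpdate(e.canonicalized, e)
  }

  assert(seen.values.toSeq == Seq(Expr("col1"), Expr("col2")))
}
```

The map's value keeps the first-seen spelling of each expression, which is what makes the resulting aggregate list both deduplicated and stable across runs.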
Force-pushed from 3a3b774 to d423f26.
Force-pushed from d423f26 to 4c24e34.
thanks, merging to master!
Squash commit message: …ggregate`

What changes were proposed in this pull request?
Don't add duplicate elements in `resolveExprsWithAggregate`.

Why are the changes needed?
This is needed in order to resolve plan mismatches between the fixed-point and single-pass analyzers. At the moment, fixed-point duplicates columns if there are duplicate columns missing in HAVING/ORDER BY. However, if there are LCAs, fixed-point will deduplicate these columns because LCA resolution uses a set (and LCA resolution runs after ORDER BY/HAVING resolution in fixed-point). In single-pass, LCA resolution is done first, and only afterwards comes ORDER BY/HAVING resolution, which adds duplicates. This PR makes behavior consistent across all cases by never adding duplicates.

Does this PR introduce any user-facing change?
No.

How was this patch tested?
Added new test cases to golden files.

Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#51567 from mihailotim-db/mihailotim-db/deduplicate_agg_exprs.
Authored-by: Mihailo Timotic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>