-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Fix SELECT DISTINCT with delimited alias in ORDER BY bug #27004
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Fix SELECT DISTINCT with delimited alias in ORDER BY bug #27004
Conversation
The canonical value of a non-delimited Identifier was the upper-case value. The canonical value of a delimited Identifier was simply the value. Thus, statements like SELECT DISTINCT a as x FROM (VALUES 2, 1, 2) t(a) ORDER BY "x" would fail, as the canonical value of the identifier from SELECT DISTINCT would be "X", while the canonical value of the identifier from ORDER BY would be "x", and the identifiers would not match.
Reviewer's GuideImplement case-insensitive handling of delimited aliases in SELECT DISTINCT with ORDER BY by extending identifier canonicalization, updating analyzer lookup logic, and adding comprehensive tests. Sequence diagram for case-insensitive alias matching in SELECT DISTINCT with ORDER BYsequenceDiagram
participant "StatementAnalyzer"
participant "CanonicalizationAware"
participant "Identifier"
"StatementAnalyzer"->>"CanonicalizationAware": canonicalizationAwareKey(expression, true)
"CanonicalizationAware"->>"Identifier": getCanonicalValue(true)
"Identifier"-->>"CanonicalizationAware": canonical value (case-insensitive)
"CanonicalizationAware"-->>"StatementAnalyzer": key for alias lookup (case-insensitive)
"StatementAnalyzer"->>"CanonicalizationAware": aliases.contains(key)
"CanonicalizationAware"-->>"StatementAnalyzer": result (match found or not)
Class diagram for updated Identifier and CanonicalizationAware classesclassDiagram
class Identifier {
-String value
-boolean delimited
+boolean isDelimited()
+String getCanonicalValue()
+String getCanonicalValue(boolean ignoreCase)
}
class CanonicalizationAware {
-T node
-boolean ignoreCase
-int hashCode
+CanonicalizationAware(T node, boolean ignoreCase)
+static canonicalizationAwareKey(T node)
+static canonicalizationAwareKey(T node, boolean ignoreCase)
+T getNode()
+int hashCode()
+boolean equals(Object o)
+static Boolean canonicalizationAwareComparison(Node left, Node right)
+static Boolean canonicalizationAwareIgnoreCaseComparison(Node left, Node right)
+static OptionalInt canonicalizationAwareHash(Node node)
+static OptionalInt canonicalizationAwareIgnoreCaseHash(Node node)
}
CanonicalizationAware <|-- Identifier
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `core/trino-parser/src/main/java/io/trino/sql/tree/Identifier.java:84` </location>
<code_context>
+ return getCanonicalValue(false);
+ }
+
+ public String getCanonicalValue(boolean ignoreCase)
+ {
+ if (!ignoreCase && isDelimited()) {
</code_context>
<issue_to_address>
**issue (bug_risk):** The logic for case handling in getCanonicalValue(boolean ignoreCase) may be inconsistent for delimited identifiers.
Delimited identifiers are usually case-sensitive, so converting them to upper case when ignoreCase is true may be incorrect. Please review this logic.
</issue_to_address>
### Comment 2
<location> `core/trino-main/src/main/java/io/trino/sql/analyzer/CanonicalizationAware.java:128` </location>
<code_context>
+ if (node instanceof Identifier identifier) {
+ return OptionalInt.of(identifier.getCanonicalValue(true).hashCode());
+ }
+ if (node.getChildren().isEmpty()) {
+ return OptionalInt.of(node.hashCode());
+ }
</code_context>
<issue_to_address>
**issue (bug_risk):** Including node.hashCode() for leaf nodes in canonicalizationAwareIgnoreCaseHash may introduce inconsistency.
This mismatch between hashCode and equals for non-Identifier leaf nodes may cause issues in hash-based collections. Please ensure both methods are consistent.
</issue_to_address>
### Comment 3
<location> `core/trino-main/src/main/java/io/trino/sql/analyzer/StatementAnalyzer.java:5767` </location>
<code_context>
// the "a" in the SELECT clause is bound to the FROM scope, while the "a" in ORDER BY clause is bound
// to the "a" from the SELECT clause, so we can't compare by field id / relation id.
- if (expression instanceof Identifier && aliases.contains(canonicalizationAwareKey(expression))) {
+ if (expression instanceof Identifier && aliases.contains(canonicalizationAwareKey(expression, true))) {
continue;
}
</code_context>
<issue_to_address>
**question (bug_risk):** The use of ignoreCase=true for all alias comparisons may affect queries with delimited identifiers.
This change may result in incorrect behavior for queries using case-sensitive delimited identifiers. Please review whether this could impact query accuracy.
</issue_to_address>
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
if (node instanceof Identifier identifier) { | ||
return OptionalInt.of(identifier.getCanonicalValue(true).hashCode()); | ||
} | ||
if (node.getChildren().isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (bug_risk): Including node.hashCode() for leaf nodes in canonicalizationAwareIgnoreCaseHash may introduce inconsistency.
This mismatch between hashCode and equals for non-Identifier leaf nodes may cause issues in hash-based collections. Please ensure both methods are consistent.
// the "a" in the SELECT clause is bound to the FROM scope, while the "a" in ORDER BY clause is bound | ||
// to the "a" from the SELECT clause, so we can't compare by field id / relation id. | ||
if (expression instanceof Identifier && aliases.contains(canonicalizationAwareKey(expression))) { | ||
if (expression instanceof Identifier && aliases.contains(canonicalizationAwareKey(expression, true))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question (bug_risk): The use of ignoreCase=true for all alias comparisons may affect queries with delimited identifiers.
This change may result in incorrect behavior for queries using case-sensitive delimited identifiers. Please review whether this could impact query accuracy.
return null; | ||
} | ||
|
||
public static Boolean canonicalizationAwareIgnoreCaseComparison(Node left, Node right) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are ignoring case comparison - Say if we have a ORDER BY x, "X"
- How would it match ?
private int hashCode; | ||
|
||
private CanonicalizationAware(T node) | ||
private CanonicalizationAware(T node, boolean ignoreCase) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't right. The whole purpose of this class is to compare identifiers taking into account SQL canonicalization rules. I.e., delimited identifiers are kept as is, while non-delimited identifiers are canonicalized (to upper-case, per the SQL standard) before comparison.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, if the adopted convention is similar to Oracle (non-delimited identifiers converted to uppercase, delimited identifiers kept as they are), then the changes in this PR are not ok.
The changes in this PR try to follow the Trino documentation https://trino.io/docs/current/language/reserved.html#language-identifiers
"Identifiers are not treated as case sensitive."
I read the "Identifiers" as meaning all identifiers, non-delimited and delimited.
The SELECT (without DISTINCT) follows the convention in the online documentation, by ignoring the case of all identifiers.
The changes here try to bring the SELECT DISTINCT to be similar to SELECT.
I modified the CanonicalizationAware, instead of removing its usage from SELECT DISTINCT, because the Identifier's equals and hashCode consider the case sensitive value.
I will copy this comment on the issue #26755, as the issue is more visible, and this PR seems to be wrong.
Fix SELECT DISTINCT with delimited alias in ORDER BY bug
Fixes #26755
Description
The canonical value of a non-delimited Identifier was the upper-case value. The canonical value of a delimited Identifier was simply the value. Thus, statements like
SELECT DISTINCT a as x FROM (VALUES 2, 1, 2) t(a) ORDER BY "x"
would fail, as the canonical value of the identifier from SELECT DISTINCT would be "X",
while the canonical value of the identifier from ORDER BY would be "x", and the identifiers would not match.
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
(X) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:
Summary by Sourcery
Fix SELECT DISTINCT alias matching in ORDER BY to handle delimited and mixed-case identifiers by adding ignore-case canonicalization support and updating alias comparison accordingly.
Bug Fixes:
Enhancements:
Tests: