feat: pass the ordering information to native Scan #2375
@andygrove can you please start the CI?
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2375      +/-   ##
============================================
+ Coverage     56.12%   57.47%   +1.35%
- Complexity      976     1297     +321
============================================
  Files           119      147      +28
  Lines         11743    13438    +1695
  Branches       2251     2353     +102
============================================
+ Hits           6591     7724    +1133
- Misses         4012     4452     +440
- Partials       1140     1262     +122
Co-authored-by: Oleks V <[email protected]>
Starting the CI. The first pass looks good to me; I'll check again later today.
@@ -48,13 +48,14 @@ object CometExecUtils {
   * partition. The limit operation is performed on the native side.
   */
  def getNativeLimitRDD(
      child: SparkPlan,
This is confusing IMO: child is the plan, while childPlan is in fact the data.
will rename
It is nice to have the sort info kept in sync from the Spark caller to the native code. To check the benefits, it would be great to see whether any of the TPC-H queries got faster.
Which issue does this PR close?
N/A
Rationale for this change
Sort information can be used by specialized implementations. For example, a sort will not re-sort its input if the input is already sorted, and hash aggregate will use a GroupValues implementation that detects a new group as soon as it sees the next value.
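To illustrate why a known input ordering helps aggregation, here is a minimal Scala sketch. It is not Comet or DataFusion code: `SortedGroupSum` and `sumSortedGroups` are hypothetical names made up for this example. Assuming the input is already sorted by key, a group is complete as soon as the key changes, so no hash table over all groups is needed.

```scala
// Hypothetical illustration only; not part of Comet or DataFusion.
object SortedGroupSum {

  /** Sums values per key, ASSUMING `rows` is already sorted by key.
    * Each group is emitted as soon as the next key is seen.
    */
  def sumSortedGroups(rows: Iterator[(String, Long)]): Iterator[(String, Long)] =
    new Iterator[(String, Long)] {
      // A buffered iterator lets us peek at the next row without consuming it.
      private val in = rows.buffered

      def hasNext: Boolean = in.hasNext

      def next(): (String, Long) = {
        val key = in.head._1
        var sum = 0L
        // Consume rows until the key changes; the current group is then done.
        while (in.hasNext && in.head._1 == key) sum += in.next()._2
        (key, sum)
      }
    }
}
```

Only one running sum is held in memory at a time, which is the kind of benefit an operator can get when it is told the ordering of its input.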
What changes are included in this PR?
Uses the child's output ordering and passes it to the native Scan.
How are these changes tested?
Existing tests
This can cause problems if Spark reports that something is sorted when we do not actually sort it.
For example, shuffle files in Spark are sorted but ours are not, so we should make sure that the sort information is used correctly.
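One way to guard against a wrong ordering claim, for example in a test or debug assertion, is to verify that the data really is sorted before relying on it. The sketch below is an assumption about how such a check could look, not something this PR implements; `OrderingCheck` and `isSorted` are hypothetical names.

```scala
// Hypothetical illustration only; not part of Comet or Spark.
object OrderingCheck {

  /** Returns true if `keys` is in non-decreasing order under `ord`.
    * An empty or single-element input counts as sorted.
    */
  def isSorted[A](keys: Iterator[A])(implicit ord: Ordering[A]): Boolean = {
    if (!keys.hasNext) return true
    var prev = keys.next()
    while (keys.hasNext) {
      val cur = keys.next()
      // A single out-of-order pair is enough to reject the ordering claim.
      if (ord.gt(prev, cur)) return false
      prev = cur
    }
    true
  }
}
```

A check like this costs a full pass over the data, so it would more plausibly live in tests than in the production read path.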