-
Notifications
You must be signed in to change notification settings - Fork 1.1k
GH-7686: [Parquet] Fix int96 min/max stats #7687
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ede2b9a to
63a5fd5
Compare
63a5fd5 to
6036398
Compare
|
I tend to agree with @emkornfield (apache/parquet-java#3243 (comment)) that this is a bit of putting the cart before the horse. The sort order for |
|
I think we've achieved sufficient consensus to move forward with this. @rahulketch do you have time to address the outstanding issues here (@alamb's suggestion, CI errors, etc)? I have cycles to clean this up if you're short on time. Thanks! |
|
@etseidl I have pushed the changes which I believe address the open concerns. There is still the task of adding an allow/deny list to only accept the statistics from known good writer. See the corresponding change in parquet-java. My suggestion for that is:
What do you think? PS: I will be on vacation 4th July - 15th July, so I'll only be able to make more changes after that. |
|
Thanks @rahulketch! I've taken the liberty of fixing the remaining lint errors.
I agree that this is a follow on task and can be done after this PR merges. I think it will need to be part of a larger review of statistics handling to see if other types with undefined sort orders are also ignored. |
etseidl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good to go now. @alamb @emkornfield can you take another look? 🙏
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- thank you @rahulketch and @etseidl
Co-authored-by: Alkis Evlogimenos <[email protected]>
|
@emkornfield this PR looks good to merge from my perspective and has several approvals and I think addresses your feedback. If you would like more time to review, please let us know, otherwise I plan to merge this PR tomorrow |
|
At the last sync we went back and forth on what was required for this from a spec perspective. I think general consensus is we should update the spec (I think there is a PR open for this). Beyond that there was discussion on whether we should have a new sort order or just rely on versions. I'll start a thread but #7909 is pertinent to a discussion on on how much effort adding a SortOrder would be. |
|
@emkornfield are you opposed to merging this PR? I can't quite tell from the comments
I did make a PR here to clarify the spec about Int96 stats (with a recommendation, not with any actual change):
In my opinion, there is no reason to hold up this PR (which improves compatibility for an "implementation defined" part of the spec) on actually changing the spec. We can proceed with actually defining a SortOrder / fixing the spec in parallel |
|
Yes, I'm fine to merge this PR. It seems like a strict improvement. |
| old_format, | ||
| ), | ||
| Type::INT96 => { | ||
| // INT96 statistics may not be correct, because comparison is signed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should remove this until we have a filter on known good statistics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // INT96 statistics may not be correct, because comparison is signed | |
| // INT96 statistics may not be correct, because comparison is signed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot to merge this one. so I made a follow on PR:
Per comment, Micah is satisfied with this PR
#7687 (comment)
|
Thanks again everyone for all your help and patience |
# Which issue does this PR close? - Follow on to #7687 # Rationale for this change I merged #7687 without addressing one of @emkornfield 's suggestions: https://github.com/apache/arrow-rs/pull/7687/files#r2205393903 # What changes are included in this PR? Implement the suggestion (restore a comment_ # Are these changes tested? BY CI # Are there any user-facing changes? No
Which issue does this PR close?
Rationale for this change
int96 min/max statistics emitted by arrow-rs are incorrect.
What changes are included in this PR?
Not included in this PR:
Are there any user-facing changes?
The int96 min/max statistics will be different and correct.