@rahulketch
Contributor

@rahulketch rahulketch commented Jun 17, 2025

Which issue does this PR close?

Rationale for this change

int96 min/max statistics emitted by arrow-rs are incorrect.

What changes are included in this PR?

  1. Fix the int96 stats
  2. Add round-trip test to verify the behavior

Not included in this PR:

  1. Read stats only from known good writers. This will be implemented after a new arrow-rs release.

Are there any user-facing changes?

The int96 min/max statistics will be different and correct.
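
For readers unfamiliar with the layout: an INT96 timestamp stores nanoseconds within the day in its first 8 bytes and the Julian day in its last 4 bytes, so comparing the raw words in storage order does not give chronological order. The sketch below illustrates that point only. `Int96Sketch`, `cmp_chronological`, and `cmp_storage_order` are hypothetical names for this example, not the types or functions in the parquet crate, and the Julian day is treated as unsigned here as a simplifying assumption.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for an INT96 value: three little-endian u32 words.
/// Words 0 and 1 hold nanoseconds within the day; word 2 holds the Julian day.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Int96Sketch([u32; 3]);

impl Int96Sketch {
    fn new(julian_day: u32, nanos_of_day: u64) -> Self {
        Int96Sketch([nanos_of_day as u32, (nanos_of_day >> 32) as u32, julian_day])
    }

    fn julian_day(&self) -> u32 {
        self.0[2]
    }

    fn nanos_of_day(&self) -> u64 {
        ((self.0[1] as u64) << 32) | self.0[0] as u64
    }
}

/// Chronological comparison: day first, then nanoseconds within the day.
fn cmp_chronological(a: &Int96Sketch, b: &Int96Sketch) -> Ordering {
    a.julian_day()
        .cmp(&b.julian_day())
        .then(a.nanos_of_day().cmp(&b.nanos_of_day()))
}

/// Naive comparison of the raw words in storage order, shown for contrast.
fn cmp_storage_order(a: &Int96Sketch, b: &Int96Sketch) -> Ordering {
    a.0.cmp(&b.0)
}

fn main() {
    // `earlier` is one day before `later`, but has more nanoseconds in its day.
    let earlier = Int96Sketch::new(2_460_000, 5_000);
    let later = Int96Sketch::new(2_460_001, 1_000);

    println!("chronological: {:?}", cmp_chronological(&earlier, &later));
    println!("storage order: {:?}", cmp_storage_order(&earlier, &later));
}
```

Min/max statistics computed with the second comparison would report the wrong bounds for a column containing both values, which is the kind of mismatch a round-trip test can catch.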

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 17, 2025
@rahulketch rahulketch changed the title GH-7686 [Parquet] Fix int96 min/max stats GH-7686: [Parquet] Fix int96 min/max stats Jun 17, 2025
@rahulketch rahulketch force-pushed the add-tests-for-int-96-stats branch 3 times, most recently from ede2b9a to 63a5fd5 Compare June 17, 2025 15:55
@rahulketch rahulketch force-pushed the add-tests-for-int-96-stats branch from 63a5fd5 to 6036398 Compare June 17, 2025 16:07
@etseidl
Contributor

etseidl commented Jun 17, 2025

I tend to agree with @emkornfield (apache/parquet-java#3243 (comment)) that this is a bit of putting the cart before the horse. The sort order for INT96 is currently undefined so statistics should be ignored. I think we need changes to the Parquet spec before proceeding with this.

@etseidl etseidl added enhancement Any new improvement worthy of a entry in the changelog api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version labels Jun 17, 2025
@etseidl
Contributor

etseidl commented Jul 2, 2025

I think we've achieved sufficient consensus to move forward with this. @rahulketch do you have time to address the outstanding issues here (@alamb's suggestion, CI errors, etc)? I have cycles to clean this up if you're short on time. Thanks!

@rahulketch
Contributor Author

@etseidl I have pushed the changes which I believe address the open concerns.

There is still the task of adding an allow/deny list to only accept the statistics from known good writers. See the corresponding change in parquet-java.

My suggestion for that is:

  1. Merge this change and make a new arrow-rs release 56.0.0
  2. After the new release add a check which allows reading the stats from:
  • parquet-java 1.15.0+
  • arrow-rs 56.0.0+
  • photon
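
A minimal sketch of what such a check might look like, assuming the writer is identified from the file's `created_by` string in the form `<application> version <x.y.z> ...`. The application names `parquet-mr` and `parquet-rs`, and the helper names `parse_version` and `int96_stats_trusted`, are assumptions for illustration rather than an existing arrow-rs API. Photon is left out because its `created_by` string is not given in this thread.

```rust
/// Parse a leading "major.minor.patch" into a tuple; None when malformed.
fn parse_version(s: &str) -> Option<(u32, u32, u32)> {
    // Drop any suffix such as "-SNAPSHOT" by cutting at the first
    // character that is neither a digit nor a dot.
    let core = s
        .split(|c: char| !(c.is_ascii_digit() || c == '.'))
        .next()?;
    let mut parts = core.split('.').map(|p| p.parse::<u32>().ok());
    Some((parts.next()??, parts.next()??, parts.next()??))
}

/// Returns true when INT96 min/max statistics from this writer would be
/// accepted under the allow list proposed above (hypothetical helper).
fn int96_stats_trusted(created_by: &str) -> bool {
    // Assumed shape: "<application> version <x.y.z> (build ...)"
    let mut tokens = created_by.split_whitespace();
    let (Some(app), Some("version"), Some(ver)) =
        (tokens.next(), tokens.next(), tokens.next())
    else {
        return false;
    };

    // Minimum versions taken from the list above; the application
    // names themselves are assumptions.
    let min = match app {
        "parquet-mr" => (1, 15, 0),
        "parquet-rs" => (56, 0, 0),
        _ => return false,
    };

    parse_version(ver).map_or(false, |v| v >= min)
}

fn main() {
    for created_by in [
        "parquet-mr version 1.15.0 (build abc)",
        "parquet-rs version 55.2.0",
        "some-other-writer version 3.0.0",
    ] {
        println!("{created_by}: {}", int96_stats_trusted(created_by));
    }
}
```

Unknown or unparseable writers are rejected by default, so statistics are ignored unless the writer is positively identified as known good.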

What do you think?

PS: I will be on vacation 4th July - 15th July, so I'll only be able to make more changes after that.

@etseidl
Contributor

etseidl commented Jul 3, 2025

Thanks @rahulketch! I've taken the liberty of fixing the remaining lint errors.

There is still the task of adding an allow/deny list to only accept the statistics from known good writers.

I agree that this is a follow-on task and can be done after this PR merges. I think it will need to be part of a larger review of statistics handling to see if other types with undefined sort orders are also ignored.

@etseidl etseidl marked this pull request as ready for review July 3, 2025 13:54
Contributor

@etseidl etseidl left a comment

I think this is good to go now. @alamb @emkornfield can you take another look? 🙏

Contributor

@alamb alamb left a comment

Looks good to me -- thank you @rahulketch and @etseidl

Co-authored-by: Alkis Evlogimenos <[email protected]>
@alamb alamb requested a review from emkornfield July 14, 2025 15:29
@alamb
Contributor

alamb commented Jul 14, 2025

@emkornfield this PR looks good to merge from my perspective, has several approvals, and I think addresses your feedback.

If you would like more time to review, please let us know; otherwise I plan to merge this PR tomorrow.

@emkornfield
Contributor

At the last sync we went back and forth on what was required for this from a spec perspective.

I think general consensus is we should update the spec (I think there is a PR open for this).

Beyond that there was discussion on whether we should have a new sort order or just rely on versions. I'll start a thread but #7909 is pertinent to a discussion on how much effort adding a SortOrder would be.

@alamb
Contributor

alamb commented Jul 14, 2025

@emkornfield are you opposed to merging this PR? I can't quite tell from the comments

I think general consensus is we should update the spec (I think there is a PR open for this).

I did make a PR here to clarify the spec about Int96 stats (with a recommendation, not with any actual change):

Beyond that there was discussion on whether we should have a new sort order or just rely on versions. I'll start a thread but #7909 is pertinent to a discussion on how much effort adding a SortOrder would be.

In my opinion, there is no reason to hold up this PR (which improves compatibility for an "implementation defined" part of the spec) on actually changing the spec. We can proceed with actually defining a SortOrder / fixing the spec in parallel

@emkornfield
Contributor

Yes, I'm fine to merge this PR. It seems like a strict improvement.

old_format,
),
Type::INT96 => {
// INT96 statistics may not be correct, because comparison is signed
Contributor

I don't think we should remove this until we have a filter on known good statistics.

Contributor

Suggested change
// INT96 statistics may not be correct, because comparison is signed
// INT96 statistics may not be correct, because comparison is signed

Contributor

I forgot to merge this one, so I made a follow-on PR:

@alamb alamb requested a review from emkornfield July 22, 2025 22:01
@alamb alamb dismissed emkornfield’s stale review July 22, 2025 22:03

Per comment, Micah is satisfied with this PR
#7687 (comment)

@alamb alamb merged commit dff67c9 into apache:main Jul 22, 2025
16 checks passed
@alamb
Contributor

alamb commented Jul 22, 2025

Thanks again everyone for all your help and patience

alamb added a commit that referenced this pull request Jul 23, 2025
# Which issue does this PR close?


- Follow on to #7687

# Rationale for this change

I merged #7687 without addressing
one of @emkornfield 's suggestions:
https://github.com/apache/arrow-rs/pull/7687/files#r2205393903

# What changes are included in this PR?

Implement the suggestion (restore a comment)

# Are these changes tested?

By CI

# Are there any user-facing changes?

No
Development

Successfully merging this pull request may close these issues.

Parquet: Incorrect min/max stats for int96 columns

7 participants