GH-47449: [C++][Parquet] Do not drop all Statistics if SortOrder is UNKNOWN #47466
cpp/src/parquet/statistics.cc
Outdated
} else if (SortOrder::UNKNOWN == sort_order) {
  return nullptr;
This does not seem right, because we were previously throwing an exception here. The problem is that if `ApplicationVersion::HasCorrectStatistics` returns true when the sort order is `SortOrder::UNKNOWN` and we want to generate the typed statistics, the call to `DoMakeComparator` is expected to return rather than throw, which is what currently happens.
Apart from ideas on the best approach for that, this also opens the question of the writer. Should we also write Statistics for columns with `SortOrder::UNKNOWN` but not compute min/max? Maybe that is worth a separate issue?
@pitrou what are your thoughts?
Ok, I looked at this and am coming to the following suggestion:
- `DoMakeComparator` should keep throwing an exception instead of returning `nullptr`
- `TypedStatisticsImpl` should catch the error when calling `MakeComparator` and then simply set the comparator to `nullptr` (methods using the comparator should be changed to be no-ops when that happens)
- Also, `TypedStatisticsImpl` only uses the comparator when updating/writing statistics, so it could probably be created lazily, not in the constructor? (but this is more of an improvement and not strictly necessary)
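The suggestion above could look roughly like the following. This is only a minimal sketch: `Int32Stats`, the simplified `Comparator`, and this `MakeComparator` are hypothetical stand-ins for illustration, not the actual classes in `statistics.cc`, and `std::runtime_error` stands in for `ParquetException`.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins, not the real Parquet C++ types.
enum class SortOrder { SIGNED, UNSIGNED, UNKNOWN };

struct Comparator {
  bool Less(int32_t a, int32_t b) const { return a < b; }
};

// Keeps throwing for UNKNOWN, as suggested in the first bullet.
std::shared_ptr<Comparator> MakeComparator(SortOrder order) {
  if (order == SortOrder::UNKNOWN) throw std::runtime_error("UNKNOWN Sort Order");
  return std::make_shared<Comparator>();
}

class Int32Stats {
 public:
  explicit Int32Stats(SortOrder order) {
    try {
      comparator_ = MakeComparator(order);
    } catch (const std::runtime_error&) {
      comparator_ = nullptr;  // no usable ordering: min/max stay unset
    }
  }

  void Update(const std::vector<int32_t>& values, int64_t null_count) {
    null_count_ += null_count;           // does not depend on any ordering
    if (comparator_ == nullptr) return;  // min/max update becomes a no-op
    for (int32_t v : values) {
      if (!has_min_max_ || comparator_->Less(v, min_)) min_ = v;
      if (!has_min_max_ || comparator_->Less(max_, v)) max_ = v;
      has_min_max_ = true;
    }
  }

  bool HasMinMax() const { return has_min_max_; }
  int64_t null_count() const { return null_count_; }
  int32_t min() const { assert(has_min_max_); return min_; }
  int32_t max() const { assert(has_min_max_); return max_; }

 private:
  std::shared_ptr<Comparator> comparator_;
  bool has_min_max_ = false;
  int32_t min_ = 0;
  int32_t max_ = 0;
  int64_t null_count_ = 0;
};
```

With an `UNKNOWN` sort order the object still accumulates `null_count`, while `HasMinMax()` stays false.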
…op the usage of comparator
cpp/src/parquet/statistics.cc
Outdated
try {
  comparator_ = MakeComparator<DType>(descr);
} catch (const ParquetException&) {
  comparator_ = nullptr;
nit: add an error log?
I am not sure we want an error log. Yes, building the comparator fails because the SortOrder is UNKNOWN and we can't build one, but generating statistics in that case is still a valid use case (we just won't set min/max), so I don't think we want to log an error. What do you think?
See the `DoMakeComparator` function:
arrow/cpp/src/parquet/statistics.cc
Lines 950 to 998 in d2dace9
std::shared_ptr<Comparator> DoMakeComparator(Type::type physical_type,
                                             LogicalType::Type::type logical_type,
                                             SortOrder::type sort_order,
                                             int type_length) {
  if (SortOrder::SIGNED == sort_order) {
    switch (physical_type) {
      case Type::BOOLEAN:
        return std::make_shared<TypedComparatorImpl<true, BooleanType>>();
      case Type::INT32:
        return std::make_shared<TypedComparatorImpl<true, Int32Type>>();
      case Type::INT64:
        return std::make_shared<TypedComparatorImpl<true, Int64Type>>();
      case Type::INT96:
        return std::make_shared<TypedComparatorImpl<true, Int96Type>>();
      case Type::FLOAT:
        return std::make_shared<TypedComparatorImpl<true, FloatType>>();
      case Type::DOUBLE:
        return std::make_shared<TypedComparatorImpl<true, DoubleType>>();
      case Type::BYTE_ARRAY:
        return std::make_shared<TypedComparatorImpl<true, ByteArrayType>>();
      case Type::FIXED_LEN_BYTE_ARRAY:
        if (logical_type == LogicalType::Type::FLOAT16) {
          return std::make_shared<TypedComparatorImpl<true, Float16LogicalType>>(
              type_length);
        }
        return std::make_shared<TypedComparatorImpl<true, FLBAType>>(type_length);
      default:
        ParquetException::NYI("Signed Compare not implemented");
    }
  } else if (SortOrder::UNSIGNED == sort_order) {
    switch (physical_type) {
      case Type::INT32:
        return std::make_shared<TypedComparatorImpl<false, Int32Type>>();
      case Type::INT64:
        return std::make_shared<TypedComparatorImpl<false, Int64Type>>();
      case Type::INT96:
        return std::make_shared<TypedComparatorImpl<false, Int96Type>>();
      case Type::BYTE_ARRAY:
        return std::make_shared<TypedComparatorImpl<false, ByteArrayType>>();
      case Type::FIXED_LEN_BYTE_ARRAY:
        return std::make_shared<TypedComparatorImpl<false, FLBAType>>(type_length);
      default:
        ParquetException::NYI("Unsigned Compare not implemented");
    }
  } else {
    throw ParquetException("UNKNOWN Sort Order");
  }
  return nullptr;
}
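To illustrate why min/max cannot be computed without a known sort order, here is a small sketch showing that the same `int32_t` values give different min/max under signed and unsigned comparison. The `Less` and `MinMax` helpers are hypothetical and only mimic the `is_signed` template flag of `TypedComparatorImpl`; they are not part of the Parquet C++ API.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: compare as signed, or reinterpret as unsigned first.
template <bool is_signed>
bool Less(int32_t a, int32_t b) {
  return is_signed ? a < b
                   : static_cast<uint32_t>(a) < static_cast<uint32_t>(b);
}

// Returns {min, max} under the chosen ordering. Assumes `values` is non-empty.
template <bool is_signed>
std::pair<int32_t, int32_t> MinMax(const std::vector<int32_t>& values) {
  auto its = std::minmax_element(values.begin(), values.end(), Less<is_signed>);
  return std::make_pair(*its.first, *its.second);
}
```

For the values `{-1, 0, 1}`, the signed ordering yields min `-1` and max `1`, while the unsigned ordering yields min `0` and max `-1` (which is `0xFFFFFFFF` when read as unsigned). With an unknown ordering there is no correct choice between the two, which is why only min/max have to be dropped.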
@@ -573,7 +578,11 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
         min_buffer_(AllocateBuffer(pool_, 0)),
         max_buffer_(AllocateBuffer(pool_, 0)),
         logical_type_(LogicalTypeId(descr_)) {
-    comparator_ = MakeComparator<DType>(descr);
+    try {
+      comparator_ = MakeComparator<DType>(descr);
When will it throw?
This will throw when generating typed statistics for a column whose sort order is `SortOrder::UNKNOWN`, but we still want the statistics to be generated, just without min/max. I can add a comment for that case.
Instead of trying and catching, how about just not calling `MakeComparator` if the sort order is unknown?
(sorry, I realize I suggested try/catch above; both are ok to me, by the way)
Maybe the early check is better and more explicit when reading the code; otherwise we would want to add the comment. I'll change it.
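A rough sketch of the early-check variant, for comparison with try/catch. `MakeComparatorIfOrdered` is a hypothetical helper name, and the simplified `Comparator` and `MakeComparator` are stand-ins rather than the real Parquet C++ declarations (`std::runtime_error` stands in for `ParquetException`).

```cpp
#include <memory>
#include <stdexcept>

// Hypothetical stand-ins, not the real Parquet C++ types.
enum class SortOrder { SIGNED, UNSIGNED, UNKNOWN };
struct Comparator {};

// Unchanged behavior: still throws for UNKNOWN.
std::shared_ptr<Comparator> MakeComparator(SortOrder order) {
  if (order == SortOrder::UNKNOWN) throw std::runtime_error("UNKNOWN Sort Order");
  return std::make_shared<Comparator>();
}

// Early check: never call MakeComparator when no ordering is defined.
// A null comparator means "do not compute min/max".
std::shared_ptr<Comparator> MakeComparatorIfOrdered(SortOrder order) {
  if (order == SortOrder::UNKNOWN) return nullptr;
  return MakeComparator(order);
}
```

The check states the intent at the call site and leaves the exception path for genuinely unexpected failures, such as an unimplemented physical type.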
Co-authored-by: Gang Wu <[email protected]>
…e creating comparator
Rationale for this change
Currently we drop all statistics if `SortOrder` is `UNKNOWN`. This seems too broad, and there are some statistics, like `null_count`, that could be maintained.
arrow/cpp/src/parquet/metadata.cc
Lines 330 to 335 in 6f6138b
Clearing `min/max` but keeping `null_count` when `SortOrder` is `UNKNOWN` would allow users to use the remaining statistics.
What changes are included in this PR?
Maintain Statistics when reading them if the sort order is `SortOrder::UNKNOWN`, but clear min/max.
Are these changes tested?
Yes, there is a file on parquet-testing which allows us to validate this exact scenario.
Are there any user-facing changes?
No changes to APIs; users will be able to read statistics in this case.
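As an illustration of the read-side behavior described above, a minimal sketch under stated assumptions: `EncodedStats` and `SanitizeForRead` are hypothetical names invented for this example and do not reflect the actual code in `metadata.cc`.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-ins, not the real Parquet C++ types.
enum class SortOrder { SIGNED, UNSIGNED, UNKNOWN };

struct EncodedStats {
  bool has_min = false;
  bool has_max = false;
  std::string min;
  std::string max;
  bool has_null_count = false;
  int64_t null_count = 0;
};

// With an UNKNOWN sort order, drop only the order-dependent fields
// (min/max) and keep everything else, such as null_count.
EncodedStats SanitizeForRead(EncodedStats stats, SortOrder order) {
  if (order == SortOrder::UNKNOWN) {
    stats.has_min = false;
    stats.has_max = false;
    stats.min.clear();
    stats.max.clear();
  }
  return stats;
}
```

Previously the whole statistics object would have been discarded in this case; here a reader still gets `null_count`.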