Skip to content

Conversation

raulcd
Copy link
Member

@raulcd raulcd commented Sep 1, 2025

Rationale for this change

Currently we drop all statistics if SortOrder is UNKNOWN. This seems too broad and there are some statistics, like null_count that could be maintained.

// If the column statistics don't exist or column sort order is unknown
// we cannot use the column stats
if (!column_metadata_->__isset.statistics ||
descr_->sort_order() == SortOrder::UNKNOWN) {
return false;
}

Clearing min/max but allowing to keep null_count when SortOrder is UNKNOWN would allow users to use them.

What changes are included in this PR?

Maintain Statistics when reading them if SortOrder::UNKNOWK but clear min/max

Are these changes tested?

Yes, there is a file on parquet-testing which allows us to validate this exact scenario.

Are there any user-facing changes?

No changes to APIs, users will be able to read statistics on this case.

Comment on lines 999 to 1000
} else if (SortOrder::UNKNOWN == sort_order) {
return nullptr;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not seem right because we were previously returning an Exception. The problem is that if ApplicationVersion::HasCorrectStatistics returns true when SortOrder::UNKNOWN if we want to generate the Typed Statistics a call to DoMakeComparator is expected to return and not raise an Exception as is currently happening.

Apart from ideas on what could be the best approach about that, this open the question of the writer too. Should we also write Statistics for columns with SortOrder::UNKNOWN but not compute min/max? Maybe as a different issue is it worth it?

@pitrou what are your thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I looked at this and am coming to the following suggestion:

  1. DoMakeComparator should keep throwing an exception instead of returning nullptr
  2. TypedStatisticsImpl should catch the error when calling MakeComparator and then simply set the comparator to nullptr (methods using the comparator should be changed to be no-ops when that happens)
  3. Also, TypedStatisticsImpl only uses the comparator when updating/writing statistics, so it could be probably be created lazily, not in the constructor? (but this is more of an improvement and not strictly necessary)

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Sep 2, 2025
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 9, 2025
@raulcd raulcd marked this pull request as ready for review September 9, 2025 09:24
@raulcd raulcd requested a review from wgtmac as a code owner September 9, 2025 09:24
try {
comparator_ = MakeComparator<DType>(descr);
} catch (const ParquetException&) {
comparator_ = nullptr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add an error log?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure we want to add an error log as, yes, it is an error building the comparator, because SortOrder is UNKNOWN and we can't build a comparator but it will be a valid use case of generating statistics (we will just not set min/max) so I don't think we want to add an error log. What do you think?
See the DoMakeComparator function:

std::shared_ptr<Comparator> DoMakeComparator(Type::type physical_type,
LogicalType::Type::type logical_type,
SortOrder::type sort_order,
int type_length) {
if (SortOrder::SIGNED == sort_order) {
switch (physical_type) {
case Type::BOOLEAN:
return std::make_shared<TypedComparatorImpl<true, BooleanType>>();
case Type::INT32:
return std::make_shared<TypedComparatorImpl<true, Int32Type>>();
case Type::INT64:
return std::make_shared<TypedComparatorImpl<true, Int64Type>>();
case Type::INT96:
return std::make_shared<TypedComparatorImpl<true, Int96Type>>();
case Type::FLOAT:
return std::make_shared<TypedComparatorImpl<true, FloatType>>();
case Type::DOUBLE:
return std::make_shared<TypedComparatorImpl<true, DoubleType>>();
case Type::BYTE_ARRAY:
return std::make_shared<TypedComparatorImpl<true, ByteArrayType>>();
case Type::FIXED_LEN_BYTE_ARRAY:
if (logical_type == LogicalType::Type::FLOAT16) {
return std::make_shared<TypedComparatorImpl<true, Float16LogicalType>>(
type_length);
}
return std::make_shared<TypedComparatorImpl<true, FLBAType>>(type_length);
default:
ParquetException::NYI("Signed Compare not implemented");
}
} else if (SortOrder::UNSIGNED == sort_order) {
switch (physical_type) {
case Type::INT32:
return std::make_shared<TypedComparatorImpl<false, Int32Type>>();
case Type::INT64:
return std::make_shared<TypedComparatorImpl<false, Int64Type>>();
case Type::INT96:
return std::make_shared<TypedComparatorImpl<false, Int96Type>>();
case Type::BYTE_ARRAY:
return std::make_shared<TypedComparatorImpl<false, ByteArrayType>>();
case Type::FIXED_LEN_BYTE_ARRAY:
return std::make_shared<TypedComparatorImpl<false, FLBAType>>(type_length);
default:
ParquetException::NYI("Unsigned Compare not implemented");
}
} else {
throw ParquetException("UNKNOWN Sort Order");
}
return nullptr;
}

@@ -573,7 +578,11 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
min_buffer_(AllocateBuffer(pool_, 0)),
max_buffer_(AllocateBuffer(pool_, 0)),
logical_type_(LogicalTypeId(descr_)) {
comparator_ = MakeComparator<DType>(descr);
try {
comparator_ = MakeComparator<DType>(descr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When will it throw?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will throw when generating typed statistics if SortOrder::UNKNOWN but we still want statistics to be generated except min/max. I can add a comment for that case.

Copy link
Member

@pitrou pitrou Sep 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of trying and catching, how about just not calling MakeComparator if the sort order is Unknown?

(sorry, I realize I suggested try/catch above; both are ok to me, by the way)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe the early check is better and more explicit when reading the code, otherwise we might want to add the comment. I'll change it.

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting change review Awaiting change review awaiting changes Awaiting changes labels Sep 12, 2025
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 12, 2025
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Sep 15, 2025
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 15, 2025
@raulcd raulcd requested review from pitrou and wgtmac September 16, 2025 06:54
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you @raulcd . @wgtmac Do you want to take a last look?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants