GH-48206 Fix Statistics logic to enable Parquet DB support on s390x #48207

Vishwanatha-HD · 2025-11-21T15:01:46Z

Rationale for this change

This PR enables Parquet DB support on big-endian (s390x) systems by fixing the Statistics logic.

What changes are included in this PR?

The fix changes the following file:
cpp/src/parquet/statistics.cc

Are these changes tested?

Yes. The changes were tested on s390x to confirm that they work, and on x86 to confirm that they introduce no regression.

Are there any user-facing changes?

No

GitHub main Issue link: #48151

GitHub Issue: [C++][Parquet] Fix Statistics logic to enable Parquet DB support on Big-Endian (s390x) systems #48206

github-actions · 2025-11-21T15:02:15Z

⚠️ GitHub issue #48206 has been automatically assigned in GitHub to PR creator.


kou · 2025-11-22T13:36:20Z

cpp/src/parquet/statistics.cc

+  // Fallback: use encoder for other types
+  auto encoder = MakeTypedEncoder<DType>(Encoding::PLAIN, false, descr_, pool_);
+  encoder->Put(&src, 1);
+  auto buffer = encoder->FlushValues();
+  dst->assign(reinterpret_cast<const char*>(buffer->data()),
+              static_cast<size_t>(buffer->size()));


Can we reuse the implementation in ARROW_LITTLE_ENDIAN for this?

    #if !ARROW_LITTLE_ENDIAN
      if constexpr (...) {
        ...
      } else if ... {
        ...
      }
    #endif
      auto encoder = MakeTypedEncoder<DType>(Encoding::PLAIN, false, descr_, pool_);
      encoder->Put(&src, 1);
      auto buffer = encoder->FlushValues();
      auto ptr = reinterpret_cast<const char*>(buffer->data());
      dst->assign(ptr, static_cast<size_t>(buffer->size()));

kou · 2025-11-22T13:40:06Z

cpp/src/parquet/statistics.cc

+  if constexpr (std::is_same_v<DType, Int32Type>) {
+    uint32_t u;
+    std::memcpy(&u, &src, sizeof(u));
+    u = ::arrow::bit_util::ToLittleEndian(u);
+    dst->assign(reinterpret_cast<const char*>(&u), sizeof(u));
+    return;
+  } else if constexpr (std::is_same_v<DType, Int64Type>) {
+    uint64_t u;
+    std::memcpy(&u, &src, sizeof(u));
+    u = ::arrow::bit_util::ToLittleEndian(u);
+    dst->assign(reinterpret_cast<const char*>(&u), sizeof(u));
+    return;
+  } else if constexpr (std::is_same_v<DType, FloatType>) {
+    uint32_t u;
+    static_assert(sizeof(u) == sizeof(float), "size");
+    std::memcpy(&u, &src, sizeof(u));
+    u = ::arrow::bit_util::ToLittleEndian(u);
+    dst->assign(reinterpret_cast<const char*>(&u), sizeof(u));
+    return;
+  } else if constexpr (std::is_same_v<DType, DoubleType>) {
+    uint64_t u;
+    static_assert(sizeof(u) == sizeof(double), "size");
+    std::memcpy(&u, &src, sizeof(u));
+    u = ::arrow::bit_util::ToLittleEndian(u);
+    dst->assign(reinterpret_cast<const char*>(&u), sizeof(u));
+    return;
+  }


Can we do this in XXXEncoder::Put() instead of here?
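
For illustration only, here is a minimal sketch of what moving the byte-order handling into an encoder's `Put()` could look like. `PlainEncoderSketch` and `EncodeStatistic` are hypothetical names and do not correspond to Arrow's actual encoder classes. The sketch assumes 32-bit and 64-bit values whose floating-point and integer byte orders match on the host. It extracts bytes with shifts, so the output is little-endian on any host and needs no `#if` branch:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <type_traits>

// Hypothetical stand-in for a PLAIN encoder; this is not Arrow's real class.
template <typename T>
class PlainEncoderSketch {
  static_assert(sizeof(T) == 4 || sizeof(T) == 8, "32- or 64-bit values only");
  // Same-width unsigned type used to carry the bit pattern of T.
  using Bits = std::conditional_t<sizeof(T) == 4, uint32_t, uint64_t>;

 public:
  // Appends each value least significant byte first, whatever the host order.
  void Put(const T* values, int num_values) {
    for (int i = 0; i < num_values; ++i) {
      Bits bits;
      std::memcpy(&bits, &values[i], sizeof(bits));
      for (std::size_t byte = 0; byte < sizeof(bits); ++byte) {
        sink_.push_back(static_cast<char>((bits >> (8 * byte)) & 0xFF));
      }
    }
  }

  // Returns the bytes written so far and resets the encoder.
  std::string FlushValues() {
    std::string out;
    out.swap(sink_);
    return out;
  }

 private:
  std::string sink_;
};

// With the conversion inside Put(), the caller needs no per-type branches.
template <typename T>
std::string EncodeStatistic(const T& src) {
  PlainEncoderSketch<T> encoder;
  encoder.Put(&src, 1);
  return encoder.FlushValues();
}
```

Under these assumptions, the Int32/Int64/Float/Double branches in the hunk above would collapse into the single `Put()` call, and little-endian and big-endian hosts would share one code path.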

k8ika0s · 2025-11-23T22:40:31Z

@Vishwanatha-HD

Statistics tend to surface all sorts of subtle endian quirks, so it’s always interesting to see how different approaches handle those edge cases.

Running things on s390x, I’ve found that the most stable behavior usually comes from treating every numeric value—whether it’s a 32-bit int, a float, or the three-limb INT96—as if it should be serialized in LE form no matter what the host is doing. Once everything passes through that single normalization step, the defaults, comparisons, and encoder paths all line up cleanly across architectures.

Here, the explicit BE branches for Int32/Int64/Float/Double make the intention clear and should work fine, though it does mean LE and BE end up taking two quite different routes through the code. That can occasionally lead to tiny differences across platforms, especially when stats pages mix types or include INT96 timestamps.

Not raising an issue with the logic—just sharing patterns that have helped keep stats round-trips consistent across hosts.
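
The single normalization step described in this comment can be sketched roughly as follows. This is a standalone illustration, not code from this PR or from Arrow: `HostIsLittleEndian`, `AppendLittleEndian`, `ReadLittleEndian`, `Int96Sketch`, `EncodeInt96`, and `DecodeInt96` are all hypothetical names. It assumes an INT96 value is held as three 32-bit limbs and that each limb is normalized on its own:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>
#include <type_traits>

// True when the host stores the least significant byte first.
inline bool HostIsLittleEndian() {
  const uint16_t probe = 1;
  unsigned char first = 0;
  std::memcpy(&first, &probe, 1);
  return first == 1;
}

// Single normalization step: host-order scalar -> little-endian bytes.
template <typename T>
void AppendLittleEndian(T value, std::string* dst) {
  static_assert(std::is_arithmetic_v<T>, "fixed-width scalar expected");
  char bytes[sizeof(T)];
  std::memcpy(bytes, &value, sizeof(T));
  if (!HostIsLittleEndian()) std::reverse(bytes, bytes + sizeof(T));
  dst->append(bytes, sizeof(T));
}

// Inverse step: little-endian bytes at `src` -> host-order scalar.
template <typename T>
T ReadLittleEndian(const char* src) {
  static_assert(std::is_arithmetic_v<T>, "fixed-width scalar expected");
  char bytes[sizeof(T)];
  std::memcpy(bytes, src, sizeof(T));
  if (!HostIsLittleEndian()) std::reverse(bytes, bytes + sizeof(T));
  T value;
  std::memcpy(&value, bytes, sizeof(T));
  return value;
}

// Hypothetical stand-in for an INT96 value: three 32-bit limbs.
struct Int96Sketch {
  uint32_t value[3];
};

// Each limb goes through the same normalization step, in limb order.
inline std::string EncodeInt96(const Int96Sketch& v) {
  std::string out;
  for (uint32_t limb : v.value) AppendLittleEndian(limb, &out);
  return out;
}

// Reads the three limbs back; `bytes` must hold at least 12 bytes.
inline Int96Sketch DecodeInt96(const std::string& bytes) {
  Int96Sketch v{};
  for (int i = 0; i < 3; ++i) {
    v.value[i] = ReadLittleEndian<uint32_t>(bytes.data() + 4 * i);
  }
  return v;
}
```

Because encode and decode share the same host check, a value written on one architecture should read back unchanged on another, which is the round-trip property the comment is concerned with.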

Vishwanatha-HD requested a review from wgtmac as a code owner November 21, 2025 15:01

Vishwanatha-HD mentioned this pull request Nov 21, 2025

[C++][Parquet] Fix Statistics logic to enable Parquet DB support on Big-Endian (s390x) systems #48206

Open

github-actions bot added the Component: Parquet, Component: C++, and awaiting review labels Nov 21, 2025

k8ika0s mentioned this pull request Nov 21, 2025

GH-48213: [C++][Parquet] Fix endianness and test failures on s390x (big-endian) (supersedes partial fixes) #48212

Open

Vishwanatha-HD mentioned this pull request Nov 21, 2025

[C++][Parquet] Enable Parquet DB support on Big Endian (IBM Z) systems #48151

Open

GH-48206 Fix Statistics logic to enable Parquet DB support on s390x

fc2f62c

Vishwanatha-HD force-pushed the fixStatistics branch from 69807c0 to fc2f62c Compare November 22, 2025 05:01

kou reviewed Nov 22, 2025

