GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x #48205

Vishwanatha-HD · 2025-11-21T14:18:53Z

Rationale for this change

This PR is intended to enable Parquet DB support on big-endian (s390x) systems. It fixes the column reader and writer logic, which most of the parquet and arrow-parquet test cases exercise.

What changes are included in this PR?

The fix includes changes to the following files:
cpp/src/parquet/column_reader.cc
cpp/src/parquet/column_writer.cc
cpp/src/parquet/column_writer.h

Are these changes tested?

Yes. The changes were tested on s390x to confirm that they work, and on x86 to confirm that they introduce no regressions.

Are there any user-facing changes?

No

GitHub main Issue link: #48151

GitHub Issue: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on Big-Endian (s390x) systems #48204

github-actions · 2025-11-21T14:19:19Z

⚠️ GitHub issue #48204 has been automatically assigned in GitHub to PR creator.


kou · 2025-11-22T12:49:23Z

cpp/src/parquet/column_writer.h

  int64_t julian_days = (time / UnitPerDay) + kJulianEpochOffsetDays;
+#if ARROW_LITTLE_ENDIAN
  (*impala_timestamp).value[2] = (uint32_t)julian_days;
+#endif


Do we need this #if?

It seems that the below (*impala_timestamp).value[2] = static_cast<uint32_t>(julian_days); does the same thing.

kou · 2025-11-22T13:01:21Z

cpp/src/parquet/column_writer.h

  auto last_day_nanos = last_day_units * NanosecondsPerUnit;
+#if ARROW_LITTLE_ENDIAN
  // impala_timestamp will be unaligned every other entry so do memcpy instead
  // of assign and reinterpret cast to avoid undefined behavior.
  std::memcpy(impala_timestamp, &last_day_nanos, sizeof(int64_t));
+#else
+  (*impala_timestamp).value[0] = static_cast<uint32_t>(last_day_nanos);
+  (*impala_timestamp).value[1] = static_cast<uint32_t>(last_day_nanos >> 32);


Can we use the following instead of #if?

  auto last_day_nanos = last_day_units * NanosecondsPerUnit;
  auto last_day_nanos_little_endian = ::arrow::bit_util::ToLittleEndian(last_day_nanos);
  std::memcpy(impala_timestamp, &last_day_nanos_little_endian, sizeof(int64_t));
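
The suggestion above can be sketched in isolation. This is a minimal, self-contained illustration, not Arrow's implementation: `Int96Sketch`, `ToLittleEndian64`, and `StoreLastDayNanos` are hypothetical names, and `ToLittleEndian64` only stands in for `::arrow::bit_util::ToLittleEndian` so that the example compiles without Arrow headers.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for parquet::Int96: three 32-bit words (12 bytes).
struct Int96Sketch {
  uint32_t value[3];
};

// Hypothetical stand-in for ::arrow::bit_util::ToLittleEndian on a 64-bit value.
// The returned integer has little-endian bytes in memory on any host, so it is
// the identity on little-endian machines and a byte swap on big-endian ones.
inline uint64_t ToLittleEndian64(uint64_t v) {
  unsigned char bytes[sizeof(v)];
  for (unsigned i = 0; i < sizeof(v); ++i) {
    bytes[i] = static_cast<unsigned char>(v >> (8 * i));  // least significant byte first
  }
  uint64_t out;
  std::memcpy(&out, bytes, sizeof(out));
  return out;
}

// Writes the nanoseconds-of-day into the first 8 bytes of the timestamp.
// memcpy is used instead of a 64-bit store because the 12-byte struct is only
// 4-byte aligned.
inline void StoreLastDayNanos(Int96Sketch* ts, int64_t last_day_nanos) {
  assert(ts != nullptr);
  const uint64_t le = ToLittleEndian64(static_cast<uint64_t>(last_day_nanos));
  std::memcpy(ts, &le, sizeof(le));
}
```

With this shape the same two statements run on both architectures, so no `#if` is needed around the store.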

kou · 2025-11-22T13:22:59Z

cpp/src/parquet/column_writer.cc

+#else
+template <>
+struct SerializeFunctor<::parquet::FLBAType, ::arrow::HalfFloatType> {
+  Status Serialize(const ::arrow::HalfFloatArray& array, ArrowWriteContext*, FLBA* out) {
+    const uint16_t* values = array.raw_values();
+    const int64_t length = array.length();
+
+    // Allocate buffer for little-endian converted values
+    converted_values_.resize(length);
+
+    if (array.null_count() == 0) {
+      for (int64_t i = 0; i < length; ++i) {
+        converted_values_[i] = ::arrow::bit_util::ToLittleEndian(values[i]);
+        out[i] = FLBA{reinterpret_cast<const uint8_t*>(&converted_values_[i])};
+      }
+    } else {
+      for (int64_t i = 0; i < length; ++i) {
+        if (array.IsValid(i)) {
+          converted_values_[i] = ::arrow::bit_util::ToLittleEndian(values[i]);
+          out[i] = FLBA{reinterpret_cast<const uint8_t*>(&converted_values_[i])};
+        } else {
+          out[i] = FLBA{};
+        }
+      }
+    }
+    return Status::OK();
+  }
+
+ private:
+  std::vector<uint16_t> converted_values_;
+};
+#endif


Could you share as much of the implementation as possible, with something like:

template <>
struct SerializeFunctor<::parquet::FLBAType, ::arrow::HalfFloatType> {
  Status Serialize(const ::arrow::HalfFloatArray& array, ArrowWriteContext*, FLBA* out) {
#if ARROW_LITTLE_ENDIAN
    return SerializeLittleEndianValues(array.raw_values(), out);
#else
    const uint16_t* values = array.raw_values();
    const int64_t length = array.length();
    converted_values_.resize(length);
    for (int64_t i = 0; i < length; ++i) {
      // We don't need IsValid() here. Non valid values are just ignored in
      // SerializeLittleEndianValues().
      converted_values_[i] = ::arrow::bit_util::ToLittleEndian(values[i]);
    }
    return SerializeLittleEndianValues(converted_values_.data(), out);
#endif
  }

 private:
  Status SerializeLittleEndianValues(const uint16_t* values, FLBA* out) {
    if (array.null_count() == 0) {
      for (int64_t i = 0; i < array.length(); ++i) {
        out[i] = ToFLBA(&values[i]);
      }
    } else {
      for (int64_t i = 0; i < array.length(); ++i) {
        out[i] = array.IsValid(i) ? ToFLBA(&values[i]) : FLBA{};
      }
    }
    return Status::OK();
  }

  FLBA ToFLBA(const uint16_t* value_ptr) const {
    return FLBA{reinterpret_cast<const uint8_t*>(value_ptr)};
  }

#if !ARROW_LITTLE_ENDIAN
  std::vector<uint16_t> converted_values_;
#endif
};
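
The shared-helper idea can be tried outside Arrow with a small stand-alone sketch. Everything named here (`FlbaSketch`, `ToLittleEndian16`, `HalfFloatSerializerSketch`) is hypothetical and only mimics the shape of the suggestion. The helper in the suggestion refers to `array`, which a private member function would not see, so this sketch assumes the validity information is passed in explicitly as a `std::vector<bool>`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for parquet::FixedLenByteArray (FLBA).
struct FlbaSketch {
  const uint8_t* ptr;
};

// Hypothetical stand-in for ::arrow::bit_util::ToLittleEndian on 16-bit input.
inline uint16_t ToLittleEndian16(uint16_t v) {
  const unsigned char bytes[2] = {static_cast<unsigned char>(v & 0xFF),
                                  static_cast<unsigned char>(v >> 8)};
  uint16_t out;
  std::memcpy(&out, bytes, sizeof(out));
  return out;
}

class HalfFloatSerializerSketch {
 public:
  // valid[i] plays the role of array.IsValid(i); `values` and `out` must both
  // hold valid.size() entries.
  void Serialize(const uint16_t* values, const std::vector<bool>& valid,
                 FlbaSketch* out) {
    assert(valid.empty() || (values != nullptr && out != nullptr));
    converted_values_.resize(valid.size());
    for (std::size_t i = 0; i < valid.size(); ++i) {
      // Null slots are converted too; their results are never referenced.
      converted_values_[i] = ToLittleEndian16(values[i]);
    }
    SerializeLittleEndianValues(converted_values_.data(), valid, out);
  }

 private:
  // Shared by both byte orders: takes pointers to already little-endian values.
  static void SerializeLittleEndianValues(const uint16_t* values,
                                          const std::vector<bool>& valid,
                                          FlbaSketch* out) {
    for (std::size_t i = 0; i < valid.size(); ++i) {
      out[i] = valid[i] ? FlbaSketch{reinterpret_cast<const uint8_t*>(&values[i])}
                        : FlbaSketch{nullptr};
    }
  }

  // Owns the converted bytes; `out` points into it, so it must outlive `out`.
  std::vector<uint16_t> converted_values_;
};
```

A little-endian build could pass `values` straight to the helper, as in the suggestion. The sketch always copies so that a single code path is exercised on any host.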

kou · 2025-11-22T13:33:45Z

cpp/src/parquet/column_reader.cc

+#if ARROW_LITTLE_ENDIAN
      if (num_bytes < 0 || num_bytes > data_size - 4) {
+#else
+      if (num_bytes < 0 || num_bytes > data_size) {
+#endif


@pitrou You added - 4 in #6848. Do you think that we need - 4 with big endian too?
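
Some context for the `- 4` question: in Parquet data page V1, RLE-encoded levels are preceded by a 4-byte little-endian length. The sketch below assumes that `num_bytes` comes from such a prefix at the start of the buffer (the surrounding reader code is not shown in this thread) and uses a hypothetical helper, `ReadLevelsByteCount`. Under that assumption the prefix occupies 4 bytes of `data_size` on any host, and only the decoding of the prefix depends on byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the payload size announced by a 4-byte
// little-endian length prefix, or -1 if the buffer cannot hold that payload.
inline int32_t ReadLevelsByteCount(const uint8_t* data, int32_t data_size) {
  constexpr int32_t kPrefixSize = 4;
  assert(data != nullptr || data_size <= 0);
  if (data_size < kPrefixSize) {
    return -1;  // not even room for the prefix
  }
  // Assemble the prefix byte by byte so the result is the same on any host.
  const uint32_t raw = static_cast<uint32_t>(data[0]) |
                       (static_cast<uint32_t>(data[1]) << 8) |
                       (static_cast<uint32_t>(data[2]) << 16) |
                       (static_cast<uint32_t>(data[3]) << 24);
  int32_t num_bytes;
  std::memcpy(&num_bytes, &raw, sizeof(num_bytes));
  // The prefix itself consumes kPrefixSize bytes of the buffer.
  if (num_bytes < 0 || num_bytes > data_size - kPrefixSize) {
    return -1;
  }
  return num_bytes;
}
```

For a 7-byte buffer whose prefix announces 4 payload bytes, a check against `data_size` alone would accept the page even though only 3 payload bytes follow the prefix; the check against `data_size - 4` rejects it.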

Vishwanatha-HD requested a review from wgtmac as a code owner November 21, 2025 14:18

github-actions bot added Component: Parquet Component: C++ awaiting review Awaiting review labels Nov 21, 2025

k8ika0s mentioned this pull request Nov 21, 2025

GH-48213: [C++][Parquet] Fix endianness and test failures on s390x (big-endian) (supersedes partial fixes) #48212

Open

Vishwanatha-HD mentioned this pull request Nov 21, 2025

[C++][Parquet] Enable Parquet DB support on Big Endian (IBM Z) systems #48151

Open

apacheGH-48204 Fix Column Reader & Writer logic to enable Parquet DB …

9959543

…support on s390x

Vishwanatha-HD force-pushed the fixColumnReaderWriter branch from 82d9390 to 9959543 Compare November 22, 2025 05:00

kou changed the title ~~GH-48204 Fix Column Reader & Writer logic to enable Parquet DB suppor…~~ GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x Nov 22, 2025

kou reviewed Nov 22, 2025


Vishwanatha-HD commented Nov 21, 2025 • edited by kou


2 participants
