Skip to content

Conversation

@rynewang
Copy link

@rynewang rynewang commented Jan 6, 2026

Rationale for this change

As noted by @wgtmac in #48717 (comment), calling std::memcpy with a nullptr source is undefined behavior according to the C++ standard, even when the size is 0.

When ByteArray has len=0 and ptr=nullptr (the default-constructed state per types.h:649), the current code invokes UB.

What changes are included in this PR?

Added guards to check len > 0 before calling memcpy in:

  • TypedStatisticsImpl<ByteArrayType>::Copy in statistics.cc
  • DictDecoderImpl<ByteArrayType>::SetDict in decoder.cc
  • Test utilities in statistics_test.cc and column_writer_test.cc

Are these changes tested?

Existing tests pass. The fix prevents potential UB that sanitizers or optimizing compilers could exploit.

Are there any user-facing changes?

No.

…nullptr

When ByteArray has len=0 and ptr=nullptr (the default-constructed state),
calling std::memcpy with nullptr is undefined behavior according to the
C++ standard, even when size is 0.

This commit adds guards to check len > 0 before calling memcpy in:
- TypedStatisticsImpl<ByteArrayType>::Copy (statistics.cc)
- DictDecoderImpl<ByteArrayType>::SetDict (decoder.cc)
- Test utilities (statistics_test.cc, column_writer_test.cc)

Claude-Generated-By: Claude Code (cli/claude-opus-4-5=100%)
Claude-Steers: 1
Claude-Permission-Prompts: 8
Claude-Escapes: 1
@rynewang rynewang requested a review from wgtmac as a code owner January 6, 2026 17:58
@github-actions
Copy link

github-actions bot commented Jan 6, 2026

⚠️ GitHub issue #48744 has been automatically assigned in GitHub to PR creator.

@wgtmac
Copy link
Member

wgtmac commented Jan 8, 2026

BTW, are there other similar calls in the parquet folder? I just did a quick peek and it seems that

void ConcatenateBuffers(int64_t definition_levels_rle_size,
is a missing part?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 8, 2026
@rynewang
Copy link
Author

rynewang commented Jan 8, 2026

this is getting scattered all the places, so I think we can just provide a SafeMemcpy that does the size>0 check and do memcpy, will update the branch

Introduce a centralized SafeMemcpy function in types.h that guards against
calling std::memcpy with nullptr, which is undefined behavior even when
size is 0.

Replace memcpy calls throughout parquet with SafeMemcpy where the size
is dynamic and could be 0 (ByteArray lengths, buffer sizes, RLE sizes).
Keep std::memcpy for static sizes (sizeof, FLBA type_length) that are
guaranteed > 0.
@rynewang
Copy link
Author

@wgtmac can you help merging this? Thanks!

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it is a good idea to wrap memcpy though the performance might not affected if it is inlined. My concern is the inconsistently mixed use of the wrapper and the raw function across the code base which confuses developers on which to choose in the future.

Should we just consider using std::copy_n to replace memcpy in the parquet subdirectory where variable-length is copied to limit the scope of this change? In principle, std::copy_n is safer and can be optimized by compilers.

WDYT @pitrou?

///
/// According to the C++ standard, calling std::memcpy with a nullptr is undefined
/// behavior even when size is 0. This function guards against that case.
static inline void SafeMemcpy(void* dest, const void* src, size_t size) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be the wrong place to put SafeMemcpy. It seems cpp/src/arrow/util/memory_internal.h is a better home.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have arrow/util/ubsan.h and there's a handy function MakeNonNull there.

But I wouldn't be against adding a SafeMemcpy to that same file if it makes things easier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants