Skip to content

Conversation

@scott-routledge2
Copy link
Contributor

@scott-routledge2 scott-routledge2 commented Nov 19, 2025

Rationale for this change

Casting Binary offset -> Binary offset types relies on ZeroCopyCastExec, which propagates the offset of the input to the output. This can lead to larger allocations than necessary when casting arrays with offsets.

See #43660 and
#43661 for more context.

What changes are included in this PR?

Ensure output array has a small offset (it can still be non-zero since reusing the null bitmap requires in_offset % 8 == out_offset % 8)

Are these changes tested?

Ran unit tests and benchmarked locally.

Are there any user-facing changes?

No

@felipecrv felipecrv self-requested a review November 19, 2025 18:17
Copy link
Contributor

@felipecrv felipecrv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great improvement! But you should add tests of casting of [Large]String containing non-zero and >8 offsets. I don't think your current implementation would work without also slicing the offsets buffer.

} else {
output->offset = 0;
}

Copy link
Contributor

@felipecrv felipecrv Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to make sure all cases are covered correctly here. enable_if_t<is_base_binary_type<I>::value && is_base_binary_type<O>::value, Status> covers all pairs of Binary/LargeBinary/String/LargeString.

if constexpr (!I::is_utf8 && O::is_utf8) { handles the UTF8 validation in case we are going from {Binary, LargeBinary} to {String, LargeString}.

The allocation issue is only a problem when going from 32 to 64 or 64 to 32, so you can shape your new code like this:

if constexpr (sizeof(typename I::offset_type) != sizeof(typename O::offset_type)) {
  std::shared_ptr<ArrayData> input_arr = input.ToArrayData();
  ArrayData* output = out->array_data().get();
  // ...
  if (output->buffers[0]) {
    ...


  return CastBinaryToBinaryOffsets<typename I::offset_type, typename O::offset_type>(
        ctx, input, out->array_data().get()); 
} else {
  return ZeroCopyCastExec(ctx, batch, out);
}

Useful diagram I made some time ago: https://gist.github.com/felipecrv/3c02f3784221d946dec1b031c6d400db

Comment on lines +310 to +311
output->length = input_arr->length;
output->SetNullCount(input_arr->null_count);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
output->length = input_arr->length;
output->SetNullCount(input_arr->null_count);
output->length = input_arr->length;
// output->offset is set below
output->SetNullCount(input_arr->null_count);

output->length = input_arr->length;
output->SetNullCount(input_arr->null_count);
output->buffers = std::move(input_arr->buffers);
output->child_data = std::move(input_arr->child_data);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
output->child_data = std::move(input_arr->child_data);
// binary/string arrays don't have child_data

Comment on lines +315 to +321
if (output->buffers[0]) {
// If reusing the null bitmap, ensure offset into the first byte is the same as input.
output->offset = input_arr->offset % 8;
output->buffers[0] = SliceBuffer(output->buffers[0], input_arr->offset / 8);
} else {
output->offset = 0;
}
Copy link
Contributor

@felipecrv felipecrv Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you should also slice the offsets buffer when you change the array offset.

Suggested change
if (output->buffers[0]) {
// If reusing the null bitmap, ensure offset into the first byte is the same as input.
output->offset = input_arr->offset % 8;
output->buffers[0] = SliceBuffer(output->buffers[0], input_arr->offset / 8);
} else {
output->offset = 0;
}
// slice buffers to reduce allocation when casting the offsets buffer
auto length = input_arr->length;
int64_t buffer_offset = 0;
if (output->null_count != 0 && output->buffers[0]) {
// avoiding reallocation of the validity buffer by allowing some padding bits
output->offset = input_arr->offset % 8;
buffer_offset = input_arr->offset / 8;
} else {
output->offset = 0;
buffer_offset = input_arr->offset;
}
output->buffers[0] = SliceBuffer(output->buffers[0], buffer_offset, length);
output->buffers[1] = SliceBuffer(output->buffers[1], buffer_offset, length);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please review my suggestions and tell if something is weird. This stuff is hard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestions! I didn't slice the offset buffers originally because they are reallocated by CastBinaryToBinaryOffsets with the correct offset. Although, this would only be true for the case where the offset type changes, in the case where the offset stays the same then we would either have to slice here or call ZeroCopyCastExec like in the comment above?

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Nov 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants