[rust] Add end-to-end ROW type #544

Merged
fresh-borzoni merged 2 commits into apache:main from fresh-borzoni:row-type-revive
May 13, 2026

Conversation

@fresh-borzoni
Member

@fresh-borzoni fresh-borzoni commented May 10, 2026

Continues work from #442 (which went stale).
The original implementation is by @hemanthsavasere; this PR rebases it on current main, addresses review feedback, and adds the follow-ups listed below.

Closes #388
Closes #442

Original work

  • Datum::Row, InternalRow::get_row, CompactedRow ROW deserializer
  • NestedRowWriter, ROW arm in compacted key encoder
  • ROW round-trip unit tests

Added in this revival

  • field_id machinery (server requires unique ids across nested ROW —
    matches Java's ReassignFieldId)
  • ARRAY<ROW> support
  • Arrow log-path nested-ROW disambiguation
  • Per-row Vec<OnceLock> allocation killed
  • new tests
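
To illustrate the field_id point above, here is a minimal sketch of assigning ids that are unique across nested ROW fields. The `DataType`, `Field`, `reassign_field_ids`, and `collect_ids` names are hypothetical stand-ins, not the types in this PR, and the depth-first numbering order is an assumption rather than a confirmed match for Java's ReassignFieldId.

```rust
// Hypothetical, simplified schema model; not the crate's real types.
#[derive(Debug)]
enum DataType {
    Int,
    String,
    Array(Box<DataType>),
    Row(Vec<Field>),
}

#[derive(Debug)]
struct Field {
    name: String,
    id: i32,
    data_type: DataType,
}

/// Walks the schema depth-first and gives every field, top-level or nested,
/// the next id from one shared counter, so no two fields share an id.
/// (The numbering order is an assumption made for this sketch.)
fn reassign_field_ids(fields: &mut [Field], next_id: &mut i32) {
    for field in fields.iter_mut() {
        field.id = *next_id;
        *next_id += 1;
        reassign_in_type(&mut field.data_type, next_id);
    }
}

/// Recurses into ROW children and ARRAY element types (covers ARRAY<ROW>).
fn reassign_in_type(data_type: &mut DataType, next_id: &mut i32) {
    match data_type {
        DataType::Row(children) => reassign_field_ids(children, next_id),
        DataType::Array(element) => reassign_in_type(element, next_id),
        DataType::Int | DataType::String => {}
    }
}

/// Collects ids in the same pre-order, which is handy for checking uniqueness.
fn collect_ids(fields: &[Field], out: &mut Vec<i32>) {
    for field in fields {
        out.push(field.id);
        collect_in_type(&field.data_type, out);
    }
}

fn collect_in_type(data_type: &DataType, out: &mut Vec<i32>) {
    match data_type {
        DataType::Row(children) => collect_ids(children, out),
        DataType::Array(element) => collect_in_type(element, out),
        DataType::Int | DataType::String => {}
    }
}
```

Sharing one counter across the whole walk is what keeps ids unique even when a ROW is nested inside an ARRAY or another ROW.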

Left follow-up #543 to address performance with nested structures, since this PR is big enough already and the gap comes from pre-existing code.

- Add `Datum::Row(Box<GenericRow>)` variant with `as_row()` accessor
- Add `get_row()` to `InternalRow` trait with default error impl
- Implement `GenericRow::get_row()` and `CompactedRow::get_row()` delegation
- Implement `ColumnarRow::get_row()` with Arrow StructArray extraction + OnceLock caching
- Add `InnerValueWriter::Row(RowType)` and write path via nested CompactedRowWriter
- Add `DataType::Row` arm in `CompactedRowDeserializer` for eager nested decode
- Add `InnerFieldGetter::Row` and hook up FieldGetter/ValueWriter pipeline
- Handle `Datum::Row` in `resolve_row_types` (C++ bindings)
- Add round-trip tests: simple nesting, deep nesting, nullable fields, ROW as primary key
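
The first few items in this list can be pictured with a minimal sketch. The types below are simplified, hypothetical stand-ins that reuse the names from the list; the `String` error type, the `FlatRow` helper, and the `get_int` method are assumptions made only to keep the example self-contained.

```rust
// Simplified stand-ins; the real crate's definitions will differ.
#[derive(Debug, Clone, PartialEq)]
enum Datum {
    Null,
    Int(i32),
    // Boxed so the recursive type has a known size.
    Row(Box<GenericRow>),
}

#[derive(Debug, Clone, PartialEq)]
struct GenericRow {
    values: Vec<Datum>,
}

impl Datum {
    /// Returns the nested row if this datum is a ROW, otherwise `None`.
    fn as_row(&self) -> Option<&GenericRow> {
        match self {
            Datum::Row(row) => Some(&**row),
            _ => None,
        }
    }
}

trait InternalRow {
    fn get_int(&self, pos: usize) -> Result<i32, String>;

    /// Default implementation: row kinds that do not support nested ROWs
    /// report an error instead of having to implement the method.
    fn get_row(&self, pos: usize) -> Result<&GenericRow, String> {
        Err(format!("get_row is not supported (position {})", pos))
    }
}

impl InternalRow for GenericRow {
    fn get_int(&self, pos: usize) -> Result<i32, String> {
        match self.values.get(pos) {
            Some(Datum::Int(v)) => Ok(*v),
            _ => Err(format!("field {} is not an INT", pos)),
        }
    }

    fn get_row(&self, pos: usize) -> Result<&GenericRow, String> {
        self.values
            .get(pos)
            .and_then(Datum::as_row)
            .ok_or_else(|| format!("field {} is not a ROW", pos))
    }
}

/// Hypothetical row kind that relies on the default `get_row`.
struct FlatRow(Vec<i32>);

impl InternalRow for FlatRow {
    fn get_int(&self, pos: usize) -> Result<i32, String> {
        self.0.get(pos).copied().ok_or_else(|| format!("no field {}", pos))
    }
}
```

With a default error implementation on the trait, existing row types keep compiling and only the ones that can actually hold a ROW override `get_row`.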

Wire format matches Java: varint-length-prefixed blob of a complete CompactedRow.
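
A rough sketch of what a varint-length-prefixed blob looks like on the wire. It assumes an unsigned LEB128-style varint (7 payload bits per byte, high bit as continuation flag), which is a common choice but is not confirmed here as the exact encoding used; the nested CompactedRow is treated as opaque bytes, and all function names are hypothetical.

```rust
/// Writes `value` as an unsigned varint: 7 bits per byte, low bits first,
/// with the high bit set on every byte except the last.
fn write_varint(mut value: u32, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push((value as u8 & 0x7F) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Reads a varint and returns `(value, bytes_consumed)`, or `None` if the
/// input ends before the varint terminates (at most 5 bytes for a u32).
fn read_varint(buf: &[u8]) -> Option<(u32, usize)> {
    let mut value: u32 = 0;
    for (i, &byte) in buf.iter().enumerate().take(5) {
        value |= u32::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Appends the already-serialized nested row, prefixed by its byte length.
fn write_nested_row(nested: &[u8], out: &mut Vec<u8>) {
    write_varint(nested.len() as u32, out);
    out.extend_from_slice(nested);
}

/// Returns the nested row bytes and the total number of bytes consumed,
/// or `None` if the buffer is truncated.
fn read_nested_row(buf: &[u8]) -> Option<(&[u8], usize)> {
    let (len, prefix) = read_varint(buf)?;
    let end = prefix.checked_add(len as usize)?;
    let blob = buf.get(prefix..end)?;
    Some((blob, end))
}
```

Because the prefix carries the blob length, a reader can skip or slice out a nested row without understanding its internal layout.
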
@fresh-borzoni
Member Author

@leekeiabstraction @charlesdong1991 @luoyuxia PTAL 🙏

@fresh-borzoni fresh-borzoni force-pushed the row-type-revive branch 3 times, most recently from 54c8908 to 4ad3a28 Compare May 10, 2026 20:02
@charlesdong1991
Contributor

Nice! thanks for reviving the PR, will take a look tomorrow! 🙏

Contributor

@charlesdong1991 charlesdong1991 left a comment

Thanks for the great work, overall looks great!! 👍 Honestly, I paid a bit more attention to the additions in this revival 😅 and only have a couple of minor comments. Given the size of the PR, it's probably better to handle them in follow-up PRs instead.

In follow-up PRs we can also add corresponding docs to reflect the new change!

let column = self.record_batch.column(pos);
// Children of a null parent may carry stale bytes; caller must
// check is_null_at first rather than rely on what we'd read.
if column.is_null(self.row_id) {
Contributor

I wonder: should the user check is_null_at before calling get_row?

Member Author

@fresh-borzoni fresh-borzoni May 13, 2026

it will disappear after #543
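
For readers following this thread, a minimal sketch of the "check is_null_at first" calling pattern. Everything here is hypothetical and simplified: `StructColumn`, the `i32` child payload, the `String` error, and the choice to return an error for a null parent are assumptions, since the diff excerpt does not show what the null branch does.

```rust
// Hypothetical column: one validity flag and one child value per row.
// Child slots under a null parent may hold stale data.
struct StructColumn {
    validity: Vec<bool>,
    child_values: Vec<i32>,
}

struct ColumnarRow {
    columns: Vec<StructColumn>,
    row_id: usize,
}

impl ColumnarRow {
    fn is_null_at(&self, pos: usize) -> bool {
        !self.columns[pos].validity[self.row_id]
    }

    /// Assumption for this sketch: a null parent is reported as an error
    /// rather than exposing whatever bytes sit in the child slot.
    fn get_row(&self, pos: usize) -> Result<i32, String> {
        if self.is_null_at(pos) {
            return Err(format!("field {} is null at row {}", pos, self.row_id));
        }
        Ok(self.columns[pos].child_values[self.row_id])
    }
}

/// Caller-side pattern: test for null first, then read the nested value.
fn read_nested(row: &ColumnarRow, pos: usize) -> Result<Option<i32>, String> {
    if row.is_null_at(pos) {
        Ok(None)
    } else {
        row.get_row(pos).map(Some)
    }
}
```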

- self.row_id = row_id
+ self.row_id = row_id;
+ for lock in self.row_caches.iter_mut() {
+     *lock = std::sync::OnceLock::new();
Contributor

You mentioned in the PR description that the per-row Vec<OnceLock> allocation was killed. Looking at this, I wonder whether it still allocates a OnceLock on every row iteration?

Member Author

ditto
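
As background for this thread, a minimal sketch of resetting per-column caches in place. `RowCursor`, its `String` cache payload, and the `cached` method are hypothetical and only mirror the shape of the snippet under review. In this sketch the `Vec` is allocated once; assigning `OnceLock::new()` into an existing slot drops any cached value but does not reallocate the `Vec`, and `OnceLock::new()` itself is a `const fn` that performs no heap allocation.

```rust
use std::sync::OnceLock;

// Hypothetical cursor with one lazily filled cache slot per column.
struct RowCursor {
    row_id: usize,
    row_caches: Vec<OnceLock<String>>,
}

impl RowCursor {
    /// Allocates the cache vector once, up front.
    fn new(num_columns: usize) -> Self {
        Self {
            row_id: 0,
            row_caches: (0..num_columns).map(|_| OnceLock::new()).collect(),
        }
    }

    /// Moves to another row and clears every slot in place.
    /// The Vec's buffer is reused; only the cached values are dropped.
    fn set_row_id(&mut self, row_id: usize) {
        self.row_id = row_id;
        for lock in self.row_caches.iter_mut() {
            *lock = OnceLock::new();
        }
    }

    /// Computes the value on first access for the current row, then reuses it.
    fn cached(&self, pos: usize) -> &String {
        self.row_caches[pos].get_or_init(|| format!("row {} col {}", self.row_id, pos))
    }
}
```

The remaining per-row cost in a design like this is the loop over all slots plus dropping and recomputing the cached values, not a fresh `Vec` allocation.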

- // TODO: add Map and Row field getter support once their binary forms are implemented.
+ // TODO: add Map field getter support once its binary form is implemented.
  InnerFieldGetter::Array { pos } => Datum::Array(row.get_array(*pos)?),
+ InnerFieldGetter::Row { pos } => Datum::Row(Box::new(row.get_row(*pos)?.clone())),
Contributor

I wonder how performant this will be, since it seems to clone the whole generic row. Maybe we can flag this for hot scan paths and revisit it later if needed, because the PR is huge already.

Member Author

ditto

@fresh-borzoni
Member Author

Ty for the review @charlesdong1991
Yes, I have a follow-up planned where I switch to typed writers, so I will include the docs there as well.

@fresh-borzoni fresh-borzoni merged commit 2b719f1 into apache:main May 13, 2026
10 checks passed

Development

Successfully merging this pull request may close these issues.

Row data type support in Rust

3 participants