GH-45523: [R] Implement Utf8View type bindings #49712
Draft
thisisnic wants to merge 9 commits into apache:main from
We should rebase once #49710 is merged: changes from that PR are included in this branch because they were needed to make it work.
e919109 to e8768d0
e8768d0 to bb711a7
thisisnic commented May 5, 2026
Comment on lines +627 to +630
cpp11::stop(
    "Cannot convert Dictionary Array of type `%s` to R: dictionary has "
    "more levels than an R factor can represent",
    dict_type.ToString().c_str());
We should add a test for this
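A test for this error path needs an input that crosses the limit the guard enforces. The following is a minimal sketch of such a bounds check, assuming the limit is the largest 32-bit signed integer because R factor codes are stored as R integers. The helper name `FitsInRFactor` and the exact threshold are illustrative assumptions, not code taken from this PR.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not from the PR): returns true when a dictionary with
// `n_levels` entries could be indexed by R factor codes. Assumes factor codes
// are 32-bit signed integers, so the level count must not exceed INT32_MAX.
bool FitsInRFactor(int64_t n_levels) {
  return n_levels >= 0 && n_levels <= std::numeric_limits<int32_t>::max();
}
```

Under that assumption, a test would build (or mock) a dictionary whose level count is one past the limit and assert that conversion to R fails with the message quoted above.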
thisisnic commented May 6, 2026
Comment on lines +829 to +830
return this->value_builder_->Append(
    std::string_view(view_.bytes, static_cast<size_t>(view_.size)));
The old code passed (const char*, int32_t), which matches StringBuilder::Append but not StringViewBuilder::Append (which takes int64_t). Switching to std::string_view works for both builder types.
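The idea can be illustrated with a small sketch: when two builders disagree on the length type of their pointer-plus-length overload, a shared call site can target the `std::string_view` overload that both provide. `NarrowBuilder`, `WideBuilder`, and `AppendBytes` are hypothetical stand-ins invented for this illustration; they are not Arrow's classes and do not reproduce Arrow's overload sets.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical builder whose pointer+length overload takes a 32-bit length.
struct NarrowBuilder {
  std::string data;
  void Append(const char* bytes, int32_t length) {
    data.append(bytes, static_cast<std::size_t>(length));
  }
  void Append(std::string_view value) { data.append(value); }
};

// Hypothetical builder whose pointer+length overload takes a 64-bit length.
struct WideBuilder {
  std::string data;
  void Append(const char* bytes, int64_t length) {
    data.append(bytes, static_cast<std::size_t>(length));
  }
  void Append(std::string_view value) { data.append(value); }
};

// Generic call site: it always selects the string_view overload, so it does
// not depend on which integer type the other overload uses for the length.
template <typename Builder>
void AppendBytes(Builder& builder, const char* bytes, int64_t size) {
  builder.Append(std::string_view(bytes, static_cast<std::size_t>(size)));
}
```

Calling `AppendBytes` with either builder type compiles to the same overload choice, which is the property the change relies on.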
thisisnic commented May 6, 2026
Comment on lines +913 to +954
template <typename T>
class RPrimitiveConverter<T, enable_if_string_view<T>>
    : public PrimitiveConverter<T, RConverter> {
 public:
  Status Extend(SEXP x, int64_t size, int64_t offset = 0) override {
    RVectorType rtype = GetVectorType(x);
    if (rtype != STRING) {
      return Status::Invalid("Expecting a character vector");
    }
    return UnsafeAppendUtf8Strings(arrow::r::utf8_strings(x), size, offset);
  }

  void DelayedExtend(SEXP values, int64_t size, RTasks& tasks) override {
    auto task = [this, values, size]() { return this->Extend(values, size); };
    tasks.Append(false, std::move(task));
  }

 private:
  Status UnsafeAppendUtf8Strings(const cpp11::strings& s, int64_t size, int64_t offset) {
    RETURN_NOT_OK(this->primitive_builder_->Reserve(size - offset));
    const SEXP* p_strings = reinterpret_cast<const SEXP*>(DATAPTR_RO(s)) + offset;

    int64_t total_length = 0;
    for (R_xlen_t i = offset; i < size; i++, ++p_strings) {
      SEXP si = *p_strings;
      total_length += si == NA_STRING ? 0 : LENGTH(si);
    }
    RETURN_NOT_OK(this->primitive_builder_->ReserveData(total_length));

    p_strings = reinterpret_cast<const SEXP*>(DATAPTR_RO(s)) + offset;
    for (R_xlen_t i = offset; i < size; i++, ++p_strings) {
      SEXP si = *p_strings;
      if (si == NA_STRING) {
        this->primitive_builder_->UnsafeAppendNull();
      } else {
        this->primitive_builder_->UnsafeAppend(CHAR(si), LENGTH(si));
      }
    }

    return Status::OK();
  }
};
I'm less confident about this entire block of code
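For anyone reviewing that block, the pattern it follows is a two-pass append: a first pass sums the byte lengths of the non-missing strings so capacity can be reserved once, and a second pass appends without per-element capacity checks. The sketch below reproduces that shape in plain C++ under stated assumptions: `ToyStringBuilder` and `AppendRange` are hypothetical names invented here, `std::optional` stands in for `NA_STRING`, and none of this is Arrow's or R's API.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical toy builder: one contiguous byte buffer plus per-slot metadata.
struct ToyStringBuilder {
  std::string bytes;
  std::vector<std::size_t> lengths;
  std::vector<bool> valid;

  void Reserve(std::size_t slots) {
    lengths.reserve(lengths.size() + slots);
    valid.reserve(valid.size() + slots);
  }
  void ReserveData(std::size_t n_bytes) { bytes.reserve(bytes.size() + n_bytes); }
  // "Unsafe" here means the caller is expected to have reserved space first.
  void UnsafeAppend(const std::string& s) {
    bytes.append(s);
    lengths.push_back(s.size());
    valid.push_back(true);
  }
  void UnsafeAppendNull() {
    lengths.push_back(0);
    valid.push_back(false);
  }
};

using MaybeString = std::optional<std::string>;

// Appends values[offset, size) using the two-pass scheme described above.
void AppendRange(ToyStringBuilder& builder, const std::vector<MaybeString>& values,
                 std::size_t size, std::size_t offset) {
  builder.Reserve(size - offset);

  // Pass 1: total bytes of the non-missing strings in the range.
  std::size_t total_length = 0;
  for (std::size_t i = offset; i < size; ++i) {
    total_length += values[i] ? values[i]->size() : 0;
  }
  builder.ReserveData(total_length);

  // Pass 2: append each element; missing values become nulls.
  for (std::size_t i = offset; i < size; ++i) {
    if (values[i]) {
      builder.UnsafeAppend(*values[i]);
    } else {
      builder.UnsafeAppendNull();
    }
  }
}
```

The points worth checking in the real code are the same ones this sketch makes explicit: that both loops cover the same range, and that the reserved slot count matches the number of elements actually appended.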
A lot of |
Rationale for this change
The R package has no bindings for the Utf8View type.
What changes are included in this PR?
Implement R bindings for the Utf8View type.
Are these changes tested?
Yep
Are there any user-facing changes?
Yep, adding functionality.
AI Usage
Heavily used Codex/Claude here. I'm not confident in every line of code. I read it over and iterated on it, making sure that tests pass and nothing seemed wildly incorrect.