GH-49002: [Python] Fix array.to_pandas string type conversion for arrays with None#49247

Merged
AlenkaF merged 4 commits into
apache:mainfrom
AlenkaF:gh-49002-pandas-string-to_pandas-empty
Apr 1, 2026

Conversation

Member

@AlenkaF AlenkaF commented Feb 11, 2026

Rationale for this change

Converting a string-typed array to a pandas Series was still taking the old code path, even with pandas 3.0, when the array contained only a None element.

What changes are included in this PR?

Always check dtype in the _array_like_to_pandas conversion and use pandas' new default string dtype if available.

Are these changes tested?

Yes.

Are there any user-facing changes?

No, only bug fix.

Member

@raulcd raulcd left a comment

Thanks @AlenkaF, this looks good to me. It seems to be what @jorisvandenbossche proposed in the issue, and it is in line with what we have here:

# for pandas 3.0+, use pandas' new default string dtype
if _pandas_api.uses_string_dtype() and not strings_to_categorical:
    for field in table.schema:
        if field.name not in ext_columns and (
            pa.types.is_string(field.type)
            or pa.types.is_large_string(field.type)
            or pa.types.is_string_view(field.type)
        ) and field.name not in categories:
            ext_columns[field.name] = _pandas_api.pd.StringDtype(na_value=np.nan)

I'll wait until the end of the day in case @jorisvandenbossche has time to take a look; otherwise I'll merge.
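
For readers skimming the thread, the condition in the quoted snippet can be modeled without pyarrow or pandas installed. The sketch below is a simplified stand-in, not the actual pyarrow code: `pick_string_dtype`, `STRING_TYPE_NAMES`, and the `"str"` placeholder (standing in for `pd.StringDtype(na_value=np.nan)`) are hypothetical names used only for illustration.

```python
# Toy model of the check quoted above. All names here are hypothetical
# and are not part of pyarrow's API.
STRING_TYPE_NAMES = {"string", "large_string", "string_view"}


def pick_string_dtype(type_name, uses_string_dtype, strings_to_categorical=False):
    """Return a placeholder for the pandas dtype to request, or None.

    "str" stands in for pd.StringDtype(na_value=np.nan); None means
    "no override, fall back to the default conversion path".
    """
    if not uses_string_dtype or strings_to_categorical:
        return None
    if type_name in STRING_TYPE_NAMES:
        return "str"
    return None


# In this model the decision depends only on the Arrow type and the pandas
# settings, never on the values, so an all-None string array is treated
# the same as any other string array.
print(pick_string_dtype("string", uses_string_dtype=True))   # str
print(pick_string_dtype("int64", uses_string_dtype=True))    # None
print(pick_string_dtype("string", uses_string_dtype=False))  # None
```

Keeping the decision independent of the array contents is the property the PR description asks for; how the real `_array_like_to_pandas` achieves it is not shown in this thread.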

@github-actions github-actions Bot removed the awaiting review label Feb 12, 2026
@raulcd raulcd self-requested a review February 12, 2026 08:35
@github-actions github-actions Bot added the awaiting merge label Feb 12, 2026
Member

@raulcd raulcd left a comment

Oops, sorry, I've just realised there are several test failures which are related and we should fix. Should have checked CI before :)

@raulcd raulcd self-requested a review February 12, 2026 08:36
@github-actions github-actions Bot added awaiting changes and removed awaiting merge labels Feb 12, 2026
Comment thread python/pyarrow/array.pxi Outdated
Comment thread python/pyarrow/array.pxi Outdated
Member Author

AlenkaF commented Feb 12, 2026

Thanks for the quick reviews! I will go through the comments now; I was going through all the tests I just broke =) I should have put the PR back to draft and will do so next time.

@AlenkaF AlenkaF marked this pull request as draft February 12, 2026 12:05
@AlenkaF AlenkaF force-pushed the gh-49002-pandas-string-to_pandas-empty branch from 6f1fda5 to fec9077 Compare March 25, 2026 14:36
@github-actions github-actions Bot added awaiting change review and removed awaiting changes labels Mar 25, 2026
@AlenkaF AlenkaF marked this pull request as ready for review March 25, 2026 15:49
Member Author

AlenkaF commented Mar 25, 2026

Ok, this should be ready now. cc @raulcd @jorisvandenbossche for another round of review.

Member

@raulcd raulcd left a comment

@AlenkaF I am not super familiar with this, but in general it looks good to me. One minor nit about a typo in a comment.

def test_zero_copy_failure_on_object_types(self):
self.check_zero_copy_failure(pa.array(['A', 'B', 'C']))
if Version(pd.__version__) < Version("3.0.0"):
# pandas 3.0 includes default string dtype support
Member

I am not sure I understand why this test has to have this guard now. Isn't it supposed to work with pandas > 3.0.0?
I suppose this is because we are testing object types specifically. Was this test failing on CI? I haven't seen the failure.

Member Author

This should be connected to the change I made in this PR as strings are not converted to pandas object anymore. But looking at the test it might be a leftover from my previous wrong approach. Thanks for the comment, I need to check this!

Member Author

OK, got it. This test checks that strings cannot be zero-copied to pandas. That was true in the past because the C++ machinery constructed an object dtype from the PyArrow string type. Now, with pandas 3.0.0, the conversion goes through __from_arrow__, where no copies are needed.

Running this test locally with pandas 3.0.0 gives the following error:
______________________________________________ TestZeroCopyConversion.test_zero_copy_failure_on_object_types _______________________________________________

self = <pyarrow.tests.test_pandas.TestZeroCopyConversion object at 0x156a0af90>

    def test_zero_copy_failure_on_object_types(self):
>       self.check_zero_copy_failure(pa.array(['A', 'B', 'C']))

python/pyarrow/tests/test_pandas.py:2978: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pyarrow.tests.test_pandas.TestZeroCopyConversion object at 0x156a0af90>
arr = <pyarrow.lib.StringArray object at 0x15699b700>
[
  "A",
  "B",
  "C"
]

    def check_zero_copy_failure(self, arr):
>       with pytest.raises(pa.ArrowInvalid):
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E       Failed: DID NOT RAISE <class 'pyarrow.lib.ArrowInvalid'>

python/pyarrow/tests/test_pandas.py:2974: Failed
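
The guard discussed in this thread boils down to a version comparison. The following is a minimal sketch, assuming only that pandas 3.0 is the cut-over point mentioned above. `expects_zero_copy_failure` and `major_version` are hypothetical helpers, and the simple parser is a stand-in for the `Version` class used in the test.

```python
import re


def major_version(version_string):
    """Extract the leading major version number, e.g. "3.0.0" -> 3."""
    match = re.match(r"\d+", version_string)
    if match is None:
        raise ValueError(f"unrecognized version: {version_string!r}")
    return int(match.group())


def expects_zero_copy_failure(pandas_version):
    """Return True when a string array is expected to raise ArrowInvalid.

    Before pandas 3.0, strings are converted to object dtype, so the
    zero-copy check fails. From 3.0 on, the check no longer raises.
    Note: pre-releases such as "3.0.0rc0" count as 3.x here, which a
    full version parser would order before "3.0.0".
    """
    return major_version(pandas_version) < 3


print(expects_zero_copy_failure("2.3.3"))  # True
print(expects_zero_copy_failure("3.0.0"))  # False
```

A test written this way would only assert the failure on older pandas, which matches the traceback above: on 3.0.0 nothing is raised.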

Member

Yes, from what I can see this is an expected change, since string conversion will now actually be zero copy

(although, strictly speaking, it is not actually zero-copy entirely, because the test here is using string, and pandas will convert that to large_string. But I suppose that happens outside the view of pyarrow)

Member

@jorisvandenbossche jorisvandenbossche Apr 1, 2026

Essentially, the zero_copy_only keyword is ignored whenever the conversion goes through dtype.__from_arrow__ (the same holds for other options), so this is not really about whether pandas 3.0 still makes a copy, just about using an ExtensionDtype.
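
A toy model may help make this point concrete. It is not pyarrow's implementation: `ToyStringDtype`, `legacy_convert`, and `to_pandas_like` are hypothetical names, and the only behavior borrowed from the discussion is that options such as zero_copy_only are not forwarded once the conversion is delegated to the dtype's __from_arrow__ method.

```python
class ToyStringDtype:
    """Stand-in for a pandas ExtensionDtype with a __from_arrow__ hook."""

    def __from_arrow__(self, values):
        # The hook receives only the data; it has no zero_copy_only parameter.
        return list(values)


def legacy_convert(values, zero_copy_only=False):
    """Stand-in for the default path, which honors zero_copy_only."""
    if zero_copy_only and any(isinstance(v, str) for v in values):
        raise ValueError("strings need a copy on the legacy path")
    return list(values)


def to_pandas_like(values, dtype=None, zero_copy_only=False):
    """Dispatch to the dtype hook when one is given, else the legacy path."""
    if dtype is not None:
        # Delegated conversion: zero_copy_only is silently dropped here.
        return dtype.__from_arrow__(values)
    return legacy_convert(values, zero_copy_only=zero_copy_only)


data = ["A", "B", "C"]
# With a dtype, the option has no effect and nothing is raised.
print(to_pandas_like(data, dtype=ToyStringDtype(), zero_copy_only=True))
# Without a dtype, the legacy path enforces the option.
try:
    to_pandas_like(data, zero_copy_only=True)
except ValueError as exc:
    print("legacy path raised:", exc)
```

In this model the test failure above follows directly: once a dtype handles the conversion, nothing checks zero_copy_only, so no exception is raised regardless of whether a copy happens.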

Member Author

Oh yes, I see. Should this be changed when dealing with Extension types? I know we have a list of things to work on when it comes to this topic and we can open up an umbrella issue with all possible improvements.

Member

I am not sure how to easily improve this .. (since we defer to pandas for the conversion, and that method we call does not have those keywords)

(long term I would like to see this logic to be moved entirely to pandas)

Comment thread python/pyarrow/array.pxi Outdated
@github-actions github-actions Bot added awaiting merge and removed awaiting change review labels Apr 1, 2026
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
@github-actions github-actions Bot added awaiting changes and removed awaiting merge labels Apr 1, 2026
Comment thread python/pyarrow/tests/test_pandas.py
Comment thread python/pyarrow/tests/test_pandas.py Outdated
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@github-actions github-actions Bot added awaiting change review, awaiting changes and removed awaiting changes, awaiting change review labels Apr 1, 2026
@github-actions github-actions Bot added awaiting merge and removed awaiting changes labels Apr 1, 2026
@AlenkaF AlenkaF merged commit 008e082 into apache:main Apr 1, 2026
20 checks passed
@AlenkaF AlenkaF removed the awaiting merge label Apr 1, 2026
@AlenkaF AlenkaF deleted the gh-49002-pandas-string-to_pandas-empty branch April 1, 2026 14:01
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 008e082.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 10 possible false positives for unstable benchmarks that are known to sometimes produce them.

thisisnic pushed a commit to thisisnic/arrow that referenced this pull request Apr 6, 2026
…or arrays with None (apache#49247)

### Rationale for this change

Converting a string-typed array to a pandas Series was still taking the old code path, even with pandas 3.0, when the array contained only a `None` element.

### What changes are included in this PR?

Always check `dtype` in the `_array_like_to_pandas` conversion and use pandas' new default string `dtype` if available.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
No, only bug fix.
* GitHub Issue: apache#49002

Lead-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
Mottl pushed a commit to Mottl/arrow that referenced this pull request May 26, 2026