GH-49002: [Python] Fix array.to_pandas string type conversion for arrays with None by AlenkaF · Pull Request #49247 · apache/arrow

AlenkaF · 2026-02-11T20:29:50Z

Rationale for this change

Converting an array with string type to a pandas Series still took the old code path under pandas 3.0 when the array contained only a None element.

What changes are included in this PR?

Always check dtype in the _array_like_to_pandas conversion and use pandas' new default string dtype if available.

Are these changes tested?

Yes.

Are there any user-facing changes?

No, only bug fix.

GitHub Issue: [Python] Wrong cast from StringArray to pandas 3 when element is None #49002

raulcd

Thanks @AlenkaF, this looks good to me. It seems to be what @jorisvandenbossche proposed in the issue, and it is in line with what we have here:

arrow/python/pyarrow/pandas_compat.py

Lines 936 to 944 in d2315fe

    
    # for pandas 3.0+, use pandas' new default string dtype
    if _pandas_api.uses_string_dtype() and not strings_to_categorical:
        for field in table.schema:
            if field.name not in ext_columns and (
                pa.types.is_string(field.type)
                or pa.types.is_large_string(field.type)
                or pa.types.is_string_view(field.type)
            ) and field.name not in categories:
                ext_columns[field.name] = _pandas_api.pd.StringDtype(na_value=np.nan)

I'll wait until the end of the day in case @jorisvandenbossche has time to take a look; otherwise I'll merge.

raulcd

Oops, sorry, I've just realised there are several related test failures that we should fix. Should have checked CI before :)

AlenkaF · 2026-02-12T11:59:08Z

Thanks for the quick reviews! Will go through the comments now - was going through all the tests I just broke =) Should have put the PR back to draft, will do so next time.

AlenkaF · 2026-03-25T15:49:50Z

Ok, this should be ready now. cc @raulcd @jorisvandenbossche for another round of review.

raulcd

@AlenkaF I am not super familiar with this, but in general it looks good to me. A minor nit about a typo in a comment.

raulcd · 2026-04-01T07:50:08Z

    def test_zero_copy_failure_on_object_types(self):
-        self.check_zero_copy_failure(pa.array(['A', 'B', 'C']))
+        if Version(pd.__version__) < Version("3.0.0"):
+            # pandas 3.0 includes default string dtype support


~~I am not sure I understand why this test has to have this guard now. Isn't it supposed to work with pandas > 3.0.0?~~
I suppose this is because we are testing object types specifically. Was this test failing on CI? I haven't seen the failure.

This should be connected to the change I made in this PR as strings are not converted to pandas object anymore. But looking at the test it might be a leftover from my previous wrong approach. Thanks for the comment, I need to check this!

OK, got it. This test checks that strings cannot be zero-copied to pandas. That has been true in the past, as the C++ machinery constructed an object type from the PyArrow string type. Now, with pandas 3.0.0, we can go through __from_arrow__, where no copies are needed.

Running this test locally with pandas 3.0.0 gives the following error:

    ______________________________________________ TestZeroCopyConversion.test_zero_copy_failure_on_object_types _______________________________________________

    self = <pyarrow.tests.test_pandas.TestZeroCopyConversion object at 0x156a0af90>

        def test_zero_copy_failure_on_object_types(self):
    >       self.check_zero_copy_failure(pa.array(['A', 'B', 'C']))

    python/pyarrow/tests/test_pandas.py:2978:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    self = <pyarrow.tests.test_pandas.TestZeroCopyConversion object at 0x156a0af90>
    arr = <pyarrow.lib.StringArray object at 0x15699b700>
    [
      "A",
      "B",
      "C"
    ]

        def check_zero_copy_failure(self, arr):
    >       with pytest.raises(pa.ArrowInvalid):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    E       Failed: DID NOT RAISE <class 'pyarrow.lib.ArrowInvalid'>

    python/pyarrow/tests/test_pandas.py:2974: Failed

Yes, from what I can see this is an expected change, since string conversion will now actually be zero-copy.

(Although, strictly speaking, it is not entirely zero-copy, because the test here is using string, and pandas will convert that to large_string. But I suppose that happens outside the view of pyarrow.)

Essentially, the zero_copy_only keyword is ignored whenever the conversion goes through dtype.__from_arrow__ (the same holds for other options), so this is not really about whether a copy is made in pandas 3.0, only about an ExtensionDtype being used.

Oh yes, I see. Should this be changed when dealing with Extension types? I know we have a list of things to work on when it comes to this topic and we can open up an umbrella issue with all possible improvements.

I am not sure how to easily improve this (since we defer to pandas for the conversion, and the method we call does not have those keywords).

(long term I would like to see this logic to be moved entirely to pandas)

Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

conbench-apache-arrow · 2026-04-01T21:44:23Z

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 008e082.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 10 possible false positives for unstable benchmarks that are known to sometimes produce them.

…or arrays with None (apache#49247)

### Rationale for this change

The conversion from array with string type to pandas series, when array only has a `None` element, has been taking the old code path even with pandas 3.0.

### What changes are included in this PR?

Always check `dtype` in the `_array_like_to_pandas` conversion and use pandas new default string `dtype` if available.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No, only bug fix.

* GitHub Issue: apache#49002

Lead-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

AlenkaF requested review from raulcd and rok as code owners February 11, 2026 20:29

github-actions Bot added the awaiting review and Component: Python labels Feb 11, 2026

raulcd approved these changes Feb 12, 2026

github-actions Bot removed the awaiting review label Feb 12, 2026

raulcd self-requested a review February 12, 2026 08:35

github-actions Bot added the awaiting merge label Feb 12, 2026

raulcd reviewed Feb 12, 2026

raulcd self-requested a review February 12, 2026 08:36

github-actions Bot added the awaiting changes label and removed the awaiting merge label Feb 12, 2026

jorisvandenbossche reviewed Feb 12, 2026

Comment thread python/pyarrow/array.pxi Outdated

jorisvandenbossche reviewed Feb 12, 2026

Comment thread python/pyarrow/array.pxi Outdated

AlenkaF marked this pull request as draft February 12, 2026 12:05

AlenkaF added 2 commits March 25, 2026 15:35

Initial commit

92966b4

Add suggestions, fix typo

fec9077

AlenkaF force-pushed the gh-49002-pandas-string-to_pandas-empty branch from 6f1fda5 to fec9077 Compare March 25, 2026 14:36

github-actions Bot added the awaiting change review label and removed the awaiting changes label Mar 25, 2026

AlenkaF marked this pull request as ready for review March 25, 2026 15:49

raulcd approved these changes Apr 1, 2026

github-actions Bot added the awaiting merge label and removed the awaiting change review label Apr 1, 2026

Update python/pyarrow/array.pxi

600c4e4

Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>

github-actions Bot added the awaiting changes label and removed the awaiting merge label Apr 1, 2026

jorisvandenbossche reviewed Apr 1, 2026

Comment thread python/pyarrow/tests/test_pandas.py

jorisvandenbossche reviewed Apr 1, 2026

Comment thread python/pyarrow/tests/test_pandas.py Outdated

Apply suggestion from @jorisvandenbossche

3f6af35

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

github-actions Bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels Apr 1, 2026

jorisvandenbossche approved these changes Apr 1, 2026

github-actions Bot added the awaiting merge label and removed the awaiting changes label Apr 1, 2026

AlenkaF merged commit 008e082 into apache:main Apr 1, 2026
20 checks passed

AlenkaF removed the awaiting merge label Apr 1, 2026

AlenkaF deleted the gh-49002-pandas-string-to_pandas-empty branch April 1, 2026 14:01

AlenkaF mentioned this pull request Apr 1, 2026

[Python] Wrong cast from StringArray to pandas 3 when element is None #49002

Closed

