GH-49002: [Python] Fix array.to_pandas string type conversion for arrays with None#49247
Conversation
raulcd
left a comment
There was a problem hiding this comment.
Thanks @AlenkaF this looks good to me. Seems to be what was proposed by @jorisvandenbossche in the issue and in-line to what we have here:
arrow/python/pyarrow/pandas_compat.py
Lines 936 to 944 in d2315fe
I'll wait until end of day in case @jorisvandenbossche has time to take a look otherwise I'll merge.
raulcd
left a comment
There was a problem hiding this comment.
Oops, sorry, I've just realised there are several test failures which are related and we should fix. Should have checked CI before :)
|
Thanks for quick reviews! Will go through the comments now - was going through all the tests I just broke =) Should have put the PR back to draft, will do so next time. |
6f1fda5 to
fec9077
Compare
|
Ok, this should be ready now. cc @raulcd @jorisvandenbossche for another round of review. |
| def test_zero_copy_failure_on_object_types(self): | ||
| self.check_zero_copy_failure(pa.array(['A', 'B', 'C'])) | ||
| if Version(pd.__version__) < Version("3.0.0"): | ||
| # pandas 3.0 includes default string dtype support |
There was a problem hiding this comment.
I am not sure I understand why this test has to have this guard now. Isn't it supposed to work with pandas > 3.0.0?
I suppose this is because we are testing object types specifically. Was this test failing on CI? I haven't seen the failure.
There was a problem hiding this comment.
This should be connected to the change I made in this PR as strings are not converted to pandas object anymore. But looking at the test it might be a leftover from my previous wrong approach. Thanks for the comment, I need to check this!
There was a problem hiding this comment.
OK, got it. This test checks that strings can not be zero copied to Pandas. Which has been true in the past as the C++ machinery constructed an object type from Pyarrow string type. Now, with pandas 3.0.0 we can move through __from_arrow__ where no copies are needed.
Running this test locally with pandas 3.0.0 gives following error:
______________________________________________ TestZeroCopyConversion.test_zero_copy_failure_on_object_types _______________________________________________
self = <pyarrow.tests.test_pandas.TestZeroCopyConversion object at 0x156a0af90>
def test_zero_copy_failure_on_object_types(self):
> self.check_zero_copy_failure(pa.array(['A', 'B', 'C']))
python/pyarrow/tests/test_pandas.py:2978:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <pyarrow.tests.test_pandas.TestZeroCopyConversion object at 0x156a0af90>
arr = <pyarrow.lib.StringArray object at 0x15699b700>
[
"A",
"B",
"C"
]
def check_zero_copy_failure(self, arr):
> with pytest.raises(pa.ArrowInvalid):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E Failed: DID NOT RAISE <class 'pyarrow.lib.ArrowInvalid'>
python/pyarrow/tests/test_pandas.py:2974: FailedThere was a problem hiding this comment.
Yes, from what I can see this is an expected change, since string conversion will now actually be zero copy
(although, strictly speaking, it is not actually zero-copy entirely, because the test here is using string, and pandas will convert that to large_string. But I suppose that happens outside the view of pyarrow)
There was a problem hiding this comment.
Essentially, the zero_copy_only keyword is ignored whenever the conversion goes through dtype.__from_arrow__ .. (same for other options), so it is not even about no longer making a copy or not in pandas 3.0, just about using an ExtensionDtype
There was a problem hiding this comment.
Oh yes, I see. Should this be changed when dealing with Extension types? I know we have a list of things to work on when it comes to this topic and we can open up an umbrella issue with all possible improvements.
There was a problem hiding this comment.
I am not sure how to easily improve this .. (since we defer to pandas for the conversion, and that method we call does not have those keywords)
(long term I would like to see this logic to be moved entirely to pandas)
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
|
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 008e082. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 10 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…or arrays with None (apache#49247) ### Rationale for this change The conversion from array with string type to pandas series, when array only has a `None` element, has been taking the old code path even with pandas 3.0. ### What changes are included in this PR? Always check `dtype` in the `_array_like_to_pandas ` conversion and use pandas new default string `dtype` if available. ### Are these changes tested? Yes. ### Are there any user-facing changes? No, only bug fix. * GitHub Issue: apache#49002 Lead-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com> Co-authored-by: AlenkaF <frim.alenka@gmail.com> Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: AlenkaF <frim.alenka@gmail.com>
…or arrays with None (apache#49247) ### Rationale for this change The conversion from array with string type to pandas series, when array only has a `None` element, has been taking the old code path even with pandas 3.0. ### What changes are included in this PR? Always check `dtype` in the `_array_like_to_pandas ` conversion and use pandas new default string `dtype` if available. ### Are these changes tested? Yes. ### Are there any user-facing changes? No, only bug fix. * GitHub Issue: apache#49002 Lead-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com> Co-authored-by: AlenkaF <frim.alenka@gmail.com> Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: AlenkaF <frim.alenka@gmail.com>
Rationale for this change
The conversion from array with string type to pandas series, when array only has a
Noneelement, has been taking the old code path even with pandas 3.0.What changes are included in this PR?
Always check
dtypein the_array_like_to_pandasconversion and use pandas new default stringdtypeif available.Are these changes tested?
Yes.
Are there any user-facing changes?
No, only bug fix.
None#49002