-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TST(string dtype): Resolve xfail in groupby.test_size #60711
TST(string dtype): Resolve xfail in groupby.test_size #60711
Conversation
expected = Series( | ||
[2, 1], | ||
index=Index(["a", "b"], name="a", dtype=dtype), | ||
index=Index(["a", "b"], name="a", dtype=exp_index_dtype), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why doesn't it work if you just remove the dtype
argument and let the constructor infer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhshadrach the Int64 is for exp_dtype
on the line below, not for the dtype of the Index being constructed on this line, so I am not entirely understanding your comment/question ?
(the construction of exp_dtype
is not being touched in this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thanks! When the grouping column is StringDtype, we preserve this even when infer_string is False. In the groupby code, the uniques that go into creating the index is a string array. When the input is object dtype and infer_string=True
, we do inference on the values and coerce to dtype str
.
So in the object case we're doing inference, whereas in the non-object case we are not. It seems reasonable to me, thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK thanks for the explanation. I am not sure how I feel yet, but at first I wasn't expecting the action of grouping to perform any inference. Is that not a performance hit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, if this is not easy to "fix" (avoid the inference), then I am personally fine with the current behaviour for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great - I think we are all leaning in that direction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The line in question is
pandas/pandas/core/groupby/ops.py
Line 755 in e3b2de8
levels = [Index._with_infer(ping.uniques) for ping in self.groupings] |
While it's been moved around recently, I believe that's long standing behavior. I can investigate what impact removing that (so just using standard Index
init) would have, but seems independent.
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the If these instructions are inaccurate, feel free to suggest an improvement. |
Manual backport in #60782 |
#60782) Backport PR #60711: TST(string dtype): Resolve xfail in groupby.test_size Co-authored-by: Richard Shadrach <[email protected]>
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.groupby does inference on the group labels across the board.
While I agree long-term I'd prefer to preserve object dytpe, I do not think we should be changing this at this point.