TST(string dtype): Resolve xfail in groupby.test_size #60711

Merged: 1 commit, Jan 24, 2025

rhshadrach
Member

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

groupby does inference on the group labels across the board.

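import pandas as pd

# Integer group labels stored as object dtype
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]}, dtype="object")
gb = df.groupby("a")
result = gb.sum()
# The resulting index is inferred from the labels rather than kept as object
print(result.index.dtype)
# int64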

While I agree that, long-term, I'd prefer to preserve object dtype, I do not think we should be changing this at this point.

@rhshadrach added the Testing (pandas testing functions or related to the test suite), Groupby, and Strings (String extension data type and string data) labels Jan 13, 2025
@rhshadrach added this to the 2.3 milestone Jan 13, 2025
  expected = Series(
      [2, 1],
-     index=Index(["a", "b"], name="a", dtype=dtype),
+     index=Index(["a", "b"], name="a", dtype=exp_index_dtype),
Member

Why doesn't it work if you just remove the dtype argument and let the constructor infer?

Member Author

Good question - this was introduced in #55627, but I do not see why the result would be Int64 when the values are string[pyarrow].

cc @phofl

Member

@rhshadrach the Int64 is for exp_dtype on the line below, not for the dtype of the Index being constructed on this line, so I do not entirely understand your comment/question?
(The construction of exp_dtype is not being touched in this PR.)

Member Author

@rhshadrach rhshadrach Jan 22, 2025

Ah, thanks! When the grouping column is StringDtype, we preserve this even when infer_string is False. In the groupby code, the uniques that go into creating the index are a string array. When the input is object dtype and infer_string=True, we do inference on the values and coerce to dtype str.

So in the object case we're doing inference, whereas in the non-object case we are not. It seems reasonable to me, thoughts?
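To make the two cases concrete, here is a minimal sketch, assuming a recent pandas version. The frames and column values are made up for illustration and are not taken from the test in this PR. The object-dtype result depends on whether the `future.infer_string` option is enabled, so the snippet only prints it.

```python
import pandas as pd

# Illustrative data only: the same labels stored two different ways.
# "string" is the alias for pandas' StringDtype extension dtype.
df_str = pd.DataFrame(
    {"a": pd.array(["a", "a", "b"], dtype="string"), "b": [1, 2, 3]}
)
df_obj = pd.DataFrame(
    {"a": pd.Series(["a", "a", "b"], dtype=object), "b": [1, 2, 3]}
)

size_str = df_str.groupby("a").size()
size_obj = df_obj.groupby("a").size()

# StringDtype grouping column: the result index keeps the string dtype.
print(size_str.index.dtype)

# Object grouping column: the labels go through inference, so the index
# dtype depends on the infer_string setting of the running pandas.
print(size_obj.index.dtype)
```

Both results hold the group sizes 2 and 1; only the index dtype differs between the two paths.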

Member

Ah OK thanks for the explanation. I am not sure how I feel yet, but at first I wasn't expecting the action of grouping to perform any inference. Is that not a performance hit?

Member

FWIW, if this is not easy to "fix" (avoid the inference), then I am personally fine with the current behaviour for now

Member

Great - I think we are all leaning in that direction

Member Author

The line in question is

levels = [Index._with_infer(ping.uniques) for ping in self.groupings]

While it's been moved around recently, I believe that's long-standing behavior. I can investigate what impact removing that inference (so just using the standard Index init) would have, but that seems independent of this PR.
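For readers unfamiliar with the helper, the difference between the two constructors can be sketched as follows. This is an illustration under assumptions: `Index._with_infer` is a private pandas method (quoted in the line above) that may change without notice, and the `uniques` array here is a made-up stand-in for `ping.uniques`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for group uniques taken from an object-dtype column.
uniques = np.array([1, 2], dtype=object)

# Standard constructor: on recent pandas this should not infer a numeric
# dtype from an object-dtype ndarray.
plain = pd.Index(uniques)

# Private helper used by groupby: runs inference on object-dtype values.
inferred = pd.Index._with_infer(uniques)

print(plain.dtype)
print(inferred.dtype)
```

The second dtype lines up with the `int64` index shown in the PR description, which is why swapping in the standard constructor would be a behavior change rather than a pure refactor.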

@rhshadrach rhshadrach marked this pull request as draft January 14, 2025 02:38
@rhshadrach rhshadrach marked this pull request as ready for review January 22, 2025 02:08
@WillAyd WillAyd merged commit 354b61f into pandas-dev:main Jan 24, 2025
57 checks passed
lumberbot-app bot commented Jan 24, 2025

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
git checkout 2.3.x
git pull
  2. Cherry pick the first parent branch of this PR on top of the older branch:
git cherry-pick -x -m1 354b61f88bc0523d4bb9f3cfe1d6c12f9a3d6567
  3. You will likely have some merge/cherry-pick conflict here, fix them and commit:
git commit -am 'Backport PR #60711: TST(string dtype): Resolve xfail in groupby.test_size'
  4. Push to a named branch:
git push YOURFORK 2.3.x:auto-backport-of-pr-60711-on-2.3.x
  5. Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #60711 on branch 2.3.x (TST(string dtype): Resolve xfail in groupby.test_size)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Jan 24, 2025
Member

WillAyd commented Jan 24, 2025

Manual backport in #60782

jorisvandenbossche pushed a commit that referenced this pull request Jan 24, 2025
#60782)

Backport PR #60711: TST(string dtype): Resolve xfail in groupby.test_size

Co-authored-by: Richard Shadrach