-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: added regex argument to Series.str.split #44185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @saehuihwang! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-11-02 23:23:40 UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple of quick comments @saehuihwang
pandas/core/strings/accessor.py
Outdated
@@ -657,7 +657,7 @@ def cat(self, others=None, sep=None, na_rep=None, join="left"): | |||
|
|||
Parameters | |||
---------- | |||
pat : str, optional | |||
pat : str, or compiled regex optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pat : str, or compiled regex optional | |
pat : str or compiled regex, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this! change has been made
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@attack68 sorry, I had missed the first comma. change has now been made according to your suggestion. Sorry for missing it earlier
pandas/core/strings/accessor.py
Outdated
>>> s = pd.Series(['fooojpgbar.jpg']) | ||
>>> s.str.split(r".", expand=True) | ||
0 1 | ||
0 fooojpgbar jpg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a fan of this new example, seems confusing, and not sure what difference it is showing. Can it be simplified to better demonstrate the new features? I quite like the original actually, although text based like:
s = pd.Series(["foo and bar plus baz"])
s.str.split(r"and|plus", expand=True)
0 1 2
0 foo bar baz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @attack68, thanks for reviewing my pull request!
I assumed that a lot of people use st.split to handle urls and file names, so I wanted to include a similar example that includes a ".xxx" at the end. The new set of examples also helps illustrate "." be used as a regex and as a regular string depending on the regex
flag.
What do you think? I can go ahead and include the original examples but also include a few more examples that illustrate the new feature?
@@ -34,6 +34,27 @@ def test_split(any_string_dtype): | |||
exp = Series([["a", "b", "c"], ["c", "d", "e"], np.nan, ["f", "g", "h"]]) | |||
tm.assert_series_equal(result, exp) | |||
|
|||
# explicit regex = True split | |||
values = Series("qweqwejpgqweqwe.jpg", dtype=any_string_dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a less confusing test to work with?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the test from the original issue #43563 so I thought that it might be best to include the same string. Do you think it's better to use a simpler one, perhaps the ones I put in the examples section of the doc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you have tests from all the issues you are closing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the test from the original issue #43563 so I thought that it might be best to include the same string. Do you think it's better to use a simpler one, perhaps the ones I put in the examples section of the doc?
My philosophy is that you can use either the same test, or modified slightly or a better test depending upon whether it suits the purpose. Whilst a subjective opinion, the text qweqwejpgqweqwe.jpg
is obfuscatingly confusing and a modification to this would make the test more readable yet still applicable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! It was a bit of a struggle to come up with a good test string haha. The best I came up with was "xxxjpgzzz.jpg". Let me know if you have better ideas
pandas/core/strings/accessor.py
Outdated
@@ -669,6 +669,16 @@ def cat(self, others=None, sep=None, na_rep=None, join="left"): | |||
* If ``True``, return DataFrame/MultiIndex expanding dimensionality. | |||
* If ``False``, return Series/Index, containing lists of strings. | |||
|
|||
regex : bool, default None | |||
Determines whether to handle the pattern as a regular expression. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a versionadded 1.4 tag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay!
@@ -34,6 +34,27 @@ def test_split(any_string_dtype): | |||
exp = Series([["a", "b", "c"], ["c", "d", "e"], np.nan, ["f", "g", "h"]]) | |||
tm.assert_series_equal(result, exp) | |||
|
|||
# explicit regex = True split | |||
values = Series("qweqwejpgqweqwe.jpg", dtype=any_string_dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you have tests from all the issues you are closing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. if you can do that small refactor in this PR would be great.
cc @simonjayhawkins @Dr-Irv if any comments
n=-1, | ||
expand=False, | ||
regex: bool | None = None, | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be nice to factor this logic of pattern stuff into a common function to share with str_replace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas/pandas/core/strings/object_array.py
Lines 152 to 160 in b0992ee
if case is False: | |
# add case flag, if provided | |
flags |= re.IGNORECASE | |
if regex or flags or callable(repl): | |
if not isinstance(pat, re.Pattern): | |
if regex is False: | |
pat = re.escape(pat) | |
pat = re.compile(pat, flags=flags) |
Are you referring to this part? Unfortunately, str_replace and str_split handle arguments quite differently. str_replace has two additional arguments case
and flags
, while str_split does not. str_split handles the case when regex=None
by using the weird logic with len(pat)
, while str_replace does not.
In a set of another PRs, I can do the following to str_split
-
deprecation of value dependent regex determination
currently, whenregex==None
,len(pat)==1
is handled as a literal string andlen(pat)!=1
is handled as a regex. -
addition of
case
andflag
arguments -
common handling of logic with str_replace
But for now, I think str_replace and str_split are not similar enough to share a common logic handling function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while str_split does not. str_split handles the case when regex=None by using the weird logic with len(pat),
then shouldn't we ad case
and flags
and make these consistent? what do we need to deprecate here?
yes there is weird logic by using len(pat)
but this is new logic you are adding no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not add the len(pat) logic - it has always been there. I purposely didn't cut it out to maintain current behavior. I think that we should remove it in the future.
pandas/pandas/core/strings/object_array.py
Lines 317 to 325 in 2fa2d5c
if len(pat) == 1: | |
if n is None or n == 0: | |
n = -1 | |
f = lambda x: x.split(pat, n) | |
else: | |
if n is None or n == -1: | |
n = 0 | |
regex = re.compile(pat) | |
f = lambda x: regex.split(x, maxsplit=n) |
In this PR, I am simply adding the regex flag, as requested by several issues. I can go ahead and add the
case
and flags
if you think that is a good idea.Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw that you didn't change the current len(pat)
logic, which I don't like either. Maybe it is better if we are adding an explicit regex
argument to permit only True, False, and remove the len(pat)
logic, which only confuses things.
Of course need a release note and a versionchanged note.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to get rid of the len(pat) logic and thus the regex=None
option, but I didn't want to introduce a breaking change.
If you guys want me to go ahead, what should the regex
default be? I don't want to break anyone's existing code. For example, if we make regex
default to True
, it will break someone's code splitting on a period (pat="."
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback what do you think on breaking change here. I would be inclined to set the default to True
, since it uses Regex in every case where the pattern length is is greater than 1, and in most cases where the pattern length is 1, then regex would give the same result as string anyway, so it might only break in a few cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I agree that removing this logic would be great, I'd be worried about changing the default to True
without a deprecation. While most cases won't be affected like you mention, I'd imagine many users (who may not even know what a regex is) often split on punctuation like "." or "-". Many of the issues referenced here (or others for replace
) stem from the user not realizing that the pattern is being treated as regex - I'd guess we'd see a bunch more of those if the default is changed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this is a new method we could change things, but yeah maybe this is too much for now.
ok what i think we should do is this.
factor to a common method and use it here. when we deprecate this it will deprecate in both places (yes even though this is a new method it is fine). its at least consistent.
cc @attack68 if any comments |
lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM.
Before merging I think it would be worth just revisiting if we want to get rid of the pattern length 1 logic - its pretty ropey, and there is a good opportunity here. |
Yes, once we have a Some issues refer to consistency with the Firstly, So that would require that the default is to always treat Secondly, add the functionality of re.split to the pandas method without affecting the above. There are two ways to use the split capability of the Python So I am happy that a By accepting a compiled regex we are now implicitly adding the Adding And also adding a I think these points are covered by https://github.com/pandas-dev/pandas/pull/44185/files#r739415484 and can be done as a follow-up. @attack68 anything else preventing merging this Thanks @saehuihwang for the PR. |
no , think thats a great plan and lgtm |
… passed in compiled regex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
thanks @saehuihwang very nice. would like to refactor the inner code if possible to have all of the regex handling in one place if you are interested. |
str.split
#32835I've preserved current behavior, in which regex = None. Currently, it handles the pattern as a regex if the length of pattern is not 1. I believe that in the future, this may be worth considering deprecating.