-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: fix str.replace('.','') should replace every character #24935
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @charlesdong1991! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-06-13 00:02:24 UTC |
Codecov Report
@@ Coverage Diff @@
## master #24935 +/- ##
===========================================
- Coverage 92.38% 42.89% -49.49%
===========================================
Files 166 166
Lines 52388 52392 +4
===========================================
- Hits 48397 22472 -25925
- Misses 3991 29920 +25929
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #24935 +/- ##
=========================================
Coverage ? 41.12%
=========================================
Files ? 179
Lines ? 50709
Branches ? 0
=========================================
Hits ? 20854
Misses ? 29855
Partials ? 0
Continue to review full report at Codecov.
|
8b8c210
to
4b9672a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are no tests for the deprecation
@@ -114,10 +114,8 @@ Conversion | |||
|
|||
Strings | |||
^^^^^^^ | |||
- Bug in :func:`Series.str.replace` not applying regex in patterns of length 1 (:issue:`24804`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can leave this here, but need a note on the Deprecation warning change.
@@ -456,9 +456,10 @@ def str_replace(arr, pat, repl, n=-1, case=None, flags=0, regex=True): | |||
flags : int, default 0 (no flags) | |||
- re module flags, e.g. re.IGNORECASE | |||
- Cannot be set if `pat` is a compiled regex | |||
regex : boolean, default True | |||
regex : boolean, default None | |||
- If True, assumes the passed-in pattern is a regular expression. | |||
- If False, treats the pattern as a literal string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a disconcerting change. You are essentially having a different default on what is being passed here. I would be ok with forcing the passing of regex
always, possibly in preparation of changing the default to regex=False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I understand that @jreback it's a quite disruptive change. This change is following the proposal from @TomAugspurger in #24809 (comment) . The purpose is to make this argument more explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be ok with forcing the passing of regex always, possibly in preparation of changing the default to regex=False
@jreback can you clarify what you mean by "forcing"? If we raise when regex
is not specified, that would be an API breaking change.
My proposal is the preserve the previous behavior, but warn when a length-1 regex is detected. Then we can get to the documented behavior of regex=True
.
pandas/core/strings.py
Outdated
- If True, assumes the passed-in pattern is a regular expression. | ||
- If False, treats the pattern as a literal string | ||
- If pat is single special character, default regex is False. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should mention something about a warning being issued when regex is not specified. Something like
- If `pat` is a single character and `regex` is not specified, `pat` is interpreted as a
string literal. If `pat` is also a regular expression symbol, a warning is issued that
in the future `pat` will be interpreted as a regex, rather than a literal.
pandas/core/strings.py
Outdated
@@ -577,6 +578,12 @@ def str_replace(arr, pat, repl, n=-1, case=None, flags=0, regex=True): | |||
if callable(repl): | |||
raise ValueError("Cannot use a callable replacement when " | |||
"regex=False") | |||
# if regex is default None, and a single special character is given | |||
# in pat, still take it as a literal, and raise the Future warning | |||
if regex is None and len(pat) == 1 and re.findall(r"\W", pat): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the '\W'
here. I think that's not quite right for non-word characters that are not regex symbols. Something like @
Does python have a list of regex symbols?
Is the heuristic re.escape(pat) == pat
good enough?
In [46]: re.escape('@')
Out[46]: '@'
In [47]: re.escape('^')
Out[47]: '\\^'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger oh, indeed it's wrong! nice catch!! thanks for the correction!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test to ensure that something like Series.str.replace('@', 'at')
doesn't produce a warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure! @TomAugspurger
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ehh, @TomAugspurger escape somehow changed... wired
i did get same output the first time i tested on my jupyter notebook:
re.escape("@") -> "@"
but now, i cannot replicate this and get re.escape("@") -> "\\@"
...
so i will go for this option:
pat in list("[\^$.|?*+()]")
doc/source/whatsnew/v0.25.0.rst
Outdated
- | ||
- | ||
- | ||
- :func:`Series.str.replace`, the default regex for single character might be deprecated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will need to be updated.
any follow-up code change requests and reviews? @jreback @TomAugspurger |
5916bc6
to
f778bde
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather simply go whole hog and just change regex=None
as a warning for all, meaning you must change to an explicit True/False. It makes it more clear. But this would then be a rather large change.
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -42,6 +42,7 @@ Other API Changes | |||
Deprecations | |||
~~~~~~~~~~~~ | |||
|
|||
- :func:`Series.str.replace`, when pat is single special regex character and regex is not defined, regex is by default False for now, but this might be deprecated in the future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs an issue reference; use double backticks on False
, use single backticks on keywordds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks changed!
6fa8c0f
to
b0dac6c
Compare
yes, indeed will be a kind of api change. shall I open another issue to make this change if other maintainers agree with this? Or do you want to do this change in this PR? @jreback |
7e55526
to
2c9f4fb
Compare
So as an alternate what if we simply detected the case where the pattern provided is a single / special regex character with |
@WillAyd based on your description, i think it's kind of what this PR is doing, right? |
I'm saying don't even update the signature of the function, simply do something like: if regex=True and pater in re_char_blacklist:
raise FutureWarning("Single character patterns will be evaluated as regexes in the future") Or something to the effect. Would minimize a lot of the volatility going on here |
@WillAyd the I (think?) then there is no way to get the desired future behavior without warning. IIUC, your prosposal would warn on pd.Series(['aa']).str.replace('.', 'b', regex=True) I think we'll need a period of
|
Yea sorry meant warn instead of raise. Why do we need regex=None at all? I was hoping to avoid that API change |
Does the rest of |
Hmm unless I'm mistaken my proposal would only affect option two. So something like: if regex=True and pater in {'.', '^', '$', '*', '+', '?', '}'}:
warnings.warn("You've passed a single special character in with regex=True which is evaluated as a single character. This will change in the future to match regex behavior") Warns the user about this behavior in the future but doesn't impact their current code, allowing for a smoother deprecation in the future. To your point though the result of option 2 would still stay 'aa' as is today just with the warning |
Right, so I think we understand each other.
My argument is that we shouldn't add a warning without a concrete action a
user can take to avoid it. Else it'll just annoy them :)
…On Thu, Feb 28, 2019 at 11:34 AM William Ayd ***@***.***> wrote:
Hmm unless I'm mistaken my proposal would only affect option two. So
something like:
if regex=True and pater in {'.', '^', '$', '*', '+', '?', '}'}:
warnings.warn("You've passed a single special character in with regex=True which is evaluated as a single character. This will change in the future to match regex behavior")
Warns the user about this behavior in the future but doesn't impact their
current code, allowing for a smoother deprecation in the future.
To your point though the result of option 2 would still stay 'aa' as is
today just with the warning
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#24935 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIhcVxvePpsXFK79RPv9m5d1NITnsks5vSBMugaJpZM4aTbKn>
.
|
Ah I gotcha. So in my mind if a user is passing a single special character and regex=True (whether explicitly or via the default argument) then it is a current "misuse" of that parameter. If the warning said something like "explicitly pass regex=False to maintain current behavior" would that be better? |
No worries. I'd recommend waiting, I don't think a decision has been made
yet.
…On Thu, Feb 28, 2019 at 2:07 PM Kaiqi Dong ***@***.***> wrote:
Many thanks for your feedback and discussions @TomAugspurger
<https://github.com/TomAugspurger> @WillAyd <https://github.com/WillAyd>
My English is not very good, and i might get distracted by your
discussions, could you give me any hint on what kind of change I need to
make (or give a general direction to make me a bit clearer)? as far as i
understand from your discussions, seems warning messages should be changed
a bit? @WillAyd <https://github.com/WillAyd> @TomAugspurger
<https://github.com/TomAugspurger> many thanks and sorry for asking this.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#24935 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIuj6ifkRxKYOEfYoCbJ-HchcPieaks5vSDbvgaJpZM4aTbKn>
.
|
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -43,6 +43,7 @@ Other API Changes | |||
Deprecations | |||
~~~~~~~~~~~~ | |||
|
|||
- :func:`Series.str.replace`, when pat is single special regex character and regex is not defined, regex is by default ``False`` for now, but this might be deprecated in the future. (:issue:`24804`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use double back-ticks around pat
.
..is single special regex character'.. (such as .|\
etc).
@@ -456,9 +456,13 @@ def str_replace(arr, pat, repl, n=-1, case=None, flags=0, regex=True): | |||
flags : int, default 0 (no flags) | |||
- re module flags, e.g. re.IGNORECASE | |||
- Cannot be set if `pat` is a compiled regex | |||
regex : bool, default True | |||
regex : boolean, default None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default is still True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, you mean we now dont raise a warning anymore when char is a special single character while user doesn't explicitly specify regex
?
if set default to True
, and i use @TomAugspurger example:
pd.Series(['aa']).str.replace('.', 'b')
we will get:
default
or explicitlyregex=True
, output isbb
, no warningregex=False
, output isaa
, no warning.
is it correct? @WillAyd @TomAugspurger @jreback
2c9f4fb
to
1f97148
Compare
@WillAyd can you summarize your aversion to the (typical) use of None as a backwards compatible sentinel value for when we should warn? |
Not terribly against it if that’s our process
…Sent from my iPhone
On Mar 3, 2019, at 5:08 AM, Tom Augspurger ***@***.***> wrote:
@WillAyd can you summarize your aversion to the (typical) use of None as a backwards compatible sentinel value for when we should warn?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
can you merge master |
Hi, @jreback I just merged master and fix conflict! |
something still failing |
374b1c7
to
6a643ec
Compare
@jreback i just merged the master and fix test error |
This one has been around for a while. Just so we are all on the same page - is the intention here definitely to force users down the road to explicitly provide regex=True or regex=False? I know that was mentioned in comments above just not sure we agreed to it. An easy alternative would just be to raise a DeprecationWarning when a single special character gets passed and actually change behavior at a later date, which I think is what @TomAugspurger suggests in #24804 (comment) |
I don’t see a reason to make the regex flag required.
… On Apr 9, 2019, at 23:54, William Ayd ***@***.***> wrote:
This one has been around for a while. Just so we are all on the same page - is the intention here definitely to force users down the road to explicitly provide regex=True or regex=False? I know that was mentioned in comments above just not sure we agreed to it.
An easy alternative would just be to raise a DeprecationWarning when a single special character gets passed and actually change behavior at a later date, which I think is what @TomAugspurger suggests in #24804 (comment)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
ok, re-reading: #24809 (comment) that looks like a good idea, not sure where that is in implementation. |
@jreback thanks for your comment! could you explain a little bit more? I think the change I made in this PR was following @TomAugspurger suggestion in 24809 |
@charlesdong1991 can you merge master and we'll see where this is |
876c93a
to
0f172d8
Compare
@jreback I merged to master and resolved conflicts, pls take a look |
55efcd9
to
aaf9be3
Compare
I just reread through the issue, and my thinking has changed a bit. Do we think that If we think that most users will expect |
Definitely inconsistency IMO. I don't see that particular operation being useful but I don't think we should be special casing either, as our general handling of regexes in str replacement ops is all over the place |
What other inconsistencies do you see?
FWIW, I perfectly fine with documenting that length-1 `pat` aren't
auto-converted to regex, even when `regex` is true.
I think the state I want to avoid is surprising future users with `.`
replacing every character, just because they aren't familiar with regex.
So either we document that length-1 characters are special cased, or
deprecate `regex=True`. I don't have a good feeling for which is better.
…On Thu, Jun 13, 2019 at 3:51 PM William Ayd ***@***.***> wrote:
Or is the change primarily motivated by the inconsistency
Definitely inconsistency IMO. I don't see that particular operation being
useful but I don't think we should be special casing either, as our general
handling of regexes in str replacement ops is all over the place
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#24935?email_source=notifications&email_token=AAKAOIRDAVDBAIYQXJOCRPLP2KXNDA5CNFSM4GSNWKT2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXU727Y#issuecomment-501874047>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAKAOISY433NZDDETRU4BRTP2KXNDANCNFSM4GSNWKTQ>
.
|
I wouldn't be opposed to really doing nothing on this atm. The behavior can definitely be surprising as noted by the OP of the original issue, but to your point I don't see why the strictly standard behavior is useful either. May be worth just punting this for a more comprehensive alignment of the regex keywords across string ops if it makes sense to tackle then |
By "doing nothing" do you include "document the current behavior of special
casing length-1 patterns"? If so, then I agree that's best for now :)
…On Thu, Jun 13, 2019 at 5:17 PM William Ayd ***@***.***> wrote:
I wouldn't be opposed to really doing nothing on this atm. The behavior
can definitely be surprising as noted by the OP of the original issue, but
to your point I don't see why the strictly standard behavior is useful
either.
May be worth just punting this for a more comprehensive alignment of the
regex keywords across string ops if it makes sense to tackle then
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#24935?email_source=notifications&email_token=AAKAOISFQDIHWOVNJW3YD3LP2LBPNA5CNFSM4GSNWKT2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXVF47I#issuecomment-501898877>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAKAOIW6JF43GV6CMPMKA3LP2LBPNANCNFSM4GSNWKTQ>
.
|
Agreed - @charlesdong1991 make sense to you? |
ok, it does. |
git diff upstream/master -u -- "*.py" | flake8 --diff