BUG: not correct work str.split #43563

Closed
1 of 3 tasks
ilgrad opened this issue Sep 14, 2021 · 6 comments · Fixed by #44185
Labels
API - Consistency (Internal Consistency of API/Behavior); Enhancement; Needs Discussion (Requires discussion from core team before further action); Strings (String extension data type and string data)

ilgrad commented Sep 14, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame(columns=['url'])
df['url'] = ['https://qweqwe.com/2021-09-14/qweqwejpgqweqwe.jpg']
df['id'] = df['url'].str.split('/').str[-2].astype(str) + '_' + df['url'].str.split('/').str[-1].str.split('.jpg').str[0]
print(df['id'].iloc[0])

Issue Description

Actual output: 2021-09-14_qweqw
Expected output: 2021-09-14_qweqwejpgqweqwe

Expected Behavior

2021-09-14_qweqwejpgqweqwe

Installed Versions

python 3.9
pandas version 1.3.2

ilgrad added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Sep 14, 2021
ilgrad changed the title from "BUG: not correct work split" to "BUG: not correct work str.split" on Sep 14, 2021
@debnathshoham
Member

I can confirm this on master, and I think it should work, given that it works with the built-in str.split.

In [23]: 'qweqwejpgqweqwe.jpg'.split(".jpg")[0]
Out[23]: 'qweqwejpgqweqwe'

For your case, you can escape the dot and it should work (.jpg -> \.jpg).

In [1]: import pandas as pd
   ...: 
   ...: df = pd.DataFrame(columns=['url'])
   ...: df['url'] = ['https://qweqwe.com/2021-09-14/qweqwejpgqweqwe.jpg']
   ...: df['id'] = df['url'].str.split('/').str[-2].astype(str) + '_' + df['url'].str.split('/').str[-1].str.split(r'\.jpg').str[0]
   ...: print(df['id'].iloc[0])
   ...: 
2021-09-14_qweqwejpgqweqwe
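
If the separator comes from a variable rather than a hand-written literal, the escaping can be done programmatically. The snippet below is a minimal sketch using only the standard-library `re` module (the names `suffix`, `name`, `escaped`, and `stem` are illustrative, not from the issue). It assumes the pattern ends up in `re.split`, which is what the reported behavior suggests for multi-character patterns.

```python
import re

suffix = ".jpg"                      # separator we want to match literally
name = "qweqwejpgqweqwe.jpg"         # last path component from the report

# re.escape backslash-escapes regex metacharacters, so "." no longer
# means "any character".
escaped = re.escape(suffix)

# Split on the escaped pattern and keep the part before the suffix.
stem = re.split(escaped, name)[0]
print(stem)
```

The same escaped pattern could presumably be passed to `.str.split` in place of the hand-escaped `'\.jpg'`.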

@asishm
Contributor

asishm commented Sep 14, 2021

Maybe the documentation could be improved here, but .str.split treats the pattern as a regex whenever len(pat) > 1.

Alternatively, an enhancement could add a regex=True parameter, as already exists for str.replace.

def _str_split(self, pat=None, n=-1, expand=False):
    if pat is None:
        if n is None or n == 0:
            n = -1
        f = lambda x: x.split(pat, n)
    else:
        if len(pat) == 1:
            # single-character pattern: literal split via str.split
            if n is None or n == 0:
                n = -1
            f = lambda x: x.split(pat, n)
        else:
            # longer pattern: compiled and applied as a regular expression
            if n is None or n == -1:
                n = 0
            regex = re.compile(pat)
            f = lambda x: regex.split(x, maxsplit=n)
    return self._str_map(f, dtype=object)
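
To make the effect of the two branches concrete, here is a small standard-library sketch that applies each one by hand to the file name from the report. It only mirrors the calls in the excerpt above and does not go through pandas; the variable names are illustrative.

```python
import re

name = "qweqwejpgqweqwe.jpg"

# Single-character branch: plain str.split, so "." is a literal dot.
literal_parts = name.split(".", -1)

# Multi-character branch: the pattern is compiled as a regex, so the "."
# in ".jpg" matches any character, including the "e" before the first "jpg".
regex_parts = re.compile(".jpg").split(name, maxsplit=0)

print(literal_parts)  # ['qweqwejpgqweqwe', 'jpg']
print(regex_parts)    # ['qweqw', 'qweqwe', '']
```

The first element of the regex result, 'qweqw', is exactly the truncated value in the reported output.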

@asishm
Contributor

asishm commented Sep 14, 2021

Also see related: #37963

@mzeitlin11
Member

I like the idea of adding a regex arg since that would be consistent with replace. It feels arbitrary to decide whether to treat the pattern as a regex based on whether it is a single character.
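
For discussion, the proposed semantics could look roughly like the sketch below. This is a hypothetical pure-Python helper (`split_text` and its signature are made up for illustration), not the pandas implementation. It assumes that `regex=None` keeps the current length-based heuristic, while `regex=True` or `regex=False` lets the caller choose explicitly.

```python
import re
from typing import List, Optional


def split_text(text: str, pat: str, n: int = -1,
               regex: Optional[bool] = None) -> List[str]:
    """Hypothetical helper: split `text` on `pat`, literally or as a regex."""
    if regex is None:
        # Assumed default: single-character patterns are treated as literals,
        # anything else as a regular expression.
        regex = len(pat) != 1
    if regex:
        # re.split uses maxsplit=0 for "no limit".
        return re.split(pat, text, maxsplit=n if n > 0 else 0)
    # str.split uses -1 for "no limit".
    return text.split(pat, n if n > 0 else -1)


name = "qweqwejpgqweqwe.jpg"
print(split_text(name, ".jpg"))               # heuristic: regex
print(split_text(name, ".jpg", regex=False))  # explicit literal split
```

With an explicit `regex=False`, the reporter's original pattern would give the expected stem without any escaping.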

mzeitlin11 added the Enhancement, Needs Discussion, Strings, and API - Consistency labels and removed the Bug and Needs Triage labels on Sep 14, 2021
@saehuihwang
Contributor

take

@oltip

oltip commented Oct 23, 2021

Thank you @mzeitlin11 for your prompt reply.

7 participants