Add split_inclusive() to API #1096
base: master
For CI rustfmt, your lines are waaaaaaaaay too long. I'd suggest just running For the test failures, you'll want to use
So looking at this, the behavior does not seem to match what What you've implemented here is not really a split "inclusive" (the "inclusive" in std's The confusing part here is that #681 asks for this enum based approach but calls it So what to do? I think a
@BurntSushi Hey again, I agree that it doesn't make much sense to replicate the behavior of So I think we should name it something like As for the return type, I would keep it the same as
let arr = ...; // regex-automata/src/meta/regex.rs#L3716
for (pattern, text, _) in arr {
    println!("Pattern: {:?}, Text: {:?}", pattern, text);
    for (i, sp) in Regex::new(pattern).unwrap().split_inclusive(text).enumerate() {
        let is_match = i % 2 != 0;
        println!("Is match: {:?}, Token: {:?}", is_match, &text[sp]);
    }
    println!();
}
I think it would be better if we add a new method to the struct
I don't know what captures have to do this. There shouldn't be anything about captures related to splitting. Your boolean toggle via You're right that we could add a method on the iterator type itself to add a boolean that correctly indicates a match or not, but I do not like that idea at all. I don't like booleans in general. I'd much rather an enum with one variant corresponding to a
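The enum idea in the comment above can be sketched roughly as follows. This is only an illustration, not the API proposed in the PR: `Piece` and `tag_pieces` are hypothetical names, and the match ranges are assumed to have been computed already (for example from something like `Regex::find_iter`), so the sketch needs nothing beyond the standard library. Empty non-matching pieces are kept here; whether they should be emitted is exactly the open question in this thread.

```rust
use std::ops::Range;

/// Hypothetical token type: the variant says whether a piece is a
/// delimiter match, so callers do not need a boolean or an index parity check.
#[derive(Debug, PartialEq, Eq)]
enum Piece<'h> {
    /// Text between two matches (may be empty when matches are adjacent).
    Text(&'h str),
    /// Text matched by the delimiter pattern.
    Delimiter(&'h str),
}

/// Interleaves non-matching text with the given match ranges.
/// Assumption: `matches` is sorted, non-overlapping, and lies on
/// character boundaries of `haystack`.
fn tag_pieces<'h>(haystack: &'h str, matches: &[Range<usize>]) -> Vec<Piece<'h>> {
    let mut pieces = Vec::with_capacity(matches.len() * 2 + 1);
    let mut last = 0;
    for m in matches {
        // Everything since the previous match is plain text.
        pieces.push(Piece::Text(&haystack[last..m.start]));
        pieces.push(Piece::Delimiter(&haystack[m.start..m.end]));
        last = m.end;
    }
    // Trailing text after the final match.
    pieces.push(Piece::Text(&haystack[last..]));
    pieces
}
```

With a tagged type like this, a caller who does not want empty pieces can filter out `Piece::Text("")` without losing track of which pieces were delimiters.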
I'm not familiar with adjacent matches in regex. Would you be able to provide a test case or example?
@shner-elmo Which raises the question of whether it should emit My sense is that this is the right behavior:
@BurntSushi Thanks, I'll be away for the weekend, but I'll read it properly when I'm back. Have a nice day.
Hey @BurntSushi, sorry for the late reply. imo the behavior should be very simple: the current behavior of I think it should be up to the user to filter the empty non-matching spans.
For me the default behavior is what other libraries are already doing, to not confuse its users. This is how Python handles it:
>>> import re
>>> re.split(r'a', 'aaa')
['', '', '', '']
>>> re.split(r'(a)', 'aaa')
['', 'a', '', 'a', '', 'a', '']
Are there any major libraries that handle it like you described?
This is a PR for something called
fn main() {
    let re = regex::Regex::new(r"a").unwrap();
    let splits = re.split("aaa").collect::<Vec<_>>();
    println!("{splits:?}");
}
has this output:
Your second example utilizes capturing groups and is a result of this documented behavior in Python's
I find Python's behavior with capturing groups here to be quite baffling personally. And I don't really know what it has to do with this PR or even API. Capturing groups haven't been discussed at all to this point.
"should be very simple" isn't going to help us decide anything here.
This doesn't make any sense to me. I think we're at an impasse here. I'd suggest we close this PR and go back to the drawing board collecting specific use cases.
I came here looking for split_inclusive behaviour for a DNA sequence example. However, I solved it by joining the capturing group from captures_iter onto the resulting string from split, so split_inclusive wasn't necessary.
As mentioned in issues: #285, #330, and #681, there is a strong demand for a split function that includes the delimiter (the regex match) inside the output, i.e.:
This pull request does exactly this.
The implementation is very similar to split(), except that the iterator of split_inclusive() will return two elements for each regex match: the first element is the span from the end of the previous match (or the start of the string) up to the match, and the second element is the span from the start of the match to its end.
But of course, we can't return two elements at once (unless we want to return two-element tuples and then flatten them), so we return the first, and the second is saved in the field span_to_yield to be returned on the next call.
I implemented the function and added some tests. Please let me know what you think about the approach; if everything is good to go, I will go ahead and add the documentation and examples.
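The deferred-yield approach described above can be sketched like this. It is a minimal stand-in under stated assumptions, not the code from the PR: `SplitInclusiveSketch` and `split_inclusive_sketch` are hypothetical names, and the iterator walks a precomputed list of match ranges rather than driving a regex search. Only the `span_to_yield` field name comes from the description.

```rust
use std::ops::Range;

/// Hypothetical iterator that yields, for every match, first the span
/// before it and then (on the following call) the match span itself.
struct SplitInclusiveSketch<I> {
    /// Source of match ranges (assumed sorted and non-overlapping).
    matches: I,
    /// End of the previous match, i.e. where the next non-matching span starts.
    last: usize,
    /// Length of the haystack in bytes.
    len: usize,
    /// Match span saved to be returned on the next call.
    span_to_yield: Option<Range<usize>>,
    /// Set once the trailing span has been returned.
    done: bool,
}

impl<I: Iterator<Item = Range<usize>>> SplitInclusiveSketch<I> {
    fn new(matches: I, len: usize) -> Self {
        Self { matches, last: 0, len, span_to_yield: None, done: false }
    }
}

impl<I: Iterator<Item = Range<usize>>> Iterator for SplitInclusiveSketch<I> {
    type Item = Range<usize>;

    fn next(&mut self) -> Option<Range<usize>> {
        // A match saved by the previous call goes out first.
        if let Some(m) = self.span_to_yield.take() {
            return Some(m);
        }
        if self.done {
            return None;
        }
        match self.matches.next() {
            Some(m) => {
                // Yield the text before the match now, the match itself next time.
                let before = self.last..m.start;
                self.last = m.end;
                self.span_to_yield = Some(m);
                Some(before)
            }
            None => {
                // No more matches: emit the trailing span exactly once.
                self.done = true;
                Some(self.last..self.len)
            }
        }
    }
}

/// Hypothetical helper that resolves the yielded spans to string slices.
fn split_inclusive_sketch<'h>(haystack: &'h str, matches: Vec<Range<usize>>) -> Vec<&'h str> {
    SplitInclusiveSketch::new(matches.into_iter(), haystack.len())
        .map(|sp| &haystack[sp])
        .collect()
}
```

In this sketch every match sits at an odd index of the output, which is what the `i % 2 != 0` check in the conversation relies on. Adjacent matches produce empty spans between them.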
Note on the tests:
Some of the test cases were copied from the GH issue; they're all tested in Python to make sure that the output of split_inclusive() is the same as Python's re.split().
I created a gist that has the same tests but in Python: https://gist.github.com/shner-elmo/7cd7c383fa882ab8cda743ec7a689b24
cheers