Stricter syntax around repeat modifiers #765
Comments
This seems like the crux of the matter, but I don't understand why. Could you elaborate? I'll respond more fully to this issue once I have more time and I'm on my workstation. |
TL;DR - It's not really clear to me why we should adopt a "stricter" syntax here, other than consistency with other regex engines. I think the irony of this issue is that the regex crate probably has the most strict regex syntax of any of the libraries you bring up. For example, while this crate rejects (I'm not going to bother checking Hyperscan since I think it's supposed to match PCRE2 precisely, but more so because I don't have an easy way of running it right now.) So with that said, let's go through each of your regexes and I'll say more about them specifically.
The minimal version of this is There is nothing inherently wrong about repeating a boundary assertion. It just doesn't make much sense. However, actually banning all such non-sensical constructions is hard. Go, Python and PCRE2 ban maybe some of the more obvious ones, but it's only surface deep.
Again, the minimal regex here is
Again, same as above. Although there are some interesting differences here. e.g., PCRE2 accepts
They aren't bugs. I intentionally designed the syntax this way. It never made sense to me why these constructs were disallowed even though they were trivial to work around. It's possible that backtracking engines like Python and PCRE2 disallow things like
Correct. Even if I agree with you that these things should be disabled, they can't be because it would break backcompat. It would have to wait for regex 2, which I have no plans of releasing any time soon.
Again, this is the key point. As I asked above, I have no idea why you would want to refuse the syntax. There's really no practical harm that can come from it, and even if you disabled the surface-level versions as Python, Go and PCRE2 do, it would be trivial to work around.
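To make the workaround point concrete, here is a minimal sketch using Python's `re` module, one of the engines mentioned in this thread. It assumes a recent CPython 3; the helper name `compiles` is made up for this example. It only checks whether a pattern is accepted by the parser, not how it matches.

```python
import re


def compiles(pattern: str) -> bool:
    """Hypothetical helper: True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# The surface-level forms are refused by Python's parser...
print("a**     ->", compiles(r"a**"))   # stacked repetition operators
print("^*      ->", compiles(r"^*"))    # repetition applied directly to an anchor

# ...but wrapping the inner expression in a non-capturing group
# expresses the same nesting without tripping the syntactic check.
print("(?:a*)* ->", compiles(r"(?:a*)*"))
print("(?:^)*  ->", compiles(r"(?:^)*"))
```

The same nesting is expressed either way, which is why a purely syntactic ban only catches the most obvious spellings.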
I don't think there is anything in the doc that would suggest they are disallowed. It would seem weird to me to call out this specific case. I think it would be better in a document that tried to note the differences between the regex crate and other regex engines. #497 tracks that (and I've just added a line item about nonsensical repetitions). |
As a side note, I personally would be interested in the history of why some regex engines disallow things like |
Yes. I agree with OP that the examples have nonsensical constructs. But I agree with BurntSushi that any such decision would be arbitrary. But what matters a lot more is that I can take a regex used by one system, put it in Rust code and expect it to do the same thing. Currently this project seems to implement its own Yet Another Dialect. Maybe a reasonable solution is to adopt some existing dialect fully? Or at least in arbitrary questions like these, strive towards compatibility with one particular dialect? That leads to another arbitrary question of which dialect to follow. I think there should be a preference towards a dialect that has a specification and multiple implementations. ECMAScript (ECMA-262) RegExp is one such spec; are there others? The ECMA-262 dialect is also used by the JSON Schema IETF draft (https://json-schema.org/understanding-json-schema/reference/regular_expressions.html) and at one point was mentioned in the OpenAPI spec (OAI/OpenAPI-Specification#1725), although I think they're converging on JSON Schema now. |
It would be nice, but in practice, this is almost never true between any two regex engines.
No, it's not reasonable. And this suggestion seems like a major scope increase from the original topic of this issue. Questions like these basically amount to, "scrap the entire existing project, scrap the design goals, and do a complete rewrite." (I'm not sure if you realized this implication of your question, but either way, questions like that are frustrating to field.) If you need a regex engine implemented in Rust that precisely matches a spec, and you hold that goal above all others, then you need to go write your own independent regex engine.
Yes. POSIX. |
You're overreacting. My point wasn't that Rather, my point was that nobody benefits from frivolous differences between engines. Whether to accept So instead of debating what the correct answer should be, I think it makes more sense to look at what a standard has decided and just go with that.
I find that in practice, it's very often true. Obviously there are differences in implementation capabilities -- whether an engine supports features such as backreferences, lookaround, unbounded lookbehind, etc. But if a regex is supported, it generally behaves the same. This is from experience working with the almost 4000 regexes from https://en.wikipedia.org/wiki/Wikipedia:AutoWikiBrowser/Typos on 3 different implementations (Python's |
I agree.
This only works if the only question is that a simple decision needs to be made. But as I've said in previous comments in this issue, that isn't the only concern.
It's not. I'm sorry that I don't have the time or motivation to prove this statement wrong, but it is. While I did have "capabilities" in mind before, those aren't even remotely close to the only differences.
Well, hang on, yes, of course I agree! The word "generally" is a weasel word. I do not mean that in a bad way. I mean that, yes, if a regex is supported, most regex engines are going to provide the same answer in the vast majority of cases. That fits with "generally." This issue is not about what "generally" works. This is very clearly about a corner case of a pathological syntactic difference. The word "generally" does not fit in this issue.
If I am, then it's only because I misunderstood you, or because you said something you didn't mean. Emphasis mine:
I've added the |
Yes, I did suggest that. Immediately following that, I also offered an alternative suggestion:
"Fully" would be my ideal world. But I would also be happy to see a tentative acknowledgement of some standard.
This got off track, I think we're in agreement here. Yes, it's likely that if you try hard enough, you can find differences between any two engines. My original statement ...
... was just trying to make the case that any change that reduces the differences between engines is an improvement for users. Not necessarily that implementations need to be 100% compatible. |
I can pretty confidently say that this will never happen. |
Upon re-visiting this issue in light of #847, I'm going to close it for the following reasons:
Hello and thank you for your work!
I work on different projects that use the regex crate intensively and I recently started doing some comparisons with other regex engines around different aspects, mainly syntax, performance and memory usage.
That comparison drew my attention to what could be considered "invalid" syntax that is currently accepted by this crate.
I will leave it to you to decide whether these examples should be accepted or not:
- `^*\.google\.com$`: repetition of start anchor: accepted by golang regex engine, refused by hyperscan, refused by python (same question applies to other kinds of repeat modifiers like `?`, `+`)
- `a**\.google\.com$`: multiple consecutive same repeat modifier (you can specify as many as you want): refused by golang, refused by hyperscan, refused by python (same question applies to other repeat modifiers like `?` and `+`, except that `??` has a special meaning).
- `a*+?*+?\.google\.com$`: multiple consecutive different repeat modifiers: refused by golang, refused by hyperscan, refused by python. According to the documentation maybe only `*?`, `+?` and `??` should be accepted?

Version tested:
`re.compile`)

I think PCRE2 will accept 2 repeat modifiers like `**` but not more, however I haven't verified those examples against PCRE2 yet.

If those examples are considered bugs, would it be fixable? There might be backward compatibility concerns here?
If they are not considered bugs, maybe it could be explained/detailed in the doc?
In any case, I would want to refuse such syntax in the systems I operate.
In your opinion, what would be the best approach here? Would you recommend using the regex-syntax crate to parse and analyze the AST to detect such constructs?
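For illustration only, here is a rough sketch of what a pre-filter for these three examples could look like. It is written in Python to keep it self-contained, and it is a hand-rolled character scanner, not the regex-syntax AST walk asked about above. The function name `find_suspicious_repetitions` and the reason strings are made up for this example. It assumes no verbose-mode comments and no nested or `]`-leading character classes.

```python
import re
from typing import List, Tuple

# Counted repetition such as {3}, {2,} or {2,5}.
_COUNTED = re.compile(r"\{\d+(?:,\d*)?\}")
# Escapes treated as zero-width assertions in this sketch (an assumption).
_ASSERTION_ESCAPES = {"b", "B", "A", "z"}


def find_suspicious_repetitions(pattern: str) -> List[Tuple[int, str]]:
    """Return (offset, reason) pairs for stacked or assertion-level repetitions."""
    findings = []
    prev = "start"  # one of: start, atom, anchor, quant, lazy
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        step = 1
        if ch == "\\":
            step = 2  # an escape consumes the next character as well
            if not in_class:
                nxt = pattern[i + 1 : i + 2]
                prev = "anchor" if nxt in _ASSERTION_ESCAPES else "atom"
        elif in_class:
            if ch == "]":
                in_class = False
                prev = "atom"
        elif ch == "[":
            in_class = True
        elif ch in "^$":
            prev = "anchor"
        elif ch == "(":
            # Skip the '?' of group headers such as (?: or (?P<name>.
            if pattern[i + 1 : i + 2] == "?":
                step = 2
            prev = "start"
        elif ch == "|":
            prev = "start"
        else:
            counted = _COUNTED.match(pattern, i) if ch == "{" else None
            if counted or ch in "*+?":
                if counted:
                    step = counted.end() - i
                if prev == "anchor":
                    findings.append((i, "repetition of an assertion"))
                    prev = "quant"
                elif prev == "quant" and ch == "?":
                    prev = "lazy"  # lazy suffix: *?, +?, ??
                elif prev in ("quant", "lazy"):
                    findings.append((i, "stacked repetition"))
                    prev = "quant"
                else:
                    prev = "quant"
            else:
                prev = "atom"
        i += step
    return findings


for example in (r"^*\.google\.com$", r"a**\.google\.com$", r"a*+?*+?\.google\.com$"):
    print(example, find_suspicious_repetitions(example))
```

A scanner like this only sees the surface spelling: it reports nothing for `(?:a*)*`, so catching the grouped forms would need a walk over a parsed AST.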
Thank you for your time :)