\K in lookbehind/lookahead should be always invalid #736

Open
mvorisek opened this issue Mar 26, 2025 · 5 comments
Labels
bug Something isn't working

@mvorisek

\K in lookbehind/lookahead should always be invalid, as it affects only "the global capture group", but lookbehind/lookahead do not participate in global capture group matching.

repro: https://3v4l.org/W9s8q

@NWilson
Member

NWilson commented Mar 26, 2025

What is the "global capture group"? I assume you mean, the span of text matched by the expression (ovector[0] to ovector[1], or $matches[0]). Your example does not use "global matching" (that is, searching for all matches in a string).
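
To make the ovector[0]/ovector[1] span above concrete, here is a minimal Python sketch of how \K moves the reported start of a match. It is a toy model, not PCRE2's implementation: `MatchState`, `literal`, `reset_start`, and `overall_match` are hypothetical names invented for this illustration, and the pattern `foo\Kbar` is hand-simulated rather than compiled.

```python
from dataclasses import dataclass


@dataclass
class MatchState:
    """Hypothetical toy matcher state; not how PCRE2 is implemented."""
    subject: str
    start: int = 0  # plays the role of ovector[0]
    pos: int = 0    # current position; plays the role of ovector[1] on success

    def literal(self, text: str) -> None:
        # Consume a literal string at the current position, or fail.
        if not self.subject.startswith(text, self.pos):
            raise ValueError(f"no match for {text!r} at offset {self.pos}")
        self.pos += len(text)

    def reset_start(self) -> None:
        # Model of \K: move the reported match start to the current position.
        self.start = self.pos

    def overall_match(self) -> str:
        # Model of $matches[0]: the text between the two offsets.
        return self.subject[self.start:self.pos]


# Hand-simulating the pattern foo\Kbar against the subject "foobar".
state = MatchState("foobar")
state.literal("foo")
state.reset_start()
state.literal("bar")
print(state.overall_match())  # bar
```

The whole pattern still has to match "foobar", but only the part after \K is reported as the overall match.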

Lookbehind and lookahead do participate in all match attempts.

The behaviour of \K inside lookaround assertions (and also scan_substring :) ) is extremely weird and unexpected. Since PCRE2 10.38, the default PCRE2 behaviour has changed to disallow \K inside assertions. However, there is a backwards-compatibility option to restore the historic behaviour, which we recommend against using.

PHP is clearly passing this flag. Perhaps the PHP developers would consider updating their behaviour to the new one. However, this would break any PHP code which was relying on using \K inside lookaround. In practice, I expect this would affect almost no-one.

PCRE2 did not want to force any breaking changes on users, no matter how small, so the PHP developers can choose whether to make this change in their own time, or never.

@mvorisek
Author

mvorisek commented Mar 26, 2025

> What is the "global capture group"? I assume you mean, the span of text matched by the expression (ovector[0] to ovector[1], or $matches[0]). Your example does not use "global matching" (that is, searching for all matches in a string).

Yes, I mean "global capture group" == $matches[0].

> The behaviour of \K inside lookaround assertions (and also scan_substring :) ) is extremely weird and unexpected.

That is the point of this issue.

> Since PCRE2 10.38, the default PCRE2 behaviour has changed to disallow \K inside assertions. However, there is a backwards-compatibility option to restore the historic behaviour, which we recommend against using.
>
> PHP is clearly passing this flag. Perhaps the PHP developers would consider updating their behaviour to the new one. However, this would break any PHP code which was relying on using \K inside lookaround. In practice, I expect this would affect almost no-one.

What is the flag, and can I list or check the flags at runtime? Are there docs where I can see all the flags with examples?

> In practice, I expect this would affect almost no-one.

I see, https://3v4l.org/g4Ldi works. I am closing this issue, as this is probably all expected for BC reasons.

@NWilson
Member

NWilson commented Mar 26, 2025

The flag is PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK.

Its effect is documented here: https://pcre2project.github.io/pcre2/doc/pcre2api/

I thought you had maybe seen that documentation already, since only last week I updated it to be much more precise in its description of PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK.

You should open an issue on the PHP bugtracker, to see if they would consider removing their use of this (semi-deprecated) flag.

@zherczeg
Collaborator

My two cents: these flags do not have much effect.

Here is a \K side effect that does not require placing \K inside an assertion:

  re> /(?=.{10}(?1))x(\K){0}/
data> x1234567890
Start of matched string is beyond its end - displaying from end to start.
 0: 123456789
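
One way to read the output above, offered as an interpretation rather than something stated in the thread: the lookahead starts at offset 0, `.{10}` advances to offset 10, and the call `(?1)` runs \K there, setting the match start to 10. The assertion then succeeds, the position returns to 0, `x` matches and the match ends at offset 1, and `(\K){0}` is never executed inline. Under that assumption the reported span is start 10, end 1. The Python sketch below only models how such a reversed span might be displayed; `describe_span` is a hypothetical helper, not part of pcre2test or the PCRE2 API.

```python
from typing import Tuple


def describe_span(subject: str, start: int, end: int) -> Tuple[str, bool]:
    """Return (text, was_reversed) for a reported match span.

    Assumption: when start > end, the text between the two offsets is
    shown from end to start, mirroring the pcre2test message above.
    """
    if start > end:
        return subject[end:start], True
    return subject[start:end], False


# Offsets 10 and 1 are assumed from the trace described in the lead-in.
text, was_reversed = describe_span("x1234567890", 10, 1)
print(text, was_reversed)  # 123456789 True
```

A caller that slices the subject with start and end directly, without a check like this, would see an empty or invalid span in this situation.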

@NWilson
Member

NWilson commented Mar 26, 2025

Ouch! That's very bad, Zoltan. Hmm. We should fix that... somehow. It would be possible (but nasty) to do it at compile time, by building a graph of the parts of the pattern visited by assertions, or else to do it at run time somehow...

Yuck yuck yuck.

@NWilson reopened this Mar 26, 2025
nielsdos pushed a commit to php/php-src that referenced this issue Mar 31, 2025
This option is semi-deprecated [1] and shouldn't influence much anyway.
The anticipated BC break is low.

[1] PCRE2Project/pcre2#736 (comment)
[2] PCRE2Project/pcre2#736 (comment)

Closes GH-18150.
@NWilson added the bug (Something isn't working) label Apr 2, 2025
@NWilson added this to the 10.46 milestone Apr 3, 2025