Skip to content

Conversation

IsaacOscar
Copy link
Contributor

@IsaacOscar IsaacOscar commented Sep 27, 2025

This makes some of the changes I suggested in #769.
You should be able to review the changes now, all of which are detailed in the commit messages.
However, I have not done the following:

  1. Add test cases to the last commit (so this commit may not work correctly!)
  2. Add test cases for the match-data->rc field (changed by 1796fb2, however this changes is relied upon by test cases in later commits, and is not user-visible).
  3. Add a check that the options passed to pcre2_match and pcre2_substitute are compatible.
  4. Add/change documentation
  5. Put the new tests in an appropriate file (they're all just in a testdata/testinputNEW and testdata/testoutputNEW file, as I wasn't sure which test file to actually put them in).

Also the commit 6e3a6a5 that adds a -c option to pcre2test is not really related, it's just something I added to help me debug stuff, so I'm happy to remove it or put it in a different pull request if you want.

@IsaacOscar IsaacOscar marked this pull request as draft September 27, 2025 10:20
@IsaacOscar
Copy link
Contributor Author

I would like to add documentation explaining exactly when PCRE2 may return Invalid UTF, and when it won't.

There are various options affecting this, but my understanding from reading the documentation is that the rules are as follows:

  1. I will use the term input to mean the subject of pcre2_match, pcre2_dfa_match, pcre2_jit_match, or pcre2_substitute, or the replacement string of pcre2_substitute).
  2. I will say that a flag is on if it is in effect. regardless of whether it was set in the regex, or passed as a flag to pcre2_match, or some other function.
  3. If the PCRE2_UTF flag is not on, no checks or guarantees occur, and none of the following points apply.
  4. If the PCRE2_NO_UTF_CHECK flag is on and the PCRE2_MATCH_INVALID_UTF each input string is assumed to be valid UTF, if it is not, it is undefined behaviour (and so none of the following points apply)
  5. If the match/substitute produced an error, none of the following points apply.
  6. If the PCRE2_MATCH_INVALID_UTF flag is:
    i. off: each input string is guaranteed to be valid UTF
    ii. on: the replacement string (if any) is guaranteed to be valid UTF
  7. If \C:
    i. is not used in the pattern: the contents of each capture group is guaranteed to be valid UTF
    ii. is used in the pattern: a capture group may start or end in the middle of a UTF character, but all characters in between will be valid UTF. (this is probably not important to mention)

My last commit ensures that the subject and matchdata passed to pcre2_substitute when used with PCRE2_SUBSTITUTE_MATCHED satisfies the above (so you can't use it with UTF that is "more" invalid than was possible from a successful call to pcre2_match).

@NWilson
Copy link
Member

NWilson commented Sep 27, 2025

Thank you very much Isaac! I spent today with family and haven't looked at your PR yet. I will review it in the next day or two.

@NWilson
Copy link
Member

NWilson commented Sep 29, 2025

There are a lot of changes here for one PR.

Thank you!

What I have started doing is working through it commit-by-commit, doing a little testing of my own and cherry-picking the commits, starting with the simplest ones.

@NWilson
Copy link
Member

NWilson commented Sep 29, 2025

I like the changes so far, thanks!

I have merged several of them via cherry-pick.

Unfortunately, some changes I merged last week have generated tons of conflicts for your -c colorise option. It looks like a nice change, but because it's non-essential, I'm just going to mentally put that commit to one side, and come back to it later. Since you've done the work however, I would like to take the commit, after we've finished on pcre2_substitute.

I'm a bit unsure about the BACKCHAR changes. What you've done isn't wrong exactly, but I'm actually hoping to get rid of the places where we go backwards in the subject string. In future, I want to improve PCRE2's ability to handle binary (invalid) input. It certainly is possible to go backwards through invalid UTF-8, but it's actually a rather more complicated loop that involves going back, then forwards again if the bytes were not a valid character. I would therefore prefer to avoid introducing any new calls to BACKCHAR.

I will think about how to do this best in pcre2_substitute.

@IsaacOscar
Copy link
Contributor Author

I'm a bit unsure about the BACKCHAR changes. What you've done isn't wrong exactly, but I'm actually hoping to get rid of the places where we go backwards in the subject string. In future, I want to improve PCRE2's ability to handle binary (invalid) input. It certainly is possible to go backwards through invalid UTF-8, but it's actually a rather more complicated loop that involves going back, then forwards again if the bytes were not a valid character. I would therefore prefer to avoid introducing any new calls to BACKCHAR.

My last commit (which I'm in the process of writing test cases for) uses BACKCHAR and FORWARDCHAR. I'm happy to change that to use ACCROSSCHAR, or something else if you'd prefer.

@IsaacOscar
Copy link
Contributor Author

Unfortunately, some changes I merged last week have generated tons of conflicts for your -c colorise option

Woops, I should've done a rebase. I'm happy to fix this though. (One of my later commits adds some more cprintf calls, but they can be easily changed to fprintf).

@NWilson
Copy link
Member

NWilson commented Sep 29, 2025

Regarding your numbered comment above:

  • I think you are correct on all points
  • If PCRE2_UTF is ON, then the input is treated as UTF (UTF-8 for the 8-bit library); otherwise it is one-byte-to-one-character (basically Latin-1).
  • If PCRE2_MATCH_INVALID_UTF is ON, then it is allowed for UTF subject strings to be invalid. Otherwise it is a match error. Pattern strings must still be valid (if I recall correctly...) However, the matched substrings will still be valid UTF, because PCRE2 just skips right over the "binary" (invalid) sections of the input subject.
  • If PCRE2_NO_UTF_CHECK is ON, then PCRE2 does not do any runtime check that the input is UTF! However it still requires it, and demons may fly out of your nose if you do actually pass in invalid UTF-8. This is not theoretical - you can most definitely generate some lovely out-of-bounds reads (as detected by valgrind) if you tell PCRE2 that the input is valid, but it isn't, and you force PCRE2 to bypass its validation with PCRE2_NO_UTF_CHECK.
  • The replacement string of pcre2_substitute is its own case, in terms of input validity checking.

I also think we should maybe not obsess too much about documenting the current state. I'm increasingly drawn towards a bit of a cleanup of things, to allow PCRE2 to properly/natively handle invalid UTF input throughout. The \C behaviour is dangerous and we're already telling people not to use it, but we could definitely fix it all up and make all the text processing binary-safe throughout. I basically want the core PCRE2 engine to work with a lenient-UTF engine, which can handle any binary buffer of input, but treats legal UTF characters as one unit. (And there would also be the Latin-1 mode, using a one-byte-to-one-character codec.) And finally the validation options simply control whether invalid UTF-8 bytes are treated as an early-exit match error, or matched as if they were U+FFFD.

So I'm tempted not to go wild on this, but just to add a sensible level of safety checking to pcre2_substitute, which would have an immediate benefit. We can add a bit of documentation that says "if you do XYZ your output is guaranteed valid", but we don't have to exhaustively say "the output is valid if and only if you do XYZ and UVW but not ABC".

@NWilson
Copy link
Member

NWilson commented Sep 29, 2025

My last commit (which I'm in the process of writing test cases for) uses BACKCHAR and FORWARDCHAR. I'm happy to change that to use ACCROSSCHAR, or something else if you'd prefer.

I'll just work through your lovely stack of commits one-by-one, and request changes if I want them.

Don't do any extra work now, it'll just take me a couple of days probably to process this.

@IsaacOscar
Copy link
Contributor Author

Don't do any extra work now, it'll just take me a couple of days probably to process this.

Ok, but I still want to finish the test cases for that commit, and add a check for options being the same in the calls to match and substitute. But I won't change anything else.

This also makes RunTest check for memory leaks when using -valgrind.
Specifically, these occured when using PCRE2_COPY_MATCHED_SUBJECT and PCRE2_SUBSTITUTE_MATCHED.
Previously, match_data->subject would store a pointer to a stack
allocated string, but that pointer would be dangling as soon as
pcre2_match/pcre2_dfa_match returned. These pointers were than memcpy'd
(with a size of 0), which is technically undefined behaviour, despite
ot causing any observable problems, even with valgrind.
BACKCHARTEST is a cross between BACKCHAR and FORWARDCHARTEST.
This replaces all uses of ACCROSSCHAR with FORWARDCHARTEST or
BACKWARDCHARTEST.
This makes match_data->rc after a call to pcre2_match, pcre2_jit_match, and
pcre2_dfa_match nire reliable, so that pcre2_substitute with
PCRE2_SUBSTITUTE_MATCHED can abort early.
In particular, pcre2test will colour as follows:
 * Comments from the inputfile are in grey (but not those entered in
   interactively)
 * All other input is in your terminals default colour
 * Messages related to PCRE2 api errors are in magenta
 * Messages related to errors with using pcre2test itself are in red
 * Timing and memory usage information is in blue
 * Normal output is in green
 * The interactive prompt is in blue
 * Anything withought an explicit colour set (e.g. stuff printed by
   valgrind) should be in yellow
The following subject modifiers have been added:
* substitute_subject=<string>
    like replace=<string> but no [...] syntax
    (Note: zero_terminate does NOT apply to substitute_subject, see substitute_zero_terminate)
    This is for use with substitute_matched, it makes pcre2test use the given string for
    the call to pcre2_substitute (instead of the subject passed to pcre2_match)
* null_substitute_subject,
    For use with substitute_matched, this is just like null_subject and null_replacement,
    but applies to the subject paramater of pcre2_substitute.
* substitute_zero_terminate
    For use with substitute_subject, this is like zero_terminate, but applies to the call to pcre2_substitute
    (and not the call to pcre2_match)
* substitute_overwrite:
    This is for use with substitute_subject, it causes the subject passed to pcre2_substitute
    to be located at the same meory address as the subject passed to pcre2_match.
    (But the data at that memory address will be modified to be the value of the substitute_subject modified)
* substitute_offset=<n>
    For use with substitute_matched, this gives a start offset to be passsed to pcre2_sbustitute,
    which can be different to the one passed to pcre2_match (which is set with offset/startoffset=<n>)
* substitute_options=<string>
    a possibly empty list of options seperated by |:
    	no_utf_check|no_jit|endanchored|notbol|noteol|notempty|notempty_atstart
    (i.e. any set of options that both pcre2_match and pcre2_substitute support)
    If this modifier is given, the options passed to pcre2_substitute will be the | of the given options,
    and all substitute-only options).
    This modifier can be set as a pattern modifier, and must be used with the substitute_matched option.
This adds three new error codes to pcre2_substitute when using
PCRE2_SUBSTITUTE_MATCHED:
* PCRE2_ERROR_DIFFERENT_SUBJECT: returned if if the subject pointer you passed is
  different from the prior call to pcre2_match
* PCRE2_ERROR_DIFFERENT_LENGTH: if the computed length differs from the
  pcre2_match call (i.e. after processing PCRE2_ZERO_TERMINATED)
* PCRE2_ERROR_DIFFERENT_OFFSET if the start offset differs from the
  pcre2_match call.
Specficially, when pcre2_match is called with PCRE2_COPY_MATCHED_SUBJECT,
and then pcre2_substitute is called with PCRE2_SUBSTITUTE_MATCHED,
the subject data saved by the PCRE2_COPY_MATCHED_SUBJECT call will be used,

In addition, the subject to pcre2_substitute can be NULL.
Moreover, the length can also be PCRE2_ZERO_TERMINATE, and the original
length will also be used.
This is usefull for testing invalid UTF with  the substitute_subject and
replace modifiers. (The normal subject, i.e. part before the \= must
still be valid)
Specficially, If the regex is in PCRE2_UTF mode, and pcre2_match didn't
use PCRE2_COPY_MATCHED_SUBJECT, and pcre2_substitute is using
PCRE2_SUBSTITUTE_MATCHED, the following checks are done:
* if the regex was not compiled with PCRE2_NO_UTF_CHECK or PCRE2_MATCH_INVALID_UTF:
  the subject is checked for UTF validity,
  returning a PCRE2_ERROR_UTF* error code on failure
* each capture capture group is checked to not be in an innapropriate UTF boundary
  returning the new error code PCRE2_ERROR_BADUTFCAPTURE on failure
* if the regex was not compiled with PCRE2_NO_UTF_CHECK, but it was compiled with PCRE2_MATCH_INVALID_UTF
  each capture group is checked entirely for UTF validty,
  returning the new error code PCRE2_ERROR_BADUTFCAPTURE on failure

in all failure cases above, the offset of the problematic code unit is stored in
outputlengthptr

TODO: test
@IsaacOscar
Copy link
Contributor Author

I've pushed an update, with the following changes:

  1. the last commit 1fa6cdf has some bug fixes and now contains tests for UTF-8 (in testdata/testinputNEW8).
  2. there's now a commit before the last one, 78291e0, which makes a change to pcre2test necessary for my UTF-8 tests to work (they contain lots of invalid utf-8).
  3. I fixed a bug in 26d3dee which added BACKCHARTEST (the 16-bit version had a while when it should have had an if).

I have not added UTF-16 and UTF-32 tests yet to the last commit as I haven't worked out how to get pcre2test to pass an invalid UTF-16/32 substitute_subject (I'll likely need to modify pcre2test further).

I have also noticed a couple of potential bugs in other stuff:

  1. I was trying to match the last 2-bytes of 😀 (which is 4 UTF-8 bytes), but the regex \C\C$ doesn't work, instead I had to use /\C\C\K(\C\C) or \C\C\K\C\C$.
  2. The following pcre2test program behaves different with JIT:
/()$/match_invalid_utf
    ∞\xF0\x9F\x98

With pcre2test -8 it reports "No match" (which Is what I expect, as the string doesn't end in valid UTF-8).
With pcre2test -8 -jit however it reports:

	 0: 
	 1: 

(this is the behaviour at commit ddb0df4, the version of master I've based my changes on).

@IsaacOscar
Copy link
Contributor Author

@NWilson, I see in #807 you're merging in my changes to pcre2test, so I won't make any more changes until you've finished with that.
I'll also hold off on writing any documentation until you've confirmed you're happy the user-visible behaviour changes I've made.

But to solve my UTF-16 and UTF-32 test case problem I mentioned above, I would like to change the substitute_subject (and replace while i'm at it) modifiers to support escape sequences just like the normal "subject" (and so it would be an error to put literal invalid UTF-8 bytes, instead you'd have to use the \x escape sequences).

@NWilson
Copy link
Member

NWilson commented Oct 1, 2025

Great, thank you.

But to solve my UTF-16 and UTF-32 test case problem I mentioned above, I would like to change the substitute_subject (and replace while i'm at it) modifiers to support escape sequences just like the normal "subject" (and so it would be an error to put literal invalid UTF-8 bytes, instead you'd have to use the \x escape sequences).

That would be potentially useful, good.

For the UTF checks - I've looked at what you've done, and I think they can be simplified quite a bit.

Instead of using valid_utf() to check entire substrings, we can instead do something much simpler:

def SLICES_CODEPOINT(i) = i < len && !ISLEADBYTE(subject[i])
for (i in range(0, ovector_length)) {
  if (SLICES_CODEPOINT(ovec[i]))
    return error;

@IsaacOscar
Copy link
Contributor Author

For the UTF checks - I've looked at what you've done, and I think they can be simplified quite a bit.

Instead of using valid_utf() to check entire substrings, we can instead do something much simpler:

def SLICES_CODEPOINT(i) = i < len && !ISLEADBYTE(subject[i])
for (i in range(0, ovector_length)) {
  if (SLICES_CODEPOINT(ovec[i]))
    return error;

I'm not sure specifically what tests you are suggesting I change, I have several of them for different cases as some form of invalid-UTF is sometimes possible due to PCRE2_MATCH_INVALID_UTF and \C.

You're check is certainly not sufficient as it doesn't prevent the capture group from ending too early (before a UTF-8 sequence is complete), or from containing total junk inside it.

@NWilson
Copy link
Member

NWilson commented Oct 1, 2025

You're check is certainly not sufficient as it doesn't prevent the capture group from ending too early (before a UTF-8 sequence is complete), or from containing total junk inside it.

Checking that the start and end of the capture group don't slice a codepoint is equivalent to checking that each offset points to a lead code unit (or the end-of-string pointer).

As for whether it contains junk - that's up to the caller. pcre2_substitute() is a string interpolation function, that relies on pcre2_match being called either externally or internally to slice the subject string. If a user does use pcre2_match themselves, but then mutates the buffer ... it's completely the user's responsibility to ensure that they are putting valid UTF-8 in the buffer.

To be honest, it's not clear that we need to do anything at all here to check. We're already being extremely pedantic to verify that the ovector points to valid offsets in the subject (given that we're already verified that the subject buffer is the exact same pointer and length).

Re-checking the subject buffer for UTF-8 validity just seems a little excessive.

@NWilson
Copy link
Member

NWilson commented Oct 1, 2025

In your Rust wrapper - you wanted the safety guarantees that:

  • If using bytes buffers, no out-of-bounds access will occur
  • If using str buffers (known safe UTF-8) then PCRE2 will produce valid UTF-8 output.

But scenario you're mentioning here goes beyond that:

  1. User uses a byte buffer, asks PCRE2 to validate that it's correct UTF-8 (sure, we can do that)
  2. User then mutates the byte buffer, and passes the same pointer in to pcre2_substitute
  3. User then expects us to re-validate for UTF-8 in pcre2_substitute

Basically, PCRE2 only validates for UTF-8 when it must (because internally we have to parse and decode). But pcre2_substitute doesn't do any parsing or interpretation of the subject string, it just uses the ovector to slice the subject up and paste it into the output buffer. So I'd prefer not to insert an unnecessary validation pass through the entire subject string, for a vanishingly rare corner case.

@IsaacOscar
Copy link
Contributor Author

IsaacOscar commented Oct 2, 2025

  • If using bytes buffers, no out-of-bounds access will occur

This is what I wanted: no undefined behaviour if I pass it a valid byte string, even if it's different from the call pcre2_match.

  • If using str buffers (known safe UTF-8) then PCRE2 will produce valid UTF-8 output.

This wasn't something I was doing (as I wasn't using strs), but would be useful to others.

But scenario you're mentioning here goes beyond that:

  1. User uses a byte buffer, asks PCRE2 to validate that it's correct UTF-8 (sure, we can do that)
  2. User then mutates the byte buffer, and passes the same pointer in to pcre2_substitute
  3. User then expects us to re-validate for UTF-8 in pcre2_substitute

Right. Because I was suggesting pcre2 to have a simple guarantee: if you use PCRE2_UTF, and don't use PCRE2_MATCH_INVALID_UTF, PCRE2_NO_UTF_CHECK, or \C, no matter what you do the output will always be valid UTF.
This is violated in the above scenario, because you've tricked PCRE2 into thinking a string is valid UTF when it's not.
Of course we can say in the documentation that the guarantee doesn't hold if you modify the subject between a call to pcre2_match and pcre2_substitute.

Basically, PCRE2 only validates for UTF-8 when it must (because internally we have to parse and decode). But pcre2_substitute doesn't do any parsing or interpretation of the subject string, it just uses the ovector to slice the subject up and paste it into the output buffer. So I'd prefer not to insert an unnecessary validation pass through the entire subject string, for a vanishingly rare corner case.

The extra check would also work for the following use case in Rust.

  1. You have a byte string that you haven't checked for UTF validity, you pass it to pcre2_match with PCRE2_UTF (and not PCRE2_MATCH_INVALID_UTF, PCRE2_NO_UTF_CHECK, or \C), which finds a match.
  2. Some time later you pass the same byte string to pcre2_substitute, but you haven't got the Rust type system to ensure it's the same string (as that is not easy to do in general).
  3. You then convert the output of pcre2_substitute to a UTF-8 str object.

Now that 3rd step will require either a UTF validity check, or unsafe Rust code to skip the check.
I guess I was thinking that having PCRE2 do the check on the input might be faster than doing it on the output. But that's really just a guess, so it's ok here if the Rust code has to do the check itself.

@IsaacOscar
Copy link
Contributor Author

You're check is certainly not sufficient as it doesn't prevent the capture group from ending too early (before a UTF-8 sequence is complete), or from containing total junk inside it.

Checking that the start and end of the capture group don't slice a codepoint is equivalent to checking that each offset points to a lead code unit (or the end-of-string pointer).

Woops, I misread your code as only checking the start of each capture group, but it's actually checking every element of the ovector which includes the start and end.

@IsaacOscar
Copy link
Contributor Author

IsaacOscar commented Oct 2, 2025

If we only want to do the minimal checks necessary to support the Rust str subject -> str output case (without requiring the Rust actually do a UTF-8 validity check on the output).
Then we can change my massive if block to the much simpler:

if (match_data->rc > PCRE2_ERROR_NOMATCH && utf && (match_data->flags & PCRE2_MD_COPIED_SUBJECT) == 0) &&
	/* Don't do any checks if \C was used */
	(code->flags & PCRE2_HASBKC) == 0)) 
	{
	for (int i = 0; i < 2*pairs; i++)
		{
			PCRE2_SIZE offset = match_data->ovector[i];
			if (offset == PCRE2_UNSET || offset >= length) continue;
			if (NOT_FIRSTCU(*offset))
				{
				*blength = offset; /* So the caller knows where the error occurred */
				rc = PCRE2_ERROR_BADUTFCAPTURE;
				goto EXIT;
				}
		}
	}

(And delete the now unnecessary BADUTFCAPTURE label).

I'll have to update the test cases of course though,

@NWilson
Copy link
Member

NWilson commented Oct 3, 2025

  1. I was trying to match the last 2-bytes of 😀 (which is 4 UTF-8 bytes), but the regex \C\C$ doesn't work, instead I had to use /\C\C\K(\C\C) or \C\C\K\C\C$.

I can guess where this comes from. I presume the bumpalong (which tries matching at each character position in turn) doesn't see" that the pattern starts with a \C, and that it should therefore attempt a match at each byte position.

In my opinion, the \C behaviour is just ... messy and I want to re-do it at some point in the future.

  1. The following pcre2test program behaves different with JIT:
/()$/match_invalid_utf
    ∞\xF0\x9F\x98

With pcre2test -8 it reports "No match" (which Is what I expect, as the string doesn't end in valid UTF-8).

Interesting. I would expect it to report a match. I guess the MATCH_INVALID_UTF support has forgotten about the case of empty matches at the end.

Similarly to \C, I want to totally re-do this flag at some point.

@NWilson
Copy link
Member

NWilson commented Oct 3, 2025

@IsaacOscar I have taken all your commits except for:

  • The -c one. We should take that out into its own PR, and maybe dial it down a bit to be more targetted to the output lines we're interested in highlighting
  • Remove UTF-8 validation for modifiers. Actually I'd prefer to keep UTF-8 validation but instead implement \x for replacement=<str> and substitute_subject=<str>.
  • The final commit, adding the UTF-8 validation for pcre2_substitute.

Would you like to do a rebase on this PR?

Or would you like me to cherry-pick myself... with the risk that I rework it a bit?

We're basically down to the last commit, and I'm looking to keep it as simple as possible, to sanity-check the subject string against the ovector in a strategic (limited) way. So hopefully, this final bit of work to merge can be kept fairly small.

We may not even need any support for invalid UTF in substitute_subject. We can just do a test with changing "zzz😀" to "😀zzz" and ensuring the ovector is rejected if it slices the mutated input.

@IsaacOscar
Copy link
Contributor Author

We're basically down to the last commit, and I'm looking to keep it as simple as possible, to sanity-check the subject string against the ovector in a strategic (limited) way. So hopefully, this final bit of work to merge can be kept fairly small.

We may not even need any support for invalid UTF in substitute_subject. We can just do a test with changing "zzz😀" to "😀zzz" and ensuring the ovector is rejected if it slices the mutated input.

@IsaacOscar I have taken all your commits except for:

  • The -c one. We should take that out into its own PR, and maybe dial it down a bit to be more targetted to the output lines we're interested in highlighting
  • Remove UTF-8 validation for modifiers. Actually I'd prefer to keep UTF-8 validation but instead implement \x for replacement=<str> and substitute_subject=<str>.
  • The final commit, adding the UTF-8 validation for pcre2_substitute.

Would you like to do a rebase on this PR?

Or would you like me to cherry-pick myself... with the risk that I rework it a bit?

We're basically down to the last commit, and I'm looking to keep it as simple as possible, to sanity-check the subject string against the ovector in a strategic (limited) way. So hopefully, this final bit of work to merge can be kept fairly small.

We may not even need any support for invalid UTF in substitute_subject. We can just do a test with changing "zzz😀" to "😀zzz" and ensuring the ovector is rejected if it slices the mutated input.

I that case I'll rebase to master and do point 1 and 3, and drop the commit supporting invalid UTF.

@NWilson
Copy link
Member

NWilson commented Oct 14, 2025

@IsaacOscar I started a branch the other day to take your final commit, with the UTF-8 validation.

However, I just felt uneasy about it, especially for the release I'm planning this week or next.

I'm struggling to provide precise guarantees for the interpreter & JIT on exactly how much UTF-8 validation is done, and the exact conditions that could cause ovector offsets not to point to valid start characters. There are several uses of BACKCHAR in the matching code which worry me a little.

I have added some assertions to explore, but I don't want to commit those without more certainty.

I think for now, I'm happy with the improvements you've done that are already merged, and at some point down the line, I'd be ready to revisit whether it's appropriate to add any UTF validation on the subject in pcre2_substitute().

Is that reasonable for you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants