DRAFT: Add safety checks when using `PCRE2_SUBSTITUTE_MATCHED` #806

IsaacOscar · 2025-09-27T10:19:46Z

This makes some of the changes I suggested in #769.
You should be able to review the changes now, all of which are detailed in the commit messages.
However, I have not done the following:

Add test cases to the last commit (so this commit may not work correctly!)
Add test cases for the match_data->rc field (changed by 1796fb2; however, this change is relied upon by test cases in later commits, and is not user-visible).
Add a check that the options passed to pcre2_match and pcre2_substitute are compatible.
Add/change documentation
Put the new tests in an appropriate file (they're all just in a testdata/testinputNEW and testdata/testoutputNEW file, as I wasn't sure which test file to actually put them in).

Also the commit 6e3a6a5 that adds a -c option to pcre2test is not really related, it's just something I added to help me debug stuff, so I'm happy to remove it or put it in a different pull request if you want.

IsaacOscar · 2025-09-27T13:44:59Z

I would like to add documentation explaining exactly when PCRE2 may return Invalid UTF, and when it won't.

There are various options affecting this, but my understanding from reading the documentation is that the rules are as follows:

I will use the term input to mean the subject of pcre2_match, pcre2_dfa_match, pcre2_jit_match, or pcre2_substitute, or the replacement string of pcre2_substitute.
I will say that a flag is on if it is in effect, regardless of whether it was set in the regex, or passed as a flag to pcre2_match or some other function.
If the PCRE2_UTF flag is not on, no checks or guarantees occur, and none of the following points apply.
If the PCRE2_NO_UTF_CHECK flag is on and the PCRE2_MATCH_INVALID_UTF flag is off, each input string is assumed to be valid UTF; if it is not, the behaviour is undefined (and so none of the following points apply).
If the match/substitute produced an error, none of the following points apply.
If the PCRE2_MATCH_INVALID_UTF flag is:
i. off: each input string is guaranteed to be valid UTF
ii. on: the replacement string (if any) is guaranteed to be valid UTF
If \C:
i. is not used in the pattern: the contents of each capture group is guaranteed to be valid UTF
ii. is used in the pattern: a capture group may start or end in the middle of a UTF character, but all characters in between will be valid UTF. (this is probably not important to mention)

My last commit ensures that the subject and match data passed to pcre2_substitute when used with PCRE2_SUBSTITUTE_MATCHED satisfy the above (so you can't use it with UTF that is "more" invalid than was possible from a successful call to pcre2_match).
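
As a rough illustration, the capture-group part of the rules above can be read as a small decision function. This is only a sketch of the list as stated: the `OPT_*` bits and `captures_guaranteed_valid_utf` are hypothetical stand-ins rather than PCRE2 API, a negative `rc` is assumed to mean "error or no match", and any use of the no-check flag is conservatively treated as "no guarantee".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the option bits discussed above.
   These are NOT the real PCRE2 constants. */
enum {
  OPT_UTF               = 1u << 0,
  OPT_NO_UTF_CHECK      = 1u << 1,
  OPT_MATCH_INVALID_UTF = 1u << 2,
  OPT_ALL               = (1u << 3) - 1
};

/* Returns true when, under the rules listed above, every capture group
   is expected to contain only valid UTF.
   options          - the flags that are "on" (in effect)
   pattern_uses_bsC - whether the pattern contains \C
   rc               - result of the match; negative is assumed to mean
                      an error or no match (an assumption of this sketch) */
static bool captures_guaranteed_valid_utf(uint32_t options,
                                          bool pattern_uses_bsC, int rc)
{
  assert((options & ~(uint32_t)OPT_ALL) == 0);    /* only known bits */
  if (!(options & OPT_UTF)) return false;         /* no checks, no guarantees */
  if (options & OPT_NO_UTF_CHECK) return false;   /* validity is the caller's job */
  if (rc < 0) return false;                       /* nothing applies after an error */
  /* OPT_MATCH_INVALID_UTF does not change the answer: per the list,
     it only affects which input strings are guaranteed valid. */
  return !pattern_uses_bsC;
}
```

For example, `captures_guaranteed_valid_utf(OPT_UTF | OPT_MATCH_INVALID_UTF, false, 1)` is true, while the same call with `pattern_uses_bsC` set to true is false.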

NWilson · 2025-09-27T20:33:18Z

Thank you very much Isaac! I spent today with family and haven't looked at your PR yet. I will review it in the next day or two.

NWilson · 2025-09-29T10:36:24Z

There are a lot of changes here for one PR.

Thank you!

What I have started doing is working through it commit-by-commit, doing a little testing of my own and cherry-picking the commits, starting with the simplest ones.

NWilson · 2025-09-29T13:19:56Z

I like the changes so far, thanks!

I have merged several of them via cherry-pick.

Unfortunately, some changes I merged last week have generated tons of conflicts for your -c colorise option. It looks like a nice change, but because it's non-essential, I'm just going to mentally put that commit to one side, and come back to it later. Since you've done the work however, I would like to take the commit, after we've finished on pcre2_substitute.

I'm a bit unsure about the BACKCHAR changes. What you've done isn't wrong exactly, but I'm actually hoping to get rid of the places where we go backwards in the subject string. In future, I want to improve PCRE2's ability to handle binary (invalid) input. It certainly is possible to go backwards through invalid UTF-8, but it's actually a rather more complicated loop that involves going back, then forwards again if the bytes were not a valid character. I would therefore prefer to avoid introducing any new calls to BACKCHAR.

I will think about how to do this best in pcre2_substitute.
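
For illustration, the "go back, then forwards again" loop for possibly invalid UTF-8 might look roughly like this. It is a sketch only: `utf8_len_from_lead` and `step_back` are hypothetical helpers, not PCRE2 macros, and the check is purely structural (it does not reject overlong three- and four-byte forms or surrogates).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sequence length implied by a UTF-8 lead byte, or 0 if the byte cannot
   start a sequence (continuation byte, 0xC0/0xC1, or above 0xF4). */
static size_t utf8_len_from_lead(uint8_t b)
{
  if (b < 0x80) return 1;
  if (b >= 0xC2 && b <= 0xDF) return 2;
  if (b >= 0xE0 && b <= 0xEF) return 3;
  if (b >= 0xF0 && b <= 0xF4) return 4;
  return 0;
}

/* Given pos > 0, return the start of the unit that ends at pos: either a
   structurally complete UTF-8 sequence, or a single byte if the bytes
   before pos do not form one. */
static size_t step_back(const uint8_t *s, size_t pos)
{
  assert(s != NULL && pos > 0);
  size_t start = pos - 1;
  size_t limit = pos >= 4 ? pos - 4 : 0;

  /* Go back: skip at most three continuation bytes. */
  while (start > limit && (s[start] & 0xC0) == 0x80) start--;

  /* Go forwards again: the candidate lead byte must announce a sequence
     that ends exactly at pos. Every byte in between is already known to
     be a continuation byte because the loop above stepped over it. */
  size_t n = utf8_len_from_lead(s[start]);
  if (n != 0 && start + n == pos) return start;

  /* Otherwise treat the last byte as a one-byte invalid unit. */
  return pos - 1;
}
```

With the bytes `61 E2 82 AC` ("a" followed by a euro sign), stepping back from offset 4 lands on offset 1; with the truncated sequence `F0 9F 98`, stepping back from offset 3 only moves to offset 2.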

IsaacOscar · 2025-09-29T13:23:16Z

> I'm a bit unsure about the BACKCHAR changes. What you've done isn't wrong exactly, but I'm actually hoping to get rid of the places where we go backwards in the subject string. In future, I want to improve PCRE2's ability to handle binary (invalid) input. It certainly is possible to go backwards through invalid UTF-8, but it's actually a rather more complicated loop that involves going back, then forwards again if the bytes were not a valid character. I would therefore prefer to avoid introducing any new calls to BACKCHAR.

My last commit (which I'm in the process of writing test cases for) uses BACKCHAR and FORWARDCHAR. I'm happy to change that to use ACCROSSCHAR, or something else if you'd prefer.

IsaacOscar · 2025-09-29T13:24:56Z

> Unfortunately, some changes I merged last week have generated tons of conflicts for your -c colorise option

Woops, I should've done a rebase. I'm happy to fix this though. (One of my later commits adds some more cprintf calls, but they can be easily changed to fprintf).

NWilson · 2025-09-29T13:36:41Z

Regarding your numbered comment above:

I think you are correct on all points
If PCRE2_UTF is ON, then the input is treated as UTF (UTF-8 for the 8-bit library); otherwise it is one-byte-to-one-character (basically Latin-1).
If PCRE2_MATCH_INVALID_UTF is ON, then it is allowed for UTF subject strings to be invalid. Otherwise it is a match error. Pattern strings must still be valid (if I recall correctly...) However, the matched substrings will still be valid UTF, because PCRE2 just skips right over the "binary" (invalid) sections of the input subject.
If PCRE2_NO_UTF_CHECK is ON, then PCRE2 does not do any runtime check that the input is UTF! However it still requires it, and demons may fly out of your nose if you do actually pass in invalid UTF-8. This is not theoretical - you can most definitely generate some lovely out-of-bounds reads (as detected by valgrind) if you tell PCRE2 that the input is valid, but it isn't, and you force PCRE2 to bypass its validation with PCRE2_NO_UTF_CHECK.
The replacement string of pcre2_substitute is its own case, in terms of input validity checking.

I also think we should maybe not obsess too much about documenting the current state. I'm increasingly drawn towards a bit of a cleanup of things, to allow PCRE2 to properly/natively handle invalid UTF input throughout. The \C behaviour is dangerous and we're already telling people not to use it, but we could definitely fix it all up and make all the text processing binary-safe throughout. I basically want the core PCRE2 engine to work with a lenient-UTF engine, which can handle any binary buffer of input, but treats legal UTF characters as one unit. (And there would also be the Latin-1 mode, using a one-byte-to-one-character codec.) And finally the validation options simply control whether invalid UTF-8 bytes are treated as an early-exit match error, or matched as if they were U+FFFD.

So I'm tempted not to go wild on this, but just to add a sensible level of safety checking to pcre2_substitute, which would have an immediate benefit. We can add a bit of documentation that says "if you do XYZ your output is guaranteed valid", but we don't have to exhaustively say "the output is valid if and only if you do XYZ and UVW but not ABC".
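
To make the "lenient-UTF" idea concrete, here is one possible shape for a decoder that accepts any byte buffer, treats a well-formed UTF-8 sequence as one unit, and reports anything else as U+FFFD. It is a sketch under assumptions, not PCRE2 code: `decode_lenient` is a hypothetical name, and "one U+FFFD per invalid byte" is just one of several reasonable policies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT 0xFFFDu

/* Decode the unit starting at s[*pos] (requires *pos < len) and advance
   *pos past it. A well-formed UTF-8 sequence yields its code point; any
   other byte yields U+FFFD and consumes exactly one byte. */
static uint32_t decode_lenient(const uint8_t *s, size_t len, size_t *pos)
{
  assert(s != NULL && pos != NULL && *pos < len);
  const size_t i = *pos;
  const uint8_t b = s[i];
  size_t n = 0;       /* expected sequence length; 0 means "not a lead byte" */
  uint32_t cp = 0;
  uint32_t min = 0;   /* smallest code point allowed for this length */

  *pos = i + 1;       /* default: consume exactly one byte */
  if (b < 0x80) return b;
  if ((b & 0xE0) == 0xC0)      { n = 2; cp = b & 0x1Fu; min = 0x80; }
  else if ((b & 0xF0) == 0xE0) { n = 3; cp = b & 0x0Fu; min = 0x800; }
  else if ((b & 0xF8) == 0xF0) { n = 4; cp = b & 0x07u; min = 0x10000; }

  /* Stray continuation byte, invalid lead byte, or truncated sequence. */
  if (n == 0 || len - i < n) return REPLACEMENT;

  for (size_t k = 1; k < n; k++) {
    if ((s[i + k] & 0xC0) != 0x80) return REPLACEMENT;
    cp = (cp << 6) | (s[i + k] & 0x3Fu);
  }

  /* Reject overlong forms, surrogates, and values above U+10FFFF. */
  if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return REPLACEMENT;

  *pos = i + n;
  return cp;
}
```

Because the function never reads past `len` and always advances by at least one byte, a loop over it terminates on arbitrary binary input, which is the binary-safety property being discussed.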

NWilson · 2025-09-29T13:45:50Z

> My last commit (which I'm in the process of writing test cases for) uses BACKCHAR and FORWARDCHAR. I'm happy to change that to use ACCROSSCHAR, or something else if you'd prefer.

I'll just work through your lovely stack of commits one-by-one, and request changes if I want them.

Don't do any extra work now, it'll just take me a couple of days probably to process this.

IsaacOscar · 2025-09-29T13:49:42Z

> Don't do any extra work now, it'll just take me a couple of days probably to process this.

Ok, but I still want to finish the test cases for that commit, and add a check for options being the same in the calls to match and substitute. But I won't change anything else.

This also makes RunTest check for memory leaks when using -valgrind.

Specifically, these occurred when using PCRE2_COPY_MATCHED_SUBJECT and PCRE2_SUBSTITUTE_MATCHED.

Previously, match_data->subject would store a pointer to a stack-allocated string, but that pointer would be dangling as soon as pcre2_match/pcre2_dfa_match returned. These pointers were then memcpy'd (with a size of 0), which is technically undefined behaviour, despite not causing any observable problems, even with valgrind.

BACKCHARTEST is a cross between BACKCHAR and FORWARDCHARTEST.

This replaces all uses of ACCROSSCHAR with FORWARDCHARTEST or BACKWARDCHARTEST.

This makes match_data->rc after a call to pcre2_match, pcre2_jit_match, and pcre2_dfa_match more reliable, so that pcre2_substitute with PCRE2_SUBSTITUTE_MATCHED can abort early.

In particular, pcre2test will colour as follows:
* Comments from the input file are in grey (but not those entered interactively)
* All other input is in your terminal's default colour
* Messages related to PCRE2 API errors are in magenta
* Messages related to errors with using pcre2test itself are in red
* Timing and memory usage information is in blue
* Normal output is in green
* The interactive prompt is in blue
* Anything without an explicit colour set (e.g. stuff printed by valgrind) should be in yellow

The following subject modifiers have been added:
* substitute_subject=<string>: like replace=<string> but with no [...] syntax. (Note: zero_terminate does NOT apply to substitute_subject; see substitute_zero_terminate.) This is for use with substitute_matched; it makes pcre2test use the given string for the call to pcre2_substitute (instead of the subject passed to pcre2_match).
* null_substitute_subject: for use with substitute_matched; this is just like null_subject and null_replacement, but applies to the subject parameter of pcre2_substitute.
* substitute_zero_terminate: for use with substitute_subject; this is like zero_terminate, but applies to the call to pcre2_substitute (and not the call to pcre2_match).
* substitute_overwrite: for use with substitute_subject; it causes the subject passed to pcre2_substitute to be located at the same memory address as the subject passed to pcre2_match. (But the data at that memory address will be modified to be the value of the substitute_subject modifier.)
* substitute_offset=<n>: for use with substitute_matched; this gives a start offset to be passed to pcre2_substitute, which can be different from the one passed to pcre2_match (which is set with offset/startoffset=<n>).
* substitute_options=<string>: a possibly empty list of options separated by |: no_utf_check|no_jit|endanchored|notbol|noteol|notempty|notempty_atstart (i.e. any set of options that both pcre2_match and pcre2_substitute support). If this modifier is given, the options passed to pcre2_substitute will be the bitwise OR of the given options and all substitute-only options. This modifier can be set as a pattern modifier, and must be used with the substitute_matched option.

This adds three new error codes to pcre2_substitute when using PCRE2_SUBSTITUTE_MATCHED:
* PCRE2_ERROR_DIFFERENT_SUBJECT: returned if the subject pointer you passed is different from the one in the prior call to pcre2_match
* PCRE2_ERROR_DIFFERENT_LENGTH: returned if the computed length (i.e. after processing PCRE2_ZERO_TERMINATED) differs from the pcre2_match call
* PCRE2_ERROR_DIFFERENT_OFFSET: returned if the start offset differs from the pcre2_match call
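
The order of these three checks can be sketched as follows. Everything here is a stand-in for illustration: `match_record`, `check_same_call`, `ZERO_TERMINATED`, and the numeric error values are hypothetical and are not the definitions used in the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record of what the earlier match call was given. */
typedef struct {
  const char *subject;
  size_t length;        /* length after resolving zero-termination */
  size_t start_offset;
} match_record;

/* Placeholder values; the real error codes are defined by the commit. */
enum {
  CHECK_OK              = 0,
  ERR_DIFFERENT_SUBJECT = -1,
  ERR_DIFFERENT_LENGTH  = -2,
  ERR_DIFFERENT_OFFSET  = -3
};

#define ZERO_TERMINATED (~(size_t)0)   /* stand-in for PCRE2_ZERO_TERMINATED */

/* Compare the arguments of the substitute call with the recorded match
   call: pointer first, then computed length, then start offset. */
static int check_same_call(const match_record *m, const char *subject,
                           size_t length, size_t start_offset)
{
  assert(m != NULL);
  if (subject != m->subject) return ERR_DIFFERENT_SUBJECT;
  if (length == ZERO_TERMINATED)
    length = (subject != NULL) ? strlen(subject) : 0;
  if (length != m->length) return ERR_DIFFERENT_LENGTH;
  if (start_offset != m->start_offset) return ERR_DIFFERENT_OFFSET;
  return CHECK_OK;
}
```

Note that the comparison is on the pointer, not the contents: a second buffer holding the same bytes is still reported as a different subject.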

Specifically, when pcre2_match is called with PCRE2_COPY_MATCHED_SUBJECT, and then pcre2_substitute is called with PCRE2_SUBSTITUTE_MATCHED, the subject data saved by the PCRE2_COPY_MATCHED_SUBJECT call will be used. In addition, the subject passed to pcre2_substitute can be NULL. Moreover, the length can be PCRE2_ZERO_TERMINATED, and the original length will still be used.

This is useful for testing invalid UTF with the substitute_subject and replace modifiers. (The normal subject, i.e. the part before the \=, must still be valid.)

Specifically, if the regex is in PCRE2_UTF mode, and pcre2_match didn't use PCRE2_COPY_MATCHED_SUBJECT, and pcre2_substitute is using PCRE2_SUBSTITUTE_MATCHED, the following checks are done:
* if the regex was not compiled with PCRE2_NO_UTF_CHECK or PCRE2_MATCH_INVALID_UTF: the subject is checked for UTF validity, returning a PCRE2_ERROR_UTF* error code on failure
* each capture group is checked not to be at an inappropriate UTF boundary, returning the new error code PCRE2_ERROR_BADUTFCAPTURE on failure
* if the regex was not compiled with PCRE2_NO_UTF_CHECK, but was compiled with PCRE2_MATCH_INVALID_UTF: each capture group is checked entirely for UTF validity, returning the new error code PCRE2_ERROR_BADUTFCAPTURE on failure

In all failure cases above, the offset of the problematic code unit is stored in outputlengthptr.

TODO: test

IsaacOscar · 2025-09-30T14:37:02Z

I've pushed an update, with the following changes:

the last commit 1fa6cdf has some bug fixes and now contains tests for UTF-8 (in testdata/testinputNEW8).
there's now a commit before the last one, 78291e0, which makes a change to pcre2test necessary for my UTF-8 tests to work (they contain lots of invalid utf-8).
I fixed a bug in 26d3dee which added BACKCHARTEST (the 16-bit version had a while when it should have had an if).

I have not added UTF-16 and UTF-32 tests yet to the last commit as I haven't worked out how to get pcre2test to pass an invalid UTF-16/32 substitute_subject (I'll likely need to modify pcre2test further).

I have also noticed a couple of potential bugs in other stuff:

I was trying to match the last 2 bytes of 😀 (which is 4 UTF-8 bytes), but the regex \C\C$ doesn't work, instead I had to use /\C\C\K(\C\C) or \C\C\K\C\C$.
The following pcre2test program behaves differently with JIT:

/()$/match_invalid_utf
    ∞\xF0\x9F\x98

With pcre2test -8 it reports "No match" (which is what I expect, as the string doesn't end in valid UTF-8).
With pcre2test -8 -jit however it reports:

	 0: 
	 1:

(this is the behaviour at commit ddb0df4, the version of master I've based my changes on).

IsaacOscar · 2025-10-01T14:15:16Z

@NWilson, I see in #807 you're merging in my changes to pcre2test, so I won't make any more changes until you've finished with that.
I'll also hold off on writing any documentation until you've confirmed you're happy with the user-visible behaviour changes I've made.

But to solve the UTF-16 and UTF-32 test case problem I mentioned above, I would like to change the substitute_subject (and replace while I'm at it) modifiers to support escape sequences just like the normal "subject" (and so it would be an error to put literal invalid UTF-8 bytes; instead you'd have to use the \x escape sequences).

NWilson · 2025-10-01T15:16:07Z

Great, thank you.

> But to solve my UTF-16 and UTF-32 test case problem I mentioned above, I would like to change the substitute_subject (and replace while I'm at it) modifiers to support escape sequences just like the normal "subject" (and so it would be an error to put literal invalid UTF-8 bytes, instead you'd have to use the \x escape sequences).

That would be potentially useful, good.

For the UTF checks - I've looked at what you've done, and I think they can be simplified quite a bit.

Instead of using valid_utf() to check entire substrings, we can instead do something much simpler:

def SLICES_CODEPOINT(i) = i < len && !ISLEADBYTE(subject[i])
for (i in range(0, ovector_length)) {
  if (SLICES_CODEPOINT(ovec[i]))
    return error;
}

IsaacOscar · 2025-10-01T15:21:59Z

> For the UTF checks - I've looked at what you've done, and I think they can be simplified quite a bit.
>
> Instead of using valid_utf() to check entire substrings, we can instead do something much simpler:
>
> def SLICES_CODEPOINT(i) = i < len && !ISLEADBYTE(subject[i])
> for (i in range(0, ovector_length)) {
>   if (SLICES_CODEPOINT(ovec[i]))
>     return error;
> }

I'm not sure specifically which tests you are suggesting I change; I have several of them for different cases, as some form of invalid UTF is sometimes possible due to PCRE2_MATCH_INVALID_UTF and \C.

Your check is certainly not sufficient as it doesn't prevent the capture group from ending too early (before a UTF-8 sequence is complete), or from containing total junk inside it.

NWilson · 2025-10-01T18:33:37Z

> Your check is certainly not sufficient as it doesn't prevent the capture group from ending too early (before a UTF-8 sequence is complete), or from containing total junk inside it.

Checking that the start and end of the capture group don't slice a codepoint is equivalent to checking that each offset points to a lead code unit (or the end-of-string pointer).

As for whether it contains junk - that's up to the caller. pcre2_substitute() is a string interpolation function, that relies on pcre2_match being called either externally or internally to slice the subject string. If a user does use pcre2_match themselves, but then mutates the buffer ... it's completely the user's responsibility to ensure that they are putting valid UTF-8 in the buffer.

To be honest, it's not clear that we need to do anything at all here to check. We're already being extremely pedantic to verify that the ovector points to valid offsets in the subject (given that we've already verified that the subject buffer is the exact same pointer and length).

Re-checking the subject buffer for UTF-8 validity just seems a little excessive.
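
For UTF-8, the boundary check described above (every ovector offset must point at a lead code unit or at the end of the string) could be written along these lines. This is a sketch only: `slices_codepoint`, `first_bad_offset`, and `OFFSET_UNSET` are hypothetical names standing in for whatever the real code would use, and only the 8-bit case is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define OFFSET_UNSET (~(size_t)0)   /* stand-in for PCRE2_UNSET */

/* An offset slices a code point if it lies inside the subject and points
   at a UTF-8 continuation byte (10xxxxxx). */
static bool slices_codepoint(const uint8_t *subject, size_t len, size_t off)
{
  return off < len && (subject[off] & 0xC0) == 0x80;
}

/* Scan all 2*pairs ovector entries (starts and ends alike). Returns the
   index of the first entry that slices a code point, or -1 if none do. */
static long first_bad_offset(const uint8_t *subject, size_t len,
                             const size_t *ovector, size_t pairs)
{
  assert(subject != NULL && ovector != NULL);
  for (size_t i = 0; i < 2 * pairs; i++) {
    size_t off = ovector[i];
    if (off == OFFSET_UNSET) continue;   /* group did not participate */
    if (slices_codepoint(subject, len, off)) return (long)i;
  }
  return -1;
}
```

Taking an ovector of {0, 3} produced against "zzz😀" and re-checking it against a buffer mutated to "😀zzz" flags entry 1, because offset 3 now lands on the last byte of the emoji.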

NWilson · 2025-10-01T18:40:23Z

In your Rust wrapper - you wanted the safety guarantees that:

If using bytes buffers, no out-of-bounds access will occur
If using str buffers (known safe UTF-8) then PCRE2 will produce valid UTF-8 output.

But the scenario you're mentioning here goes beyond that:

User uses a byte buffer, asks PCRE2 to validate that it's correct UTF-8 (sure, we can do that)
User then mutates the byte buffer, and passes the same pointer in to pcre2_substitute
User then expects us to re-validate for UTF-8 in pcre2_substitute

Basically, PCRE2 only validates for UTF-8 when it must (because internally we have to parse and decode). But pcre2_substitute doesn't do any parsing or interpretation of the subject string, it just uses the ovector to slice the subject up and paste it into the output buffer. So I'd prefer not to insert an unnecessary validation pass through the entire subject string, for a vanishingly rare corner case.
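
The "slice and paste" nature of substitution can be shown with a tiny byte-level splice. This is a simplified sketch, not how pcre2_substitute is implemented: `splice_match` is a hypothetical helper that handles a single match and a literal replacement, with no group references or options.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Build subject[0..start) + repl + subject[end..len) in out.
   Returns the number of bytes needed; nothing is written if out is NULL
   or out_cap is too small. The subject is never decoded, only copied. */
static size_t splice_match(const char *subject, size_t len,
                           size_t start, size_t end,
                           const char *repl, size_t repl_len,
                           char *out, size_t out_cap)
{
  assert(subject != NULL && repl != NULL && start <= end && end <= len);
  const size_t needed = start + repl_len + (len - end);
  if (out == NULL || needed > out_cap) return needed;
  memcpy(out, subject, start);                              /* text before the match */
  memcpy(out + start, repl, repl_len);                      /* replacement */
  memcpy(out + start + repl_len, subject + end, len - end); /* text after the match */
  return needed;
}
```

Since the bytes are copied without interpretation, the output is valid UTF-8 exactly when the three pieces join up into valid UTF-8, which is why the offsets (rather than the whole buffer) are the interesting thing to check.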

IsaacOscar · 2025-10-02T06:26:44Z

> If using bytes buffers, no out-of-bounds access will occur

This is what I wanted: no undefined behaviour if I pass it a valid byte string, even if it's different from the one passed to pcre2_match.

> If using str buffers (known safe UTF-8) then PCRE2 will produce valid UTF-8 output.

This wasn't something I was doing (as I wasn't using strs), but would be useful to others.

> But the scenario you're mentioning here goes beyond that:
>
> User uses a byte buffer, asks PCRE2 to validate that it's correct UTF-8 (sure, we can do that)
> User then mutates the byte buffer, and passes the same pointer in to pcre2_substitute
> User then expects us to re-validate for UTF-8 in pcre2_substitute

Right. I was suggesting that PCRE2 should have a simple guarantee: if you use PCRE2_UTF, and don't use PCRE2_MATCH_INVALID_UTF, PCRE2_NO_UTF_CHECK, or \C, then no matter what you do the output will always be valid UTF.
This is violated in the above scenario, because you've tricked PCRE2 into thinking a string is valid UTF when it's not.
Of course we can say in the documentation that the guarantee doesn't hold if you modify the subject between a call to pcre2_match and pcre2_substitute.

> Basically, PCRE2 only validates for UTF-8 when it must (because internally we have to parse and decode). But pcre2_substitute doesn't do any parsing or interpretation of the subject string, it just uses the ovector to slice the subject up and paste it into the output buffer. So I'd prefer not to insert an unnecessary validation pass through the entire subject string, for a vanishingly rare corner case.

The extra check would also work for the following use case in Rust.

You have a byte string that you haven't checked for UTF validity, you pass it to pcre2_match with PCRE2_UTF (and not PCRE2_MATCH_INVALID_UTF, PCRE2_NO_UTF_CHECK, or \C), which finds a match.
Some time later you pass the same byte string to pcre2_substitute, but you haven't got the Rust type system to ensure it's the same string (as that is not easy to do in general).
You then convert the output of pcre2_substitute to a UTF-8 str object.

Now that 3rd step will require either a UTF validity check, or unsafe Rust code to skip the check.
I guess I was thinking that having PCRE2 do the check on the input might be faster than doing it on the output. But that's really just a guess, so it's ok here if the Rust code has to do the check itself.

IsaacOscar · 2025-10-02T06:28:15Z

> Your check is certainly not sufficient as it doesn't prevent the capture group from ending too early (before a UTF-8 sequence is complete), or from containing total junk inside it.
>
> Checking that the start and end of the capture group don't slice a codepoint is equivalent to checking that each offset points to a lead code unit (or the end-of-string pointer).

Woops, I misread your code as only checking the start of each capture group, but it's actually checking every element of the ovector which includes the start and end.

IsaacOscar · 2025-10-02T06:35:27Z

If we only want to do the minimal checks necessary to support the Rust str subject -> str output case (without requiring the Rust code to actually do a UTF-8 validity check on the output), then we can change my massive if block to the much simpler:

if (match_data->rc > PCRE2_ERROR_NOMATCH && utf &&
	(match_data->flags & PCRE2_MD_COPIED_SUBJECT) == 0 &&
	/* Don't do any checks if \C was used */
	(code->flags & PCRE2_HASBKC) == 0)
	{
	for (int i = 0; i < 2*pairs; i++)
		{
		PCRE2_SIZE offset = match_data->ovector[i];
		/* Unset groups and end-of-subject offsets cannot slice a character */
		if (offset == PCRE2_UNSET || offset >= length) continue;
		/* offset is an index, so test the code unit it points at in the subject */
		if (NOT_FIRSTCU(subject[offset]))
			{
			*blength = offset; /* So the caller knows where the error occurred */
			rc = PCRE2_ERROR_BADUTFCAPTURE;
			goto EXIT;
			}
		}
	}

(And delete the now unnecessary BADUTFCAPTURE label).

I'll have to update the test cases too, of course.

NWilson · 2025-10-03T09:20:51Z

> I was trying to match the last 2 bytes of 😀 (which is 4 UTF-8 bytes), but the regex \C\C$ doesn't work, instead I had to use /\C\C\K(\C\C) or \C\C\K\C\C$.

I can guess where this comes from. I presume the bumpalong (which tries matching at each character position in turn) doesn't "see" that the pattern starts with a \C, and that it should therefore attempt a match at each byte position.

In my opinion, the \C behaviour is just ... messy and I want to re-do it at some point in the future.

> The following pcre2test program behaves differently with JIT:
>
> /()$/match_invalid_utf
>     ∞\xF0\x9F\x98
>
> With pcre2test -8 it reports "No match" (which is what I expect, as the string doesn't end in valid UTF-8).

Interesting. I would expect it to report a match. I guess the MATCH_INVALID_UTF support has forgotten about the case of empty matches at the end.

Similarly to \C, I want to totally re-do this flag at some point.

NWilson · 2025-10-03T09:30:16Z

@IsaacOscar I have taken all your commits except for:

The -c one. We should take that out into its own PR, and maybe dial it down a bit to be more targeted to the output lines we're interested in highlighting.
Remove UTF-8 validation for modifiers. Actually I'd prefer to keep UTF-8 validation but instead implement \x for replacement=<str> and substitute_subject=<str>.
The final commit, adding the UTF-8 validation for pcre2_substitute.

Would you like to do a rebase on this PR?

Or would you like me to cherry-pick myself... with the risk that I rework it a bit?

We're basically down to the last commit, and I'm looking to keep it as simple as possible, to sanity-check the subject string against the ovector in a strategic (limited) way. So hopefully, this final bit of work to merge can be kept fairly small.

We may not even need any support for invalid UTF in substitute_subject. We can just do a test with changing "zzz😀" to "😀zzz" and ensuring the ovector is rejected if it slices the mutated input.

IsaacOscar · 2025-10-03T09:35:24Z

> @IsaacOscar I have taken all your commits except for:
>
> The -c one. We should take that out into its own PR, and maybe dial it down a bit to be more targeted to the output lines we're interested in highlighting
>
> Remove UTF-8 validation for modifiers. Actually I'd prefer to keep UTF-8 validation but instead implement \x for replacement=<str> and substitute_subject=<str>.
>
> The final commit, adding the UTF-8 validation for pcre2_substitute.
>
> Would you like to do a rebase on this PR?
>
> Or would you like me to cherry-pick myself... with the risk that I rework it a bit?
>
> We're basically down to the last commit, and I'm looking to keep it as simple as possible, to sanity-check the subject string against the ovector in a strategic (limited) way. So hopefully, this final bit of work to merge can be kept fairly small.
>
> We may not even need any support for invalid UTF in substitute_subject. We can just do a test with changing "zzz😀" to "😀zzz" and ensuring the ovector is rejected if it slices the mutated input.

In that case I'll rebase to master and do points 1 and 3, and drop the commit supporting invalid UTF.

NWilson · 2025-10-14T13:59:32Z

@IsaacOscar I started a branch the other day to take your final commit, with the UTF-8 validation.

However, I just felt uneasy about it, especially for the release I'm planning this week or next.

I'm struggling to provide precise guarantees for the interpreter & JIT on exactly how much UTF-8 validation is done, and the exact conditions that could cause ovector offsets not to point to valid start characters. There are several uses of BACKCHAR in the matching code which worry me a little.

I have added some assertions to explore, but I don't want to commit those without more certainty.

I think for now, I'm happy with the improvements you've done that are already merged, and at some point down the line, I'd be ready to revisit whether it's appropriate to add any UTF validation on the subject in pcre2_substitute().

Is that reasonable for you?

IsaacOscar marked this pull request as draft September 27, 2025 10:20

IsaacOscar force-pushed the SUBSTITUTE_MATCHED branch from ac06d39 to 1eab04d Compare September 27, 2025 13:07

IsaacOscar force-pushed the SUBSTITUTE_MATCHED branch from 1eab04d to 8410fcc Compare September 27, 2025 14:00

IsaacOscar added 12 commits October 1, 2025 00:29

Fixed a memory leak in pcre2_substitute.

a2f7222

This also makes RunTest check for memory leaks when using -valgrind.

Fixed a couple of double frees.

6d5b688

Specifically, these occurred when using PCRE2_COPY_MATCHED_SUBJECT and PCRE2_SUBSTITUTE_MATCHED.

Added a BACKCHARTEST macro.

26d3dee

BACKCHARTEST is a cross between BACKCHAR and FORWARDCHARTEST.

Remove ACCROSSCHAR.

327adb3

This replaces all uses of ACCROSSCHAR with FORWARDCHARTEST or BACKWARDCHARTEST.

Store return codes in match_data.

c8e8cc7

This makes match_data->rc after a call to pcre2_match, pcre2_jit_match, and pcre2_dfa_match more reliable, so that pcre2_substitute with PCRE2_SUBSTITUTE_MATCHED can abort early.

Allow subject modifiers to contain invalid UTF bytes.

78291e0

This is useful for testing invalid UTF with the substitute_subject and replace modifiers. (The normal subject, i.e. the part before the \=, must still be valid.)

IsaacOscar force-pushed the SUBSTITUTE_MATCHED branch from 8410fcc to 1fa6cdf Compare September 30, 2025 14:30
