-
Notifications
You must be signed in to change notification settings - Fork 238
DRAFT: Add safety checks when using PCRE2_SUBSTITUTE_MATCHED
#806
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
ac06d39
to
1eab04d
Compare
I would like to add documentation explaining exactly when PCRE2 may return Invalid UTF, and when it won't. There are various options affecting this, but my understanding from reading the documentation is that the rules are as follows:
My last commit ensures that the subject and matchdata passed to |
1eab04d
to
8410fcc
Compare
Thank you very much Isaac! I spent today with family and haven't looked at your PR yet. I will review it in the next day or two. |
There are a lot of changes here for one PR. Thank you! What I have started doing is working through it commit-by-commit, doing a little testing of my own and cherry-picking the commits, starting with the simplest ones. |
I like the changes so far, thanks! I have merged several of them via cherry-pick. Unfortunately, some changes I merged last week have generated tons of conflicts for your I'm a bit unsure about the BACKCHAR changes. What you've done isn't wrong exactly, but I'm actually hoping to get rid of the places where we go backwards in the subject string. In future, I want to improve PCRE2's ability to handle binary (invalid) input. It certainly is possible to go backwards through invalid UTF-8, but it's actually a rather more complicated loop that involves going back, then forwards again if the bytes were not a valid character. I would therefore prefer to avoid introducing any new calls to BACKCHAR. I will think about how to do this best in pcre2_substitute. |
My last commit (which I'm in the process of writing test cases for) uses BACKCHAR and FORWARDCHAR. I'm happy to change that to use ACCROSSCHAR, or something else if you'd prefer. |
Woops, I should've done a rebase. I'm happy to fix this though. (One of my later commits adds some more cprintf calls, but they can be easily changed to fprintf). |
Regarding your numbered comment above:
I also think we should maybe not obsess too much about documenting the current state. I'm increasingly drawn towards a bit of a cleanup of things, to allow PCRE2 to properly/natively handle invalid UTF input throughout. The So I'm tempted not to go wild on this, but just to add a sensible level of safety checking to pcre2_substitute, which would have an immediate benefit. We can add a bit of documentation that says "if you do XYZ your output is guaranteed valid", but we don't have to exhaustively say "the output is valid if and only if you do XYZ and UVW but not ABC". |
I'll just work through your lovely stack of commits one-by-one, and request changes if I want them. Don't do any extra work now, it'll just take me a couple of days probably to process this. |
Ok, but I still want to finish the test cases for that commit, and add a check for options being the same in the calls to match and substitute. But I won't change anything else. |
This also makes RunTest check for memory leaks when using -valgrind.
Specifically, these occured when using PCRE2_COPY_MATCHED_SUBJECT and PCRE2_SUBSTITUTE_MATCHED.
Previously, match_data->subject would store a pointer to a stack allocated string, but that pointer would be dangling as soon as pcre2_match/pcre2_dfa_match returned. These pointers were than memcpy'd (with a size of 0), which is technically undefined behaviour, despite ot causing any observable problems, even with valgrind.
BACKCHARTEST is a cross between BACKCHAR and FORWARDCHARTEST.
This replaces all uses of ACCROSSCHAR with FORWARDCHARTEST or BACKWARDCHARTEST.
This makes match_data->rc after a call to pcre2_match, pcre2_jit_match, and pcre2_dfa_match nire reliable, so that pcre2_substitute with PCRE2_SUBSTITUTE_MATCHED can abort early.
In particular, pcre2test will colour as follows: * Comments from the inputfile are in grey (but not those entered in interactively) * All other input is in your terminals default colour * Messages related to PCRE2 api errors are in magenta * Messages related to errors with using pcre2test itself are in red * Timing and memory usage information is in blue * Normal output is in green * The interactive prompt is in blue * Anything withought an explicit colour set (e.g. stuff printed by valgrind) should be in yellow
The following subject modifiers have been added: * substitute_subject=<string> like replace=<string> but no [...] syntax (Note: zero_terminate does NOT apply to substitute_subject, see substitute_zero_terminate) This is for use with substitute_matched, it makes pcre2test use the given string for the call to pcre2_substitute (instead of the subject passed to pcre2_match) * null_substitute_subject, For use with substitute_matched, this is just like null_subject and null_replacement, but applies to the subject paramater of pcre2_substitute. * substitute_zero_terminate For use with substitute_subject, this is like zero_terminate, but applies to the call to pcre2_substitute (and not the call to pcre2_match) * substitute_overwrite: This is for use with substitute_subject, it causes the subject passed to pcre2_substitute to be located at the same meory address as the subject passed to pcre2_match. (But the data at that memory address will be modified to be the value of the substitute_subject modified) * substitute_offset=<n> For use with substitute_matched, this gives a start offset to be passsed to pcre2_sbustitute, which can be different to the one passed to pcre2_match (which is set with offset/startoffset=<n>) * substitute_options=<string> a possibly empty list of options seperated by |: no_utf_check|no_jit|endanchored|notbol|noteol|notempty|notempty_atstart (i.e. any set of options that both pcre2_match and pcre2_substitute support) If this modifier is given, the options passed to pcre2_substitute will be the | of the given options, and all substitute-only options). This modifier can be set as a pattern modifier, and must be used with the substitute_matched option.
This adds three new error codes to pcre2_substitute when using PCRE2_SUBSTITUTE_MATCHED: * PCRE2_ERROR_DIFFERENT_SUBJECT: returned if if the subject pointer you passed is different from the prior call to pcre2_match * PCRE2_ERROR_DIFFERENT_LENGTH: if the computed length differs from the pcre2_match call (i.e. after processing PCRE2_ZERO_TERMINATED) * PCRE2_ERROR_DIFFERENT_OFFSET if the start offset differs from the pcre2_match call.
Specficially, when pcre2_match is called with PCRE2_COPY_MATCHED_SUBJECT, and then pcre2_substitute is called with PCRE2_SUBSTITUTE_MATCHED, the subject data saved by the PCRE2_COPY_MATCHED_SUBJECT call will be used, In addition, the subject to pcre2_substitute can be NULL. Moreover, the length can also be PCRE2_ZERO_TERMINATE, and the original length will also be used.
This is usefull for testing invalid UTF with the substitute_subject and replace modifiers. (The normal subject, i.e. part before the \= must still be valid)
Specficially, If the regex is in PCRE2_UTF mode, and pcre2_match didn't use PCRE2_COPY_MATCHED_SUBJECT, and pcre2_substitute is using PCRE2_SUBSTITUTE_MATCHED, the following checks are done: * if the regex was not compiled with PCRE2_NO_UTF_CHECK or PCRE2_MATCH_INVALID_UTF: the subject is checked for UTF validity, returning a PCRE2_ERROR_UTF* error code on failure * each capture capture group is checked to not be in an innapropriate UTF boundary returning the new error code PCRE2_ERROR_BADUTFCAPTURE on failure * if the regex was not compiled with PCRE2_NO_UTF_CHECK, but it was compiled with PCRE2_MATCH_INVALID_UTF each capture group is checked entirely for UTF validty, returning the new error code PCRE2_ERROR_BADUTFCAPTURE on failure in all failure cases above, the offset of the problematic code unit is stored in outputlengthptr TODO: test
8410fcc
to
1fa6cdf
Compare
I've pushed an update, with the following changes:
I have not added UTF-16 and UTF-32 tests yet to the last commit as I haven't worked out how to get pcre2test to pass an invalid UTF-16/32 I have also noticed a couple of potential bugs in other stuff:
With
(this is the behaviour at commit ddb0df4, the version of master I've based my changes on). |
@NWilson, I see in #807 you're merging in my changes to But to solve my UTF-16 and UTF-32 test case problem I mentioned above, I would like to change the |
Great, thank you.
That would be potentially useful, good. For the UTF checks - I've looked at what you've done, and I think they can be simplified quite a bit. Instead of using def SLICES_CODEPOINT(i) = i < len && !ISLEADBYTE(subject[i])
for (i in range(0, ovector_length)) {
if (SLICES_CODEPOINT(ovec[i]))
return error; |
I'm not sure specifically what tests you are suggesting I change, I have several of them for different cases as some form of invalid-UTF is sometimes possible due to You're check is certainly not sufficient as it doesn't prevent the capture group from ending too early (before a UTF-8 sequence is complete), or from containing total junk inside it. |
Checking that the start and end of the capture group don't slice a codepoint is equivalent to checking that each offset points to a lead code unit (or the end-of-string pointer). As for whether it contains junk - that's up to the caller. pcre2_substitute() is a string interpolation function, that relies on pcre2_match being called either externally or internally to slice the subject string. If a user does use pcre2_match themselves, but then mutates the buffer ... it's completely the user's responsibility to ensure that they are putting valid UTF-8 in the buffer. To be honest, it's not clear that we need to do anything at all here to check. We're already being extremely pedantic to verify that the ovector points to valid offsets in the subject (given that we're already verified that the subject buffer is the exact same pointer and length). Re-checking the subject buffer for UTF-8 validity just seems a little excessive. |
In your Rust wrapper - you wanted the safety guarantees that:
But scenario you're mentioning here goes beyond that:
Basically, PCRE2 only validates for UTF-8 when it must (because internally we have to parse and decode). But pcre2_substitute doesn't do any parsing or interpretation of the subject string, it just uses the ovector to slice the subject up and paste it into the output buffer. So I'd prefer not to insert an unnecessary validation pass through the entire subject string, for a vanishingly rare corner case. |
This is what I wanted: no undefined behaviour if I pass it a valid byte string, even if it's different from the call
This wasn't something I was doing (as I wasn't using
Right. Because I was suggesting
The extra check would also work for the following use case in Rust.
Now that 3rd step will require either a UTF validity check, or |
Woops, I misread your code as only checking the start of each capture group, but it's actually checking every element of the ovector which includes the start and end. |
If we only want to do the minimal checks necessary to support the Rust if (match_data->rc > PCRE2_ERROR_NOMATCH && utf && (match_data->flags & PCRE2_MD_COPIED_SUBJECT) == 0) &&
/* Don't do any checks if \C was used */
(code->flags & PCRE2_HASBKC) == 0))
{
for (int i = 0; i < 2*pairs; i++)
{
PCRE2_SIZE offset = match_data->ovector[i];
if (offset == PCRE2_UNSET || offset >= length) continue;
if (NOT_FIRSTCU(*offset))
{
*blength = offset; /* So the caller knows where the error occurred */
rc = PCRE2_ERROR_BADUTFCAPTURE;
goto EXIT;
}
}
} (And delete the now unnecessary I'll have to update the test cases of course though, |
I can guess where this comes from. I presume the bumpalong (which tries matching at each character position in turn) doesn't see" that the pattern starts with a \C, and that it should therefore attempt a match at each byte position. In my opinion, the \C behaviour is just ... messy and I want to re-do it at some point in the future.
Interesting. I would expect it to report a match. I guess the MATCH_INVALID_UTF support has forgotten about the case of empty matches at the end. Similarly to \C, I want to totally re-do this flag at some point. |
@IsaacOscar I have taken all your commits except for:
Would you like to do a rebase on this PR? Or would you like me to cherry-pick myself... with the risk that I rework it a bit? We're basically down to the last commit, and I'm looking to keep it as simple as possible, to sanity-check the subject string against the ovector in a strategic (limited) way. So hopefully, this final bit of work to merge can be kept fairly small. We may not even need any support for invalid UTF in |
I that case I'll rebase to master and do point 1 and 3, and drop the commit supporting invalid UTF. |
@IsaacOscar I started a branch the other day to take your final commit, with the UTF-8 validation. However, I just felt uneasy about it, especially for the release I'm planning this week or next. I'm struggling to provide precise guarantees for the interpreter & JIT on exactly how much UTF-8 validation is done, and the exact conditions that could cause ovector offsets not to point to valid start characters. There are several uses of BACKCHAR in the matching code which worry me a little. I have added some assertions to explore, but I don't want to commit those without more certainty. I think for now, I'm happy with the improvements you've done that are already merged, and at some point down the line, I'd be ready to revisit whether it's appropriate to add any UTF validation on the subject in pcre2_substitute(). Is that reasonable for you? |
This makes some of the changes I suggested in #769.
You should be able to review the changes now, all of which are detailed in the commit messages.
However, I have not done the following:
match-data->rc
field (changed by 1796fb2, however this changes is relied upon by test cases in later commits, and is not user-visible).pcre2_match
andpcre2_substitute
are compatible.testdata/testinputNEW
andtestdata/testoutputNEW
file, as I wasn't sure which test file to actually put them in).Also the commit 6e3a6a5 that adds a
-c
option to pcre2test is not really related, it's just something I added to help me debug stuff, so I'm happy to remove it or put it in a different pull request if you want.