Description
I was reading the documentation of `pcre2_substitute` in `pcre2api`. The relevant parts (on both the website and my system's man pages, which are for v10.45) are:
> One such option is `PCRE2_SUBSTITUTE_MATCHED`. When this is set, an external match_data block must be provided, and it must have already been used for an external call to `pcre2_match()` with the same pattern and subject arguments.
>
> The contents of the externally supplied match data block are not changed when `PCRE2_SUBSTITUTE_MATCHED` is set.
This is a bit limiting for several reasons:

1. It requires that the subject arguments (I assume the `PCRE2_SPTR subject`, `PCRE2_SIZE length`, and `PCRE2_SIZE startoffset`) be the same as in the preceding call to `pcre2_match`. I am suggesting that this be relaxed to only require that all the ranges in the `ovector` are within the bounds of the `subject`; I don't think there needs to be any additional requirement on `startoffset` (other than it being within the bounds of the subject, of course).
2. It isn't clear whether you can call `pcre2_substitute` with `PCRE2_SUBSTITUTE_MATCHED` multiple times with the same match data. The second paragraph quoted above suggests that this is OK. I suggest clarifying that it is.
3. Similarly, it's not clear whether a call to `pcre2_substitute` without `PCRE2_SUBSTITUTE_MATCHED` counts as a call to `pcre2_match`. I also suggest allowing this.
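The relaxed precondition proposed in suggestion 1 could be checked roughly as follows. This is only a sketch: `ovector_fits_subject` and `UNSET_OFFSET` are hypothetical names that are not part of PCRE2, and the code assumes the ovector is available as an array of `size_t` pairs (as returned by `pcre2_get_ovector_pointer`), with unset groups marked the way `PCRE2_UNSET` marks them.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for PCRE2_UNSET, which PCRE2 defines as ~(PCRE2_SIZE)0 and stores
   in the ovector for groups that did not take part in the match. */
#define UNSET_OFFSET (~(size_t)0)

/* Hypothetical helper, not part of PCRE2: returns true if every set offset in
   the first pair_count (start, end) pairs of the ovector lies inside a
   subject of subject_len code units. */
bool ovector_fits_subject(const size_t *ovector, size_t pair_count,
                          size_t subject_len)
{
    for (size_t i = 0; i < pair_count; i++) {
        size_t start = ovector[2 * i];
        size_t end = ovector[2 * i + 1];
        if (start == UNSET_OFFSET && end == UNSET_OFFSET)
            continue; /* unset group: nothing to check */
        if (start > subject_len || end > subject_len)
            return false; /* range would read past the end of the subject */
    }
    return true;
}
```

With a check like this, any subject at least as long as the largest recorded offset would be acceptable, regardless of whether it is the string originally passed to `pcre2_match`.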
I looked at the source code for my version of PCRE2 (the latest stable, 10.45). It seems that the above suggested changes are entirely consistent with the current implementation. I.e. I am asking only that the documentation be updated so that one can rely on this behaviour in future releases.
My primary motivation for the above is a bit strange: I'm trying to make a safe Rust wrapper over `pcre2_substitute`:

- I want to be able to check the result of a match before deciding to do a substitute (i.e. something like `pcre2_match(....); if some_condition(match_data) { pcre2_substitute(...) }`).
- I don't want to pay the performance cost of having to redo the `pcre2_match` every time I want to do a substitute.
- I don't want to pay the performance cost of unconditionally doing a `pcre2_substitute`, and then deleting the substitution result (which will need to be dynamically allocated) if I decide not to do the substitution.
- I want to minimise unsafe code: I don't want higher-level code to get into undefined behaviour because it called `pcre2_substitute` with a different subject string.
- I don't want to have to dynamically allocate a copy of the subject string and keep it around to prevent the above undefined behaviour.
- I want to be able to reuse match data blocks for multiple calls to `pcre2_match` and `pcre2_substitute`, thus saving memory and allocation time. I can use the Rust type system here to ensure that the match data is not used between the call to `pcre2_match` and `pcre2_substitute`.
- Trying to store a reference to the subject string so I can safely ensure the same one is passed to `pcre2_substitute` is a nightmare to do in generic Rust code, as it needs to keep track of lifetimes (whereas the case with match data above is easy, as it doesn't "borrow" anything).
- Modifying the code of `pcre2_substitute` so it uses the subject stored in the match data won't help me, as getting Rust's type system to ensure that the subject hasn't been freed or mutated between the two calls is particularly difficult.
So basically, if you do change #1 above, I can just store the length of the subject string with the match data, and require the user to pass the subject string when doing a substitute. I then just throw an error if the new subject is shorter than the original one. (I can similarly easily store the start offset).
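The bookkeeping described above might look something like this, written in C for consistency with the PCRE2 API rather than in Rust. The `match_record` struct and both functions are hypothetical illustrations, not part of PCRE2 or of any existing wrapper. The length check is sufficient under the relaxed rule because every offset recorded by the original match is at most the original subject length.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping a wrapper could keep next to a match data block.
   Neither the struct nor the functions below are part of PCRE2. */
typedef struct {
    size_t subject_len;  /* length passed to the original pcre2_match call */
    size_t start_offset; /* start offset passed to that call */
} match_record;

/* Remember the subject length and start offset used for the match. */
match_record record_match(size_t subject_len, size_t start_offset)
{
    match_record r = { subject_len, start_offset };
    return r;
}

/* Returns true if a later substitute call on a subject of new_len code units
   may reuse the recorded match: the new subject must not be shorter than the
   original one, so every recorded offset is still in bounds. */
bool may_substitute(const match_record *r, size_t new_len)
{
    return new_len >= r->subject_len;
}
```

A wrapper would call `record_match` right after a successful `pcre2_match` and refuse the substitute (returning an error) whenever `may_substitute` is false.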
Other use cases of course are:

- Doing some modification on the subject string between the calls to `pcre2_match` and `pcre2_substitute`, e.g. uppercasing it.
- Generating different replacement strings on the same subject (this requires points 1. and 2. above), e.g. you could use substitute once to title-case a string, and another time to lowercase it.
I've also attached a simple test program that demonstrates that suggestions 1. and 2. currently work as expected (I also ran it through `valgrind` and there were no errors, despite me having deleted the original subject string I used when calling `pcre2_match`).
Basically, my program does:

1. A search with regex `(l*)o(.*)l` on the string `hello world`, starting at position 3.
2. A substitution on the same string and starting offset with replacement ``\n\t$_\n\t$`|$&|$'\n\t($1)\n\t($2)\n``.
3. I print the result of the above, yielding:

       hel
       hello world
       hel|lo worl|d
       (l)
       ( wor)
       d

   I.e. the first line is the text before the match (which is not replaced), the second line is the entire input string, the third is the text preceding the match, a `|`, the match, a `|`, and the text following it. The next two lines are the two captured groups in `(...)`, and the final line is the text after the match (which is not replaced).
4. Free the original subject string, and delete the pointer to it in the match data (so I can be sure `pcre2_substitute` won't access it in the next step).
5. Do step #2 above with the same replacement string, but with subject `totally different string` and start offset 1.
6. Print the result, which yields:

       tot
       totally different string
       tot|ally di|fferent string
       (a)
       (ly d)
       fferent string
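To see why the second output looks the way it does, the offsets recorded by the first match can be applied by hand to both subjects. The sketch below does not call PCRE2; `expand_around_match` is a hypothetical helper that only mimics the ``$`|$&|$'`` part of the replacement, and it assumes the recorded match spans offsets 3 to 10, which is what the printed output for `hello world` implies.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative only: mimics the "$`|$&|$'" part of the replacement by slicing
   the subject at fixed offsets. It does not call PCRE2. The caller must pass
   match_start <= match_end <= strlen(subject) and an out buffer of at least
   strlen(subject) + 3 bytes. */
void expand_around_match(const char *subject, size_t match_start,
                         size_t match_end, char *out, size_t out_size)
{
    size_t len = strlen(subject);
    snprintf(out, out_size, "%.*s|%.*s|%.*s",
             (int)match_start, subject,                      /* text before */
             (int)(match_end - match_start), subject + match_start, /* match */
             (int)(len - match_end), subject + match_end);   /* text after */
}
```

Called with offsets 3 and 10, this gives `hel|lo worl|d` for `hello world` and `tot|ally di|fferent string` for `totally different string`, matching the third line of each output above.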
Note that if the start offset given to `pcre2_substitute` is larger than the start of the first match recorded by `pcre2_match`, `pcre2_substitute` returns `PCRE2_ERROR_BADSUBSPATTERN` (i.e. `match with end before start or start moved backwards is not supported`). However, using any other value for the start offset doesn't affect the result, so the error seems pointless: the start offset is only relevant when `pcre2_substitute` internally calls `pcre2_match`, but it doesn't do that when `PCRE2_SUBSTITUTE_MATCHED` is set and `PCRE2_SUBSTITUTE_GLOBAL` is not.