Be less strict in the documentation for using PCRE2_SUBSTITUTE_MATCHED #769

@IsaacOscar

I was reading the documentation of pcre2_substitute in pcre2api.

The relevant parts are (on both the website and my system's man pages, which are v10.45):

One such option is PCRE2_SUBSTITUTE_MATCHED. When this is set, an external match_data block must be provided, and it must have already been used for an external call to pcre2_match() with the same pattern and subject arguments.

The contents of the externally supplied match data block are not changed when PCRE2_SUBSTITUTE_MATCHED is set.

This is a bit limiting for several reasons:

  1. It requires that the subject arguments (I assume the PCRE2_SPTR subject, PCRE2_SIZE length, and PCRE2_SIZE startoffset) be the same as in the preceding call to pcre2_match. I am suggesting that this be relaxed to only require that all the ranges in the ovector are within the bounds of the subject; I don't think there needs to be any additional requirement on startoffset (other than it being within the bounds of the subject, of course).

  2. It isn't clear whether you can call pcre2_substitute with PCRE2_SUBSTITUTE_MATCHED multiple times with the same match data. The second paragraph quoted above suggests that this is ok. I suggest clarifying that this is ok.

  3. Similarly, it's not clear if a call to pcre2_substitute without PCRE2_SUBSTITUTE_MATCHED counts as a call to pcre2_match. I also suggest allowing this.
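
To make suggestion 1 concrete, here is a minimal sketch of the relaxed precondition, written in plain Rust without any FFI. The function name `ovector_fits_subject` and the `UNSET` constant are my own illustration (not part of PCRE2); `UNSET` assumes PCRE2_SIZE is the same width as `usize` and mirrors PCRE2_UNSET, which the C headers define as ~(PCRE2_SIZE)0.

```rust
/// Assumed stand-in for PCRE2_UNSET, i.e. ~(PCRE2_SIZE)0 in the C headers.
const UNSET: usize = usize::MAX;

/// Hypothetical check for the relaxed rule: every set (start, end) pair in
/// `ovector` must lie within a subject that is `subject_len` code units long.
fn ovector_fits_subject(ovector: &[usize], subject_len: usize) -> bool {
    ovector.chunks_exact(2).all(|pair| {
        let (start, end) = (pair[0], pair[1]);
        // A group that did not take part in the match has no range to check.
        if start == UNSET && end == UNSET {
            return true;
        }
        start <= subject_len && end <= subject_len
    })
}
```

With the offsets from the test program further down (3..10, 3..4, 5..9), this accepts both the 11-byte original subject and the 24-byte replacement subject, and rejects anything shorter than 10 bytes.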

I looked at the source code for my version of PCRE2 (the latest stable, 10.45). It seems that the above suggested changes are entirely consistent with the current implementation. I.e. I am asking only that the documentation be updated so that one can rely on this behaviour in future releases.

My primary motivation for the above is a bit strange: I'm trying to make a safe Rust wrapper over pcre2_substitute:

  1. I want to be able to check the result of a match before deciding to do a substitute (i.e. something like pcre2_match(....); if some_condition(match_data) { pcre2_substitute(...) }).
  2. I don't want to pay the performance cost of having to re-do the pcre2_match every time I want to do a substitute.
  3. I don't want to pay the performance cost of unconditionally doing a pcre2_substitute, and then delete the substitution result (which will need to be dynamically allocated) if I decide not to do the substitution.
  4. I want to minimise unsafe code: I don't want higher level code to get into undefined behaviour because it called pcre2_substitute with a different subject string.
  5. I don't want to have to dynamically allocate a copy of the subject string and keep it around to prevent the above undefined behaviour.
  6. I want to be able to reuse match data blocks for multiple calls to pcre2_match and pcre2_substitute, thus saving memory and allocation time. I can use the Rust type system here to ensure that the match data is not used between the call to pcre2_match and pcre2_substitute.
  7. Trying to store a reference to the subject string so I can safely ensure the same one is passed to pcre2_substitute is a nightmare to do in generic Rust code, as it needs to keep track of lifetimes (whereas the case with match data above is easy, as it doesn't "borrow" anything).
  8. Modifying the code of pcre2_substitute so it uses the subject stored in the match data won't help me, as getting Rust's type system to ensure that the subject hasn't been freed or mutated between the two calls is particularly difficult.

So basically, if you do change #1 above, I can just store the length of the subject string with the match data, and require the user to pass the subject string when doing a substitute. I then just throw an error if the new subject is shorter than the original one. (I can similarly easily store the start offset).
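
Roughly, the bookkeeping I have in mind looks like the sketch below. All the names here (`MatchRecord`, `SubjectTooShort`, `after_match`, `check_subject`) are hypothetical and only for illustration; the real wrapper would hold the match data block next to this record and make the actual pcre2_substitute call once the check passes.

```rust
/// Hypothetical record kept alongside a match data block after pcre2_match.
/// It stores plain numbers only, so it borrows nothing from the subject.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MatchRecord {
    subject_len: usize,
    start_offset: usize,
}

/// Hypothetical error returned when the new subject is too short.
#[derive(Debug, PartialEq, Eq)]
struct SubjectTooShort {
    expected_at_least: usize,
    actual: usize,
}

impl MatchRecord {
    /// Remember the length and start offset used for the match.
    fn after_match(subject: &[u8], start_offset: usize) -> Self {
        MatchRecord { subject_len: subject.len(), start_offset }
    }

    /// Accept any subject at least as long as the original one, and hand
    /// back the stored start offset to pass on to the substitute call.
    fn check_subject(&self, new_subject: &[u8]) -> Result<usize, SubjectTooShort> {
        if new_subject.len() < self.subject_len {
            return Err(SubjectTooShort {
                expected_at_least: self.subject_len,
                actual: new_subject.len(),
            });
        }
        Ok(self.start_offset)
    }
}
```

Comparing lengths is deliberately more conservative than checking each ovector range: a subject that is at least as long as the original can never put a recorded offset out of bounds, and the check needs no access to the match data itself.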

Other use cases of course are:

  1. Doing some modification on the subject string between the calls to pcre2_match and pcre2_substitute, e.g. uppercasing it.
  2. Generating different replacement strings on the same subject (this requires points 1. and 2. above), e.g. you could use substitute once to title-case a string, and another to lowercase it.

I've also attached a simple test program that demonstrates that suggestions 1. and 2. currently work as expected (I also ran it through valgrind and there were no errors, despite me having deleted the original subject string I used when calling pcre2_match).
Basically, my program does:

  1. A search with regex (l*)o(.*)l on the string hello world, starting at position 3.
  2. A substitution on the same string and starting offset with the replacement string "\n\t$_\n\t$`|$&|$'\n\t($1)\n\t($2)\n".
  3. I print the result of the above, yielding:
hel
        hello world
        hel|lo worl|d
        (l)
        ( wor)
d

I.e. the first line is the text before the match (which is not replaced), the second line is the entire input string, and the third is the text preceding the match, a |, the match, a |, and the text following it. The next two lines are the two captured groups in (...), and the final line is the text after the match (which is not replaced).

  4. Free the original subject string, and delete the pointer to it in the match data (so I can be sure pcre2_substitute won't access it in the next step).
  5. Do step #2 above with the same replacement string, but with the subject totally different string and start offset 1.
  6. Print the result, which yields:

tot
        totally different string
        tot|ally di|fferent string
        (a)
        (ly d)
fferent string
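
Both outputs can be reproduced by hand from the offsets recorded by the first match (3..10 for the whole match, 3..4 and 5..9 for the two groups). The sketch below is a pure-Rust model of that, not PCRE2 itself: the function name is made up, the replacement template is hard-coded rather than parsed, and it assumes an ASCII subject so that byte offsets are valid string slice boundaries.

```rust
/// Hypothetical model of the substitution in the test program: expands the
/// replacement template using offsets from an earlier match, without
/// re-matching. Panics if an offset is out of range for `subject`.
fn substitute_with_stored_offsets(subject: &str, ovector: &[usize; 6]) -> String {
    let before = &subject[..ovector[0]]; // text preceding the match
    let whole = &subject[ovector[0]..ovector[1]]; // the match itself
    let after = &subject[ovector[1]..]; // text following the match
    let g1 = &subject[ovector[2]..ovector[3]]; // capture group 1
    let g2 = &subject[ovector[4]..ovector[5]]; // capture group 2

    // Same layout as the replacement string used in the test program.
    let replacement = format!(
        "\n\t{}\n\t{}|{}|{}\n\t({})\n\t({})\n",
        subject, before, whole, after, g1, g2
    );

    // Only the matched range is replaced; the rest is copied through.
    format!("{}{}{}", before, replacement, after)
}
```

Calling it with `[3, 10, 3, 4, 5, 9]` on hello world and then on totally different string gives the two outputs shown above, which is what I would expect if pcre2_substitute only reads the ovector and the subject it is handed.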

Note that if the start offset given to pcre2_substitute is larger than the start of the first match recorded by pcre2_match, pcre2_substitute returns PCRE2_ERROR_BADSUBSPATTERN (i.e. match with end before start or start moved backwards is not supported). However, using any other value for the start offset doesn't affect the result (so the error seems pointless: the start offset is only relevant when pcre2_substitute internally calls pcre2_match, but it doesn't do that when PCRE2_SUBSTITUTE_MATCHED is set and PCRE2_SUBSTITUTE_GLOBAL is not).
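
Until that is changed, a wrapper could test for this condition up front instead of waiting for the error. This is a sketch of the behaviour I observed on 10.45, not a documented rule; the function name is hypothetical.

```rust
/// Hypothetical pre-check mirroring the observed behaviour: the start offset
/// passed to pcre2_substitute must not be past the start of the recorded
/// match, which is the first entry of the ovector.
fn start_offset_accepted(start_offset: usize, ovector: &[usize]) -> bool {
    match ovector.first() {
        Some(&match_start) => start_offset <= match_start,
        // An empty ovector records no match, so there is nothing to substitute.
        None => false,
    }
}
```

In the test program the recorded match starts at 3, so offsets 3 and 1 are both accepted, while 4 would hit the error.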
