Be less strict in the documentation for using PCRE2_SUBSTITUTE_MATCHED


I was reading the documentation of `pcre2_substitute` in [pcre2api](https://pcre2project.github.io/pcre2/doc/pcre2api).

The relevant parts are (on both the website and my system's man pages, which are for v10.45):

> One such option is `PCRE2_SUBSTITUTE_MATCHED`. When this is set, an external match_data block must be provided, and it must have already been used for an external call to `pcre2_match()` with the same pattern and subject arguments. 

> The contents of the externally supplied match data block are not changed when `PCRE2_SUBSTITUTE_MATCHED` is set.

This is a bit limiting for several reasons:
1. It requires that the *subject* arguments (I assume the `PCRE2_SPTR subject`, `PCRE2_SIZE length`, and `PCRE2_SIZE startoffset`) be the same as in the preceding call to `pcre2_match`. I am suggesting that this be relaxed to require only that all the ranges in the `ovector` are within the bounds of the `subject`; I don't think there needs to be any additional requirement on `startoffset` (other than it being within the bounds of the subject, of course).

2. It isn't clear whether you can call `pcre2_substitute` with `PCRE2_SUBSTITUTE_MATCHED` multiple times with the same match data. The second paragraph quoted above suggests that this is OK. I suggest stating so explicitly.
3. Similarly, it's not clear whether a call to `pcre2_substitute` *without* `PCRE2_SUBSTITUTE_MATCHED` counts as a call to `pcre2_match`. I also suggest allowing this.
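
To make the relaxed precondition in suggestion 1 concrete, here is a minimal sketch of the check I have in mind. `ovector_fits_subject` and `OFFSET_UNSET` are hypothetical names of mine, not part of PCRE2; the block deliberately avoids `pcre2.h` so it compiles on its own, and it assumes unset groups are marked the way `PCRE2_UNSET` marks them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors PCRE2_UNSET, which pcre2.h defines as ~(PCRE2_SIZE)0. */
#define OFFSET_UNSET (~(size_t)0)

/* Hypothetical helper, not part of PCRE2: returns true when startoffset
   and every set (start, end) pair of the ovector lie inside a subject
   that is subject_len code units long. */
bool ovector_fits_subject(const size_t *ovector, uint32_t pair_count,
                          size_t subject_len, size_t startoffset)
{
    assert(ovector != NULL);
    if (startoffset > subject_len)
        return false;
    for (uint32_t i = 0; i < pair_count; i++) {
        size_t start = ovector[2 * i];
        size_t end = ovector[2 * i + 1];
        /* Groups that did not take part in the match are left unset. */
        if (start == OFFSET_UNSET && end == OFFSET_UNSET)
            continue;
        if (start > subject_len || end > subject_len)
            return false;
    }
    return true;
}
```

With real match data, the pointer would come from `pcre2_get_ovector_pointer()` and the number of filled pairs from the return value of `pcre2_match()`. For the example further down, the offsets `{3, 10, 3, 4, 5, 9}` fit both `hello world` (length 11) and `totally different string` (length 24), but not a 9-character subject.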

I looked at the [source code](https://github.com/PCRE2Project/pcre2/blob/pcre2-10.45/src/pcre2_substitute.c) for my version of PCRE2 (the latest stable, 10.45). It seems that the changes suggested above are entirely consistent with the current implementation. In other words, I am asking only that the documentation be updated so that one can rely on this behaviour in future releases.

My primary motivation for the above is a bit strange: I'm trying to make a safe Rust wrapper over `pcre2_substitute`:
1. I want to be able to check the result of a match before deciding to do a substitute (i.e. something like `pcre2_match(....); if some_condition(match_data) { pcre2_substitute(...) }`).
2. I don't want to pay the performance cost of having to redo the `pcre2_match` every time I want to do a substitute.
3. I don't want to pay the performance cost of unconditionally doing a `pcre2_substitute`, and then delete the substitution result (which will need to be dynamically allocated) if I decide not to do the substitution.
4. I want to minimise unsafe code: I don't want higher level code to get into undefined behaviour because it called `pcre2_substitute` with a different subject string.
5. I don't want to have to dynamically allocate a copy of the subject string and keep it around to prevent the above undefined behaviour.
6. I want to be able to reuse match data blocks for multiple calls to `pcre2_match` and `pcre2_substitute`, thus saving memory and allocation time. I can use the Rust type system here to ensure that the match data is not used between the call to `pcre2_match` and `pcre2_substitute`.
7. Trying to store a reference to the subject string so I can safely ensure the same one is passed to `pcre2_substitute` is a nightmare to do in generic Rust code, as it needs to keep track of lifetimes (whereas the case with match data above is easy, as it doesn't "borrow" anything).
8. Modifying the code of `pcre2_substitute` so it uses the subject stored in the match data won't help me, as getting Rust's type system to ensure that the subject hasn't been freed or mutated between the two calls is particularly difficult.

So basically, if you do change #1 above, I can just store the length of the subject string with the match data, and require the user to pass the subject string when doing a substitute. I then just throw an error if the new subject is shorter than the original one. (I can similarly easily store the start offset).
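
The bookkeeping I'm describing is small. Sketched in C rather than Rust to stay close to the API, and with every name here (`match_record`, `record_match`, `check_substitute`, the `SUBST_*` values) being hypothetical and of my own invention, it would look roughly like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record kept next to a match data block; not part of PCRE2. */
typedef struct {
    size_t subject_len;  /* length passed to pcre2_match() */
    size_t start_offset; /* startoffset passed to pcre2_match() */
    bool has_match;      /* true once a successful match was recorded */
} match_record;

typedef enum {
    SUBST_OK,
    SUBST_NO_MATCH,
    SUBST_SUBJECT_TOO_SHORT
} subst_check;

/* Remember the arguments of a successful match. */
void record_match(match_record *rec, size_t subject_len, size_t start_offset)
{
    assert(rec != NULL);
    rec->subject_len = subject_len;
    rec->start_offset = start_offset;
    rec->has_match = true;
}

/* Decide whether a later substitute with PCRE2_SUBSTITUTE_MATCHED may go
   ahead with a possibly different subject. A subject at least as long as
   the original one keeps every recorded offset in bounds. */
subst_check check_substitute(const match_record *rec, size_t new_subject_len)
{
    assert(rec != NULL);
    if (!rec->has_match)
        return SUBST_NO_MATCH;
    if (new_subject_len < rec->subject_len)
        return SUBST_SUBJECT_TOO_SHORT;
    return SUBST_OK;
}
```

Comparing lengths is stricter than checking each `ovector` range individually, but it needs nothing beyond two stored integers, and the stored `start_offset` can simply be passed back to `pcre2_substitute`.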

Other use cases of course are:
1. Doing some modification on the subject string between the calls to `pcre2_match` and `pcre2_substitute`, e.g. uppercasing it.
2. Generating different replacement strings on the same subject (this requires points 1. and 2. above), e.g. you could use substitute once to title-case a string, and another to lowercase it.

I've also [attached](https://github.com/user-attachments/files/21676320/PCRE2_SUBSTITUTE_MATCHED.c.txt) a simple test program that demonstrates that suggestions 1. and 2. currently work as expected (I also ran it through `valgrind` and there were no errors, despite my having freed the original subject string I used when calling `pcre2_match`).
Basically, my program does:
1. A search with regex `(l*)o(.*)l` on the string `hello world`, starting at position 3.
2. A substitution on the same string and starting offset with replacement `` \n\t$_\n\t$`|$&|$'\n\t($1)\n\t($2)\n ``.
3. I print the result of the above, yielding:
```
hel
        hello world
        hel|lo worl|d
        (l)
        ( wor)
d
```
I.e. the first line is the text before the match (which is not replaced), the second line is the entire input string, the third is the text preceding the match, a `|`, the match, a `|`, and the text following it. The next two lines are the two captured groups in `(...)`, and the final line is the text after the match (which is not replaced).
4. Free the original subject string, and delete the pointer to it in the match data (so I can be sure `pcre2_substitute` won't access it in the next step).
5. Do step #2 above with the same replacement string, but subject `totally different string` and start offset 1.
6. Print the result, which yields:
```
tot
        totally different string
        tot|ally di|fferent string
        (a)
        (ly d)
fferent string
```
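
Both outputs are what you get from plain offset arithmetic: the match recorded offsets 3..10 for the whole match, 3..4 for group 1, and 5..9 for group 2, and those same offsets are then sliced out of whichever subject is supplied. The following PCRE2-free sketch reproduces this; `apply_offsets` is a hypothetical function of mine that hard-codes the layout of the replacement string above, and it is an illustration of the arithmetic only, not a description of how `pcre2_substitute` is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical illustration, not PCRE2 code: slices `subject` using three
   (start, end) offset pairs and writes prefix + expansion + suffix to `out`.
   Returns the number of characters written, or -1 if an offset is out of
   bounds, a pair is reversed, or the buffer is too small. */
int apply_offsets(const char *subject, const size_t ov[6],
                  char *out, size_t out_size)
{
    assert(subject != NULL && ov != NULL && out != NULL);
    size_t len = strlen(subject);
    for (int i = 0; i < 6; i += 2) {
        if (ov[i] > ov[i + 1] || ov[i + 1] > len)
            return -1;
    }
    int n = snprintf(out, out_size,
        "%.*s\n\t%s\n\t%.*s|%.*s|%s\n\t(%.*s)\n\t(%.*s)\n%s",
        (int)ov[0], subject,                   /* text before the match */
        subject,                               /* $_ : whole subject */
        (int)ov[0], subject,                   /* $` : text before the match */
        (int)(ov[1] - ov[0]), subject + ov[0], /* $& : the match itself */
        subject + ov[1],                       /* $' : text after the match */
        (int)(ov[3] - ov[2]), subject + ov[2], /* $1 */
        (int)(ov[5] - ov[4]), subject + ov[4], /* $2 */
        subject + ov[1]);                      /* text after the match */
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    return n;
}
```

Calling it with `{3, 10, 3, 4, 5, 9}` on `hello world` and then on `totally different string` gives the two outputs shown above, which is why I think the only real requirement on the new subject is that the recorded ranges stay in bounds.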

Note that if the start offset given to `pcre2_substitute` is *larger* than the start of the first match recorded by `pcre2_match`, `pcre2_substitute` returns `PCRE2_ERROR_BADSUBSPATTERN` (i.e. `match with end before start or start moved backwards is not supported`). However, using any other value for the start offset doesn't affect the result, so the error seems pointless: the start offset is only relevant when `pcre2_substitute` internally calls `pcre2_match`, but it doesn't do that when `PCRE2_SUBSTITUTE_MATCHED` is set and `PCRE2_SUBSTITUTE_GLOBAL` is not.
