add (til) PEG special #1528

ianthehenry · 2024-12-04T06:00:43Z

This is... maybe not a real pull request; this is an idea with an implementation and some tests. But it's an idea that adds a bit of complexity to the PEG engine and is maybe not worth it.

I often want to write a PEG somewhere in between to and thru, usually while parsing something like key=value: capture everything up to =, skip over the =, and then match everything after. This is a little clumsy right now:

(* '(to "=") "=" '(to -1))

Because you have to repeat the separator (and even though this doesn't actually matter in any case where I've done this, it's a little sad that you evaluate the separator PEG one more time than is necessary).

So (til) is a special that makes this easy to write. It captures like (to), but advances like (thru).

And this is... weird. Nowhere else is there a PEG that captures and advances differently -- captures are defined by how they advance the input. Until now. So this PR adds complexity to PEG rules, as each PEG rule can basically return two things now: how far to advance, and how far to capture (if currently capturing).

I'm mostly hesitant about this because it's not 100% obvious how each other PEG special should interact with (til). I made judgment calls that I think are reasonable but I could see a case for e.g. either of these behaviors:

(check-deep ~''(til "=") "key=value" @["key" "key"])
(check-deep ~''(til "=") "key=value" @["key" "key="])

(I chose the latter because it's slightly simpler to implement and this will never come up in practice, but I feel slightly weird about it.)

I think this test summarizes (til) best:

(check-deep ~(* '(to "=") '(to -1)) "key=value" @["key" "=value"])
(check-deep ~(* '(til "=") '(til -1)) "key=value" @["key" "value"])
(check-deep ~(* '(thru "=") '(thru -1)) "key=value" @["key=" "value"])
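For comparison, the (to) and (thru) cases above, plus the current workaround from the top of this description, can be run with plain peg/match on a Janet build without this PR. This is a minimal sketch using only existing specials; the (til) line is left out because it needs this branch, and the binding names are made up for illustration.

```janet
# (to "=") stops before the separator, so the second capture starts at "=".
(def with-to (peg/match ~(* '(to "=") '(to -1)) "key=value"))
# => @["key" "=value"]

# (thru "=") consumes the separator, so the first capture includes it.
(def with-thru (peg/match ~(* '(thru "=") '(thru -1)) "key=value"))
# => @["key=" "value"]

# Current workaround: repeat the separator by hand between the two captures.
(def workaround (peg/match ~(* '(to "=") "=" '(to -1)) "key=value"))
# => @["key" "value"]
```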

src/core/peg.c

pepe · 2024-12-04T06:49:05Z

I have often encountered this pattern, so I can feel the pain here. Yet, I do not see it as a big problem. I do not understand all the changes in the PR correctly, but it feels like a lot of code to me :-D.

sogaiu · 2024-12-04T08:42:48Z

For the key=value example, it occurred to me to do this:

(peg/match ~(split "=" (capture (to -1))) 
           "key=value")
# =>
@["key" "value"]

This seems to avoid repeating "=" and is maybe less sad?

Maybe there are other examples for which this type of approach is not so good?

Update: Another idea:

(peg/match ~(sequence (capture (to "="))
                      1
                      (capture (to -1))) 
           "key=value")
# =>
@["key" "value"]

ianthehenry · 2024-12-04T15:34:11Z

I pushed a new version with a slightly cleaner implementation (replace NULL with &ignore_capture_to_out and remove the NULL checks).

@sogaiu my example was pretty simple where the alternatives you propose would work, but imagine something more complicated, e.g.:

(* '(to :s+) :s+ (number :d+))
"foo 123"
"foo   123"

I don't know of a way to do this in general (with an arbitrary PEG as the separator) without this rule repetition.
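For reference, this is what the repeated-separator version produces today on both inputs. It is a small sketch using only existing specials; the helper name parse-pair is made up for illustration.

```janet
# Hypothetical helper: :s+ appears twice, once inside (to ...) to find the
# separator and once more to step over it.
(defn parse-pair [s]
  (peg/match ~(* '(to :s+) :s+ (number :d+)) s))

(parse-pair "foo 123")   # => @["foo" 123]
(parse-pair "foo   123") # => @["foo" 123]
```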

This is also useful with sub, e.g.:

(* (sub (til "; ") (split ", " ':w+)) (number :d+))
"foo, bar, baz; 123"

Which is where I mostly wanted it in advent of code problems last year.

CFiggers · 2024-12-04T16:00:05Z

Zooming out on this one step further: wouldn't it be nice if the PEG module were extensible from within the language, so we could define new "combinators" like this one on a per-project basis without needing to modify the language as a whole.

That would allow for itches like this one to be solved on a per-project basis without needing to add baggage to every deployment of PEG across the entire Janet universe, and without requiring anyone to plumb around in the C to get a bespoke capture like this one working.

ianthehenry · 2024-12-04T16:05:16Z

Welllll you can already write helper combinators that use existing PEG machinery; this is (similarly to sub) an extension of what it means to be a PEG that is not expressible with the current machinery. I guess, although it would be nice to be able to extend PEG engines dynamically, I can’t picture what that would look like in practice for a change like this one.

sogaiu · 2024-12-05T03:41:13Z

I don't find the short forms readily comprehensible so apologies for "translating" below (I hope I didn't mess any up!).

For:

(* '(to :s+) :s+ (number :d+))
"foo 123"
"foo   123"

What came to mind for a peg to handle the two examples didn't involve an exact repetition of :s+:

(peg/match ~(sequence (capture (to :s)) 
                      :s+ 
                      (number :d+))
           "foo   123")
# =>
@["foo" 123]

Perhaps this particular case is a matter of how our perceptions happened to view things. (Just to be clear, I'm not trying to claim the peg I wrote above is better or anything.)

Regarding:

(* (sub (til "; ") (split ", " ':w+)) (number :d+))
"foo, bar, baz; 123"

Faced with this example, what surfaced here was:

(peg/match ~(sequence (sub (to ";")
                           (split ", " (capture :w+)))
                      1
                      :s+
                      (number :d+))
           "foo, bar, baz; 123")
# =>
@["foo" "bar" "baz" 123]

This is longer though more straight-forward to me (though I wrote it so not sure how much the latter point is worth).

I don't find this much of a length difference to be significant, but the comprehension angle is to me (particularly from the maintenance, investigation, and learn-from-other-people's-code perspectives).

I am not used to til (and I think I am on the slower side regarding picking things up), but I think it may also at least partly be a case of:

And this is... weird. Nowhere else is there a PEG that captures and advances differently -- captures are defined by how they advance the input.

ianthehenry · 2024-12-05T04:10:01Z

So I've been thinking about this, and while I like til, I don't think the complexity is worth it. I spent a long time thinking about look and how it should also, logically, capture a different amount than it advances. That's not currently possible to express with a single offset: you would need to return both "capture start" and "capture end", and that's even more complexity that's not clearly worth it.

So I have what I think is a better idea: a (til sep patt) form that behaves the way that (sub (til sep) patt) would in this feature. It loses symmetry with (to) and (thru), which is slightly sad, but it preserves the property that captures are always "motions" and I think that conceptual simplicity is more important. The only thing that's slightly more annoying is that something like '(til "=") becomes (til "=" '(to -1)) and traverses the substring twice, but that could be mitigated with an optimized helper for (to -1), which seems like a useful thing anyway to pair with split or sub. But I think that in all actual cases where I've wanted til, I've wanted to pair it with sub, so this form is both simpler to implement and more convenient to use in practice.

(til sep subpattern) is a specialized (sub) that behaves like (sub (to sep) subpattern), but advances over the input like (thru sep).
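One way to read that description is as an expansion over existing specials. The helper below, til-sketch (a made-up name, not part of Janet or of this PR), builds such a PEG, assuming a Janet version that already has sub. Unlike a native implementation it matches sep a second time, so it illustrates the intended semantics rather than the efficiency.

```janet
# Hypothetical helper: run patt on the text before the first sep, then step
# over sep itself. (drop ...) discards any captures made by sep.
(defn til-sketch [sep patt]
  ~(sequence (sub (to ,sep) ,patt) (drop ,sep)))

# patt only sees "abc", so -1 (end of input) matches inside the window.
(peg/match (til-sketch "d" ~(sequence "abc" -1)) "abcdef") # => @[]

# Fails when the separator never occurs.
(peg/match (til-sketch "x" "abc") "abcdef") # => nil

# Captures from sep are dropped; captures from patt are kept.
(peg/match (til-sketch ~(capture "d") ~(capture "abc")) "abcdef") # => @["abc"]

# Matching resumes right after the separator, so "e" is next.
(peg/match ~(sequence ,(til-sketch "d" "ab") "e") "abcdef") # => @[]
```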

sogaiu · 2024-12-05T12:27:34Z

@ianthehenry To aid in trying to digest the new version, please mention if this is different from what you mentioned in this discussion.

In general, I find the current proposed changes (especially to the C code) to be far less worrying so 👍 on that front.

If this proceeds further, I wonder if we could consider alternate names. I think to and thru are a bit close to til and they are now somewhat more different from the latest til idea.

For newcomers (and folks who need to "rediscover" later like yours truly), maybe something a bit more "distant" could work a bit better on the front of being less confusing. Perhaps skip could also be considered (sorry it's one letter longer)? Maybe others might have some other name ideas too...

For others who might not have checked, there are tests in 9529062 which I'll reproduce below (though pardon my translation -- it likely will help my comprehension and possibly make the combined content more accessible to a wider audience):

# basic matching
(peg/match ~(til "d" "abc")
           "abcdef")
# =>
@[]

# second pattern can't see past the first occurrence of first pattern
(peg/match ~(til "d"
                 (sequence "abc" -1))
           "abcdef")
# =>
@[]

# fails if first pattern fails
(peg/match ~(til "x" "abc")
           "abcdef")
# =>
nil

# fails if second pattern fails
(peg/match ~(til "abc" "x")
           "abcdef")
# =>
nil

# discards captures from initial pattern
(peg/match ~(til (capture "d")
                 (capture "abc"))
           "abcdef")
# =>
@["abc"]

# positions inside second match are still relative to the entire input
(peg/match ~(sequence "one\ntw"
                      (til 0
                           (sequence (position) (line) (column))))
           "one\ntwo\nthree\n")
# =>
@[6 2 3]

# advances to the end of the first pattern's first occurrence
(peg/match ~(sequence (til "d" "ab")
                      "e")
           "abcdef")
# =>
@[]

pyrmont · 2024-12-05T13:14:54Z

@sogaiu wrote:

If this proceeds further, I wonder if we could consider alternate names. I think to and thru are a bit close to til and they are now somewhat more different from the latest til idea.

For newcomers (and folks who need to "rediscover" later like yours truly), maybe something a bit more "distant" could work a bit better on the front of being less confusing. Perhaps skip could also be considered (sorry it's one letter longer)? Maybe others might have some other name ideas too...

I see the utility in having a way to write terse PEGs for this situation but I’m a bit wary of the name, too. To me to and til function as synonyms in the sense used here and it’s not obvious to me that one would go up to a pattern and stop and one would go up to a pattern and move over it.

For alternative names, I also like skip or alternatively pass or over.

sogaiu · 2024-12-05T14:08:36Z

pass seems good to me too.

Would you mind expanding a bit on over? What comes to mind immediately is something along the lines of "above" or "covering", but maybe I'm failing to fish out an appropriate meaning (^^;

ianthehenry · 2024-12-05T18:08:27Z

Oh lol I honestly forgot that I had implemented this before. Yeah, it’s the same proposal that I made a year ago. Apparently I already figured out to do it the “simple” way back then and had to independently rediscover this API after trying something more complicated. Baby brain is something else.

I’m not married to the name; I agree to and til are synonyms in English… I don’t really think that would cause confusion in practice (since to know either of them exist, you’re looking at the docs that explain their difference) but I am open to something more explicit.

I think skip isn’t intuitive to me (I think I’d expect behavior like drop); pass seems better. I’m kinda used to that meaning “do nothing” but in context that won’t be confusing. over is also pretty good. I would also throw in sub-til; it’s not quite sub-to but hints at the meaning. It’s longer but still much shorter than not using it. Maybe until also separates it farther from to to not create a mental association…

pyrmont · 2024-12-06T01:40:05Z

@sogaiu wrote:

Would you mind expanding a bit on over? What comes to mind immediately is something along the lines of "above" or "covering", but maybe I'm failing to fish out an appropriate meaning (^^;

I was thinking 'over' in the sense of 'step over' or 'pass over'. That is to say, to move past but to ignore.

@ianthehenry wrote:

I don’t really think that would cause confusion in practice (since to know either of them exist, you’re looking at the docs that explain their difference) but I am open to something more explicit.

I don't mean to belabour the point but just to explain things more clearly than I did in the original message: my perspective is that function names have two purposes in programs. The first is the identification (or referent) purpose (i.e. you're telling the machine, 'do the instruction referred to by the identifier X'). The second purpose is mnemonic (i.e. you're reminding the consumer, 'this does X'). It's for the latter reason that I don't like til (or variations). You're correct that to use these things (at first) you're going to need to look them up and so the name is somewhat arbitrary. But I'd be worried that something like til is ill-fitting for the mnemonic purpose and makes the combinator more cumbersome.

I think skip isn’t intuitive to me (I think I’d expect behavior like drop); pass seems better. I’m kinda used to that meaning “do nothing” but in context that won’t be confusing. over is also pretty good. I would also throw in sub-til; it’s not quite sub-to but hints at the meaning. It’s longer but still much shorter than not using it.

At the risk of excessive bikeshedding, if you want something shorter, there's also by (as in 'pass by'). Here are the simple examples using these alternatives:

# current
(peg/match ~(til "d" "abc") "abcdef") # => @[]

# alternatives (in alphabetical order)
(peg/match ~(by "d" "abc") "abcdef") # => @[]
(peg/match ~(over "d" "abc") "abcdef") # => @[]
(peg/match ~(pass "d" "abc") "abcdef") # => @[]
(peg/match ~(skip "d" "abc") "abcdef") # => @[]
(peg/match ~(sub-til "d" "abc") "abcdef") # => @[]
(peg/match ~(until "d" "abc") "abcdef") # => @[]

pyrmont · 2024-12-06T01:59:07Z

Oh, and I should have added I lean towards pass.

sogaiu · 2024-12-06T03:13:17Z

@pyrmont Thanks for that explicit listing using the different names. I share the same concern regarding the "mnemonic" angle.

@ianthehenry Thanks for the clarification regarding the version of til from the earlier gh discussion.

I can see how skip might not be intuitive.

ATM, I find pass to be the best balance between length and low confusion.

ianthehenry · 2024-12-06T03:59:05Z

Here’s a different tack: the operation, really, is split-one. I’ve been thinking of it as a variation of to, but really it’s easier to describe in terms of split. It’s like split, but it doesn’t keep going after it finds the first separator. (split x y) ≈≈ (some (split-one x y)). Not exactly given end-of-file handling but it’s decent intuition.

split-one or split-once are kind of mouthfuls though. OCaml calls this lsplit which is concise but cryptic (there is a corresponding rsplit that finds the last occurrence of the separator).

So I would still prefer a shorter name and want to consider sep. It has a nice lexicographic symmetry with sub. It has the problem that “split” and “sep” are kind of synonyms, but I think split is a common enough name for that operation that we won’t mix them up. It has a nice mnemonic quality: it separates the string into “before” and “after” parts.

sogaiu · 2024-12-06T06:31:09Z

Cast in the light of viewing things as a kind of "limited" split, I started to wonder about having an optional argument for split...

string/split's last argument (optional) is limit. Would a similar thing work for the split for pegs?

If so, perhaps there could be a short-hand name like til / pass / sep (or other name choice) that is split used with an appropriate value for the optional limit argument "underneath".

Kind of like how (opt patt) / (? patt) is (between 0 1 patt).

ianthehenry · 2024-12-08T05:10:54Z

That's a really interesting idea. For complete consistency with string/split, a limit would mean:

(peg/match ~(* (split :s+ ':w+) '(to -1) 1) "a b c d e")
# => ["a b c d e"] (nothing gets split)

(peg/match ~(* (split :s+ ':w+) '(to -1) 2) "a b c d e")
# => ["a" "b c d e"] (first occurrence gets split; first separator is nowhere to be seen)

But that seems unintuitive to me. I think the behavior of string/split's limit is... I think slightly unintuitive too, but it can be explained and understood in a simple way that doesn't apply to the PEG equivalent. I would propose breaking with the behavior of string/split and instead:

(peg/match ~(* (split :s+ ':w+) '(to -1) 1) "a b c d e")
# => ["a" "b c d e"] (1 means match one separator)

(peg/match ~(* (split :s+ ':w+) '(to -1) 2) "a b c d e")
# => ["a" "b" "c d e"] (2 means match 2 separators)

That is the behavior I'd expect. The inconsistency gives me a little pause but I think it makes sense in this case.

I have trouble imagining a case where this generality is useful, though -- I have never actually wanted a limited split beyond this "parse to the next delimiter" case. I think I would actually invert it, and say sep is the simple, primitive operation, and split is a special case of sep. You can exactly implement split out of sep -- even though you cannot implement sep out of the split we have today. And since we already have limited repetition, you can implement a "limited split" with (4 (sep "," ':w+)), without any ambiguity about how that limit behaves. And you can go further, using between or lenprefix, and that gives you more flexibility than an optional argument to split would.
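The "limited split by repetition" idea can be tried today with a stand-in for sep built from existing specials. In this sketch, sep-sketch is a hypothetical helper (sep is only a proposed name in this thread), it assumes a Janet version that has sub, and repeat is used in place of the bare-integer shorthand.

```janet
# Hypothetical stand-in for the proposed (sep x y): match y against the text
# before the first x, then step over x, discarding any captures x makes.
(defn sep-sketch [x y]
  ~(sequence (sub (to ,x) ,y) (drop ,x)))

# "Limit 2": split off two fields, then capture the remainder untouched.
(def limited
  (peg/match ~(sequence (repeat 2 ,(sep-sketch :s+ ~(capture :w+)))
                        (capture (to -1)))
             "a b c d e"))
# => @["a" "b" "c d e"]
```

Because the count lives in an ordinary repetition, there is no separate rule to learn about whether the limit counts pieces or separators.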

pyrmont · 2024-12-08T12:14:43Z

@ianthehenry Are the parens misplaced in your most recent examples? You talk about an optional integer argument to the split combinator, but the placement of your parens makes the integers an argument to * (i.e. sequence).

sogaiu · 2024-12-08T13:03:22Z

@ianthehenry

since we already have limited repetition, you can implement a "limited split" with (4 (sep "," ':w+)), without any ambiguity about how that limit behaves.

Ah, that sounds very nice! I haven't tested it out, but it seems plausible.

And you can go further, using between or lenprefix, and that gives you more flexibility than an optional argument to split would.

Also sounds good 👍

Regarding the name...

To repeat what was stated earlier:

it separates the string into “before” and “after” parts.

I like how the name itself (sep) kind of illustrates what it does -- i.e. it produces sep from a full word (either separate or separator) by sort of being the result of its own action. Perhaps that kind of story can help with memory.

Although, as was mentioned earlier, split and sep might be considered kind of synonymous, with the story above, plus the fact that the string sep doesn't appear to be widely used as an identifier for an action, maybe this works pretty well as a name that suffers less from confusion potential than the other alternatives.

As a historical note, I'm leaving a link to this PR which also mentions a sep. IIUC, that PR was very close in time to this discussion mentioned above.

github-advanced-security bot found potential problems Dec 4, 2024


ianthehenry force-pushed the til-peg-special branch from dfc3366 to 64b1d91 Compare December 4, 2024 15:29

ianthehenry marked this pull request as draft December 5, 2024 04:10

add (til) PEG special

9529062

(til sep subpattern) is a specialized (sub) that behaves like (sub (to sep) subpattern), but advances over the input like (thru sep).

ianthehenry force-pushed the til-peg-special branch from 64b1d91 to 9529062 Compare December 5, 2024 05:18
