add (til) PEG special #1528

Draft
wants to merge 1 commit into
base: master

Conversation

ianthehenry
Contributor

This is... maybe not a real pull request; this is an idea with an implementation and some tests. But it's an idea that adds a bit of complexity to the PEG engine and is maybe not worth it.

I often want to write a PEG somewhere in between to and thru, usually while parsing something like key=value: capture everything up to =, skip over the =, and then match everything after. This is a little clumsy right now:

(* '(to "=") "=" '(to -1))

Because you have to repeat the separator (and even though this doesn't actually matter in any case where I've done this, it's a little sad that you evaluate the separator PEG one more time than is necessary).
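
For anyone who wants to try the current workaround at a REPL, here is a minimal sketch using only specials that already exist. The names kv-peg and kv-result are just illustrative.

```janet
# The workaround as it works today: the "=" separator is written twice,
# once inside (to ...) and once to step over it.
(def kv-peg (peg/compile ~(* '(to "=") "=" '(to -1))))

(def kv-result (peg/match kv-peg "key=value"))
# kv-result => @["key" "value"]
```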

So (til) is a special that makes this easy to write. It captures like (to), but advances like (thru).

And this is... weird. Nowhere else is there a PEG that captures and advances differently -- captures are defined by how they advance the input. Until now. So this PR adds complexity to PEG rules, as each PEG rule can basically return two things now: how far to advance, and how far to capture (if currently capturing).

I'm mostly hesitant about this because it's not 100% obvious how each other PEG special should interact with (til). I made judgment calls that I think are reasonable but I could see a case for e.g. either of these behaviors:

(check-deep ~''(til "=") "key=value" @["key" "key"])
(check-deep ~''(til "=") "key=value" @["key" "key="])

(I chose the latter because it's slightly simpler to implement and this will never come up in practice, but I feel slightly weird about it.)

I think this test summarizes (til) best:

(check-deep ~(* '(to "=") '(to -1)) "key=value" @["key" "=value"])
(check-deep ~(* '(til "=") '(til -1)) "key=value" @["key" "value"])
(check-deep ~(* '(thru "=") '(thru -1)) "key=value" @["key=" "value"])

@pepe
Member

pepe commented Dec 4, 2024

I have often encountered this pattern, so I can feel the pain here. Yet, I do not see it as a big problem. I do not understand all the changes in the PR correctly, but it feels like a lot of code to me :-D.

@sogaiu
Contributor

sogaiu commented Dec 4, 2024

For the key=value example, it occurred to me to do this:

(peg/match ~(split "=" (capture (to -1))) 
           "key=value")
# =>
@["key" "value"]

This seems to avoid repeating "=" and is maybe less sad?

Maybe there are other examples for which this type of approach is not so good?


Update: Another idea:

(peg/match ~(sequence (capture (to "="))
                      1
                      (capture (to -1))) 
           "key=value")
# =>
@["key" "value"]

@ianthehenry
Contributor Author

I pushed a new version with a slightly cleaner implementation (replace NULL with &ignore_capture_to_out and remove the NULL checks).

@sogaiu my example was pretty simple where the alternatives you propose would work, but imagine something more complicated, e.g.:

(* '(to :s+) :s+ (number :d+))
"foo 123"
"foo   123"

I don't know of a way to do this in general (with an arbitrary PEG as the separator) without this rule repetition.
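
As a runnable sketch of that repetition (the name sep-twice is only for illustration), both inputs parse with today's specials as long as :s+ is written twice:

```janet
# The separator is an arbitrary pattern (:s+), so skipping a fixed number
# of bytes is not enough; the pattern itself appears a second time.
(def sep-twice (peg/compile ~(* '(to :s+) :s+ (number :d+))))

(def one-space (peg/match sep-twice "foo 123"))
(def three-spaces (peg/match sep-twice "foo   123"))
# both => @["foo" 123]
```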

This is also useful with sub, e.g.:

(* (sub (til "; ") (split ", " ':w+)) (number :d+))
"foo, bar, baz; 123"

Which is where I mostly wanted it in advent of code problems last year.

@CFiggers

CFiggers commented Dec 4, 2024

Zooming out on this one step further: wouldn't it be nice if the PEG module were extensible from within the language, so we could define new "combinators" like this one on a per-project basis without needing to modify the language as a whole.

That would allow for itches like this one to be solved on a per-project basis without needing to add baggage to every deployment of PEG across the entire Janet universe, and without requiring anyone to plumb around in the C to get a bespoke capture like this one working.

@ianthehenry
Contributor Author

Welllll, you can already write helper combinators that use existing PEG machinery; this is (similarly to sub) an extension of what it means to be a PEG, one that is not expressible with the current machinery. It would be nice to be able to extend PEG engines dynamically, but I can’t picture what that would look like in practice for a change like this one.

@sogaiu
Contributor

sogaiu commented Dec 5, 2024

I don't find the short forms readily comprehensible so apologies for "translating" below (I hope I didn't mess any up!).

For:

(* '(to :s+) :s+ (number :d+))
"foo 123"
"foo   123"

What came to mind for a peg to handle the two examples didn't involve an exact repetition of :s+:

(peg/match ~(sequence (capture (to :s)) 
                      :s+ 
                      (number :d+))
           "foo   123")
# =>
@["foo" 123]

Perhaps this particular case is a matter of how our perceptions happened to view things. (Just to be clear, I'm not trying to claim the peg I wrote above is better or anything.)

Regarding:

(* (sub (til "; ") (split ", " ':w+)) (number :d+))
"foo, bar, baz; 123"

Faced with this example, what surfaced here was:

(peg/match ~(sequence (sub (to ";")
                           (split ", " (capture :w+)))
                      1
                      :s+
                      (number :d+))
           "foo, bar, baz; 123")
# =>
@["foo" "bar" "baz" 123]

This is longer, though more straightforward to me (though I wrote it, so I'm not sure how much the latter point is worth).

I don't find this much of a length difference to be significant, but the comprehension angle is to me (particularly from the maintenance, investigation, and learn-from-other-people's-code perspectives).

I am not used to til (and I think I am on the slower side regarding picking things up), but I think it may also at least partly be a case of:

And this is... weird. Nowhere else is there a PEG that captures and advances differently -- captures are defined by how they advance the input.

@ianthehenry
Contributor Author

ianthehenry commented Dec 5, 2024

So I've been thinking about this, and while I like til, I don't think the complexity is worth it. I spent a long time thinking about look and how it should also, logically, capture a different amount than it advances. That is not currently possible to express with a single offset: you would need to return both a "capture start" and a "capture end", and that is even more complexity that's not clearly worth it.

So I have what I think is a better idea: a (til sep patt) form that behaves the way that (sub (til sep) patt) would in this feature. It loses symmetry with (to) and (thru), which is slightly sad, but it preserves the property that captures are always "motions" and I think that conceptual simplicity is more important. The only thing that's slightly more annoying is that something like '(til "=") becomes (til "=" '(to -1)) and traverses the substring twice, but that could be mitigated with an optimized helper for (to -1), which seems like a useful thing anyway to pair with split or sub. But I think that in all actual cases where I've wanted til, I've wanted to pair it with sub, so this form is both simpler to implement and more convenient to use in practice.
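
To make the intended behavior of the two-argument form concrete, here is a rough approximation built from existing specials. It assumes a Janet version that already has sub and split. The helper til-sketch is hypothetical and is not the PR's C implementation; it still evaluates the separator twice, which is the cost a built-in would avoid.

```janet
# Hypothetical helper (not part of Janet): match patt against everything
# before sep, then step over sep, discarding any captures sep produces.
(defn til-sketch [sep patt]
  ~(* (sub (to ,sep) ,patt) (drop ,sep)))

(def names '(split ", " (capture :w+)))
(def grammar ~(* ,(til-sketch "; " names) (number :d+)))

(def til-result (peg/match grammar "foo, bar, baz; 123"))
# til-result => @["foo" "bar" "baz" 123]
```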

@ianthehenry ianthehenry marked this pull request as draft December 5, 2024 04:10
(til sep subpattern) is a specialized (sub) that behaves like
(sub (to sep) subpattern), but advances over the input like (thru sep).
@sogaiu
Contributor

sogaiu commented Dec 5, 2024

@ianthehenry To aid in trying to digest the new version, please mention if this is different from what you mentioned in this discussion.


In general, I find the current proposed changes (especially to the C code) to be far less worrying so 👍 on that front.


If this proceeds further, I wonder if we could consider alternate names. I think to and thru are a bit close to til and they are now somewhat more different from the latest til idea.

For newcomers (and folks who need to "rediscover" later like yours truly), maybe something a bit more "distant" could work a bit better on the front of being less confusing. Perhaps skip could also be considered (sorry it's one letter longer)? Maybe others have some other name ideas too...


For others who might not have checked, there are tests in 9529062 which I'll reproduce below (though pardon my translation -- it likely will help my comprehension and possibly make the combined content more accessible to a wider audience):

# basic matching
(peg/match ~(til "d" "abc")
           "abcdef")
# =>
@[]

# second pattern can't see past the first occurrence of first pattern
(peg/match ~(til "d"
                 (sequence "abc" -1))
           "abcdef")
# =>
@[]

# fails if first pattern fails
(peg/match ~(til "x" "abc")
           "abcdef")
# =>
nil

# fails if second pattern fails
(peg/match ~(til "abc" "x")
           "abcdef")
# =>
nil

# discards captures from initial pattern
(peg/match ~(til (capture "d")
                 (capture "abc"))
           "abcdef")
# =>
@["abc"]

# positions inside second match are still relative to the entire input
(peg/match ~(sequence "one\ntw"
                      (til 0
                           (sequence (position) (line) (column))))
           "one\ntwo\nthree\n")
# =>
@[6 2 3]

# advances to the end of the first pattern's first occurrence
(peg/match ~(sequence (til "d" "ab")
                      "e")
           "abcdef")
# =>
@[]

@pyrmont
Contributor

pyrmont commented Dec 5, 2024

@sogaiu wrote:

If this proceeds further, I wonder if we could consider alternate names. I think to and thru are a bit close to til and they are now somewhat more different from the latest til idea.

For newcomers (and folks who need to "rediscover" later like yours truly), maybe something a bit more "distant" could work a bit better on the front of being less confusing. Perhaps skip could also be considered (sorry it's one letter longer)? Maybe others have some other name ideas too...

I see the utility in having a way to write terse PEGs for this situation but I’m a bit wary of the name, too. To me to and til function as synonyms in the sense used here and it’s not obvious to me that one would go up to a pattern and stop and one would go up to a pattern and move over it.

For alternative names, I also like skip or alternatively pass or over.

@sogaiu
Contributor

sogaiu commented Dec 5, 2024

pass seems good to me too.

Would you mind expanding a bit on over? What comes to mind immediately is something along the lines of "above" or "covering", but maybe I'm failing to fish out an appropriate meaning (^^;

@ianthehenry
Contributor Author

ianthehenry commented Dec 5, 2024

Oh lol I honestly forgot that I had implemented this before. Yeah, it’s the same proposal that I made a year ago. Apparently I already figured out to do it the “simple” way back then and had to independently rediscover this API after trying something more complicated. Baby brain is something else.

I’m not married to the name; I agree to and til are synonyms in English… I don’t really think that would cause confusion in practice (since to know either of them exist, you’re looking at the docs that explain their difference) but I am open to something more explicit.

I think skip isn’t intuitive to me (I think I’d expect behavior like drop); pass seems better. I’m kinda used to that meaning “do nothing” but in context that won’t be confusing. over is also pretty good. I would also throw in sub-til; it’s not quite sub-to but hints at the meaning. It’s longer but still much shorter than not using it. Maybe until also separates it farther from to to not create a mental association…

@pyrmont
Contributor

pyrmont commented Dec 6, 2024

@sogaiu wrote:

Would you mind expanding a bit on over? What comes to mind immediately is something along the lines of "above" or "covering", but maybe I'm failing to fish out an appropriate meaning (^^;

I was thinking 'over' in the sense of 'step over' or 'pass over'. That is to say, to move past but to ignore.

@ianthehenry wrote:

I don’t really think that would cause confusion in practice (since to know either of them exist, you’re looking at the docs that explain their difference) but I am open to something more explicit.

I don't mean to belabour the point but just to explain things more clearly than I did in the original message: my perspective is that function names have two purposes in programs. The first is the identification (or referent) purpose (i.e. you're telling the machine, 'do the instruction referred to by the identifier X'). The second purpose is mnemonic (i.e. you're reminding the consumer, 'this does X'). It's for the latter reason that I don't like til (or variations). You're correct that to use these things (at first) you're going to need to look them up and so the name is somewhat arbitrary. But I'd be worried that something like til is ill-fitting for the mnemonic purpose and makes the combinator more cumbersome.

I think skip isn’t intuitive to me (I think I’d expect behavior like drop); pass seems better. I’m kinda used to that meaning “do nothing” but in context that won’t be confusing. over is also pretty good. I would also throw in sub-til; it’s not quite sub-to but hints at the meaning. It’s longer but still much shorter than not using it.

At the risk of excessive bikeshedding, if you want something shorter, there's also by (as in 'pass by'). Here's the simple examples using these alternatives:

# current
(peg/match ~(til "d" "abc") "abcdef") # => @[]

# alternatives (in alphabetical order)
(peg/match ~(by "d" "abc") "abcdef") # => @[]
(peg/match ~(over "d" "abc") "abcdef") # => @[]
(peg/match ~(pass "d" "abc") "abcdef") # => @[]
(peg/match ~(skip "d" "abc") "abcdef") # => @[]
(peg/match ~(sub-til "d" "abc") "abcdef") # => @[]
(peg/match ~(until "d" "abc") "abcdef") # => @[]

@pyrmont
Contributor

pyrmont commented Dec 6, 2024

Oh, and I should have added I lean towards pass.

@sogaiu
Contributor

sogaiu commented Dec 6, 2024

@pyrmont Thanks for that explicit listing using the different names. I share the same concern regarding the "mnemonic" angle.

@ianthehenry Thanks for the clarification regarding the version of til from the earlier gh discussion.

I can see how skip might not be intuitive.

ATM, I find pass to be the best balance between length and low confusion.

@ianthehenry
Contributor Author

Here’s a different tack: the operation, really, is split-one. I’ve been thinking of it as a variation of to, but really it’s easier to describe in terms of split. It’s like split, but it doesn’t keep going after it finds the first separator. (split x y) ≈≈ (some (split-one x y)). Not exactly given end-of-file handling but it’s decent intuition.

split-one or split-once are kind of mouthfuls though. OCaml calls this lsplit which is concise but cryptic (there is a corresponding rsplit that finds the last occurrence of the separator).

So I would still prefer a shorter name and want to consider sep. It has a nice lexicographic symmetry with sub. It has the problem that “split” and “sep” are kind of synonyms, but I think split is a common enough name for that operation that we won’t mix them up. It has a nice mnemonic quality: it separates the string into “before” and “after” parts.

@sogaiu
Contributor

sogaiu commented Dec 6, 2024

Cast in the light of viewing things as a kind of "limited" split, I started to wonder about having an optional argument for split...

string/split's last argument (optional) is limit. Would a similar thing work for the split for pegs?

If so, perhaps there could be a short-hand name like til / pass / sep (or other name choice) that is split used with an appropriate value for the optional limit argument "underneath".

Kind of like how (opt patt) / (? patt) is (between 0 1 patt).

@ianthehenry
Contributor Author

That's a really interesting idea. For complete consistency with string/split, a limit would mean:

(peg/match ~(* (split :s+ ':w+) '(to -1) 1) "a b c d e")
# => ["a b c d e"] (nothing gets split)

(peg/match ~(* (split :s+ ':w+) '(to -1) 2) "a b c d e")
# => ["a" "b c d e"] (first occurrence gets split; first separator is nowhere to be seen)

But that seems unintuitive to me. I think the behavior of string/split's limit is... I think slightly unintuitive too, but it can be explained and understood in a simple way that doesn't apply to the PEG equivalent. I would propose breaking with the behavior of string/split and instead:

(peg/match ~(* (split :s+ ':w+) '(to -1) 1) "a b c d e")
# => ["a" "b c d e"] (1 means match one separator)

(peg/match ~(* (split :s+ ':w+) '(to -1) 2) "a b c d e")
# => ["a" "b" "c d e"] (2 means match 2 separators)

That is the behavior I'd expect. The inconsistency gives me a little pause but I think it makes sense in this case.

I have trouble imagining a case where this generality is useful, though -- I have never actually wanted a limited split beyond this "parse to the next delimiter" case. I think I would actually invert it, and say sep is the simple, primitive operation, and split is a special-case of sep. You can exactly implement split out of sep -- even though you cannot implement sep out of the split we have today. And since we already have limited repetition, you can implement a "limited split" with (4 (sep "," ':w+)), without any ambiguity about how that limit behaves. And you can go further, using between or lenprefix, and that gives you more flexibility than an optional argument to split would.
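
A rough way to try the "limited split via repetition" idea today is to emulate sep with existing specials. This assumes a Janet version that has sub. The helper sep-sketch is hypothetical and stands in for the proposed built-in; it evaluates the separator twice, which the real thing would not.

```janet
# Hypothetical stand-in for the proposed (sep separator patt) special:
# run patt on the text before the separator, then step over the separator.
(defn sep-sketch [separator patt]
  ~(* (sub (to ,separator) ,patt) (drop ,separator)))

# "Match two separators" written as plain repetition, then take the rest.
(def two-fields
  (peg/match ~(* (2 ,(sep-sketch :s+ '(capture :w+))) (capture (to -1)))
             "a b c d e"))
# two-fields => @["a" "b" "c d e"]
```

The count lives in the ordinary (n patt) repetition form, so there is no separate limit argument whose meaning has to be defined.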

@pyrmont
Contributor

pyrmont commented Dec 8, 2024

@ianthehenry Are the parens misplaced in your most recent examples? You talk about an optional integer argument to the split combinator but the placement of your parens, makes the integers an argument to * (i.e. sequence).

@sogaiu
Contributor

sogaiu commented Dec 8, 2024

@ianthehenry

since we already have limited repetition, you can implement a "limited split" with (4 (sep "," ':w+)), without any ambiguity about how that limit behaves.

Ah, that sounds very nice! I haven't tested it out, but it seems plausible.

And you can go further, using between or lenprefix, and that gives you more flexibility than an optional argument to split would.

Also sounds good 👍


Regarding the name...

To repeat what was stated earlier:

it separates the string into “before” and “after” parts.

I like how the name itself (sep) kind of illustrates what it does -- i.e. it produces sep from a full word (either separate or separator) by sort of being the result of its own action. Perhaps that kind of story can help with memory.

Although, as was mentioned earlier, split and sep might be considered kind of synonymous, with the story above, plus the fact that the string sep doesn't appear to be used widely as an identifier indicative of an action, maybe this works pretty well as a name that suffers less from confusion potential than the other alternatives.


As a historical note, I'm leaving a link to this PR which also mentions a sep. IIUC, that PR was very close in time to this discussion mentioned above.
