String splitting algorithms could use optional "nesting characters" #107

tabatkins · 2017-03-27T15:49:41Z

When splitting strings, it's reasonably common to want to split only on "top-level" instances of the split chars, treating certain "nesting" characters, like parens, as marking regions inside which you don't look for the splitting characters. For example, splitting a string on commas when the string can contain functions with comma-separated arguments.

Most of the strings I work with get parsed by CSS, which has a "split by top-level comma" algo already, so I don't have a concrete use for this in Infra just yet, but I use that algorithm commonly enough that I'd bet other people would benefit from having something like it available, at least for parens.

You'd need a list of start/end string pairs, keep a stack of the start strings seen so far, pop it when the end string matching the topmost start string is seen, and only trigger splitting when the stack is empty.
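
A minimal sketch of that idea in Python, not a proposed spec algorithm. The function name `split_top_level` and its parameters are hypothetical, and the sketch assumes non-empty start/end strings, no escapes, and no quoting. An end string seen with nothing open is treated as ordinary text, and an unclosed start string swallows the rest of the input.

```python
def split_top_level(text, separator=",", pairs=(("(", ")"),)):
    """Split `text` on `separator`, skipping separators inside any start/end pair."""
    if not separator:
        raise ValueError("separator must be non-empty")
    closers = dict(pairs)  # start string -> matching end string (assumed non-empty)
    stack = []             # end strings still awaited, innermost last
    pieces = []
    current = []
    i = 0
    while i < len(text):
        # Close the innermost open pair when its end string appears.
        if stack and text.startswith(stack[-1], i):
            end = stack.pop()
            current.append(end)
            i += len(end)
            continue
        # Open a new nesting level on any start string.
        opener = next((s for s in closers if text.startswith(s, i)), None)
        if opener is not None:
            stack.append(closers[opener])
            current.append(opener)
            i += len(opener)
            continue
        # Only split when nothing is open.
        if not stack and text.startswith(separator, i):
            pieces.append("".join(current))
            current = []
            i += len(separator)
            continue
        current.append(text[i])
        i += 1
    pieces.append("".join(current))
    return pieces
```

For instance, `split_top_level("a, f(b, c), d")` keeps `f(b, c)` together as one piece, and passing `pairs=(("(", ")"), ("[", "]"))` would treat brackets as nesting too. Like the plain split algorithms, this sketch does no whitespace stripping.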

annevk · 2017-03-27T15:54:12Z

It seems your comment got cut off. Is this used outside CSS parsers? Because inside CSS, you need to handle all kinds of other CSS rarities too, such as escapes, and then you might as well invoke the CSS parser to be sure.

tabatkins · 2017-03-28T00:08:55Z

Comment was cut off and I finished editing immediately, but apparently not before you checked the thread. Check again. ^_^

zcorpan · 2017-03-28T16:39:50Z

srcset parser has something like this, but without a stack (it uses a dumb state machine).

tabatkins · 2017-03-28T21:38:14Z

Looks like srcset only looks for parens to account for possible future CSS functions? Right now the only valid descriptors are 1w, 1x, and 1h.

If that's the case, then the algo is broken - it'll misparse at times once that starts being allowed. It needs to track nesting, and you're making my point for me. ^_^
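
To make the misparse concrete, here is a rough Python comparison. Both functions are hypothetical simplifications written for illustration; `split_with_flag` is not the actual srcset algorithm, only a stand-in for any splitter that tracks a single "in parens" state instead of nesting depth.

```python
def split_with_flag(text):
    """Split on commas using one boolean 'in parens' state (no depth tracking)."""
    pieces, current, in_parens = [], [], False
    for ch in text:
        if ch == "(":
            in_parens = True
        elif ch == ")":
            in_parens = False  # any ")" leaves the state, even a nested one
        if ch == "," and not in_parens:
            pieces.append("".join(current))
            current = []
        else:
            current.append(ch)
    pieces.append("".join(current))
    return pieces


def split_with_depth(text):
    """Split on commas only when the paren nesting depth is zero."""
    pieces, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        if ch == "," and depth == 0:
            pieces.append("".join(current))
            current = []
        else:
            current.append(ch)
    pieces.append("".join(current))
    return pieces
```

On an input like `f(g(a),b),c`, the flag version leaves the parens state at the first `)` and splits on the comma before `b`, while the depth version only splits on the final comma. The two agree as long as parens never nest.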

annevk · 2017-03-29T06:02:03Z

If srcset needs to be compatible with CSS it also needs to handle escapes and should just be defined by the CSS parser, I think.

zcorpan · 2017-03-29T17:00:21Z

It's not for CSS but for future descriptors like integrity(). The algorithm is intentionally "simple" and CSS compat is not a goal.

annevk · 2018-04-19T09:45:35Z

I think I'll need this at least for Content-Type and possibly other HTTP headers, but I'm not sure yet whether I want those to operate on strings or bytes.
