String splitting algorithms could use optional "nesting characters" #107

Open
tabatkins opened this issue Mar 27, 2017 · 7 comments

@tabatkins
Contributor

tabatkins commented Mar 27, 2017

When splitting strings, it's reasonably common to want to split only on "top-level" instances of the split characters, and to treat "nesting" characters, like parens, as delimiting regions within which you don't look for the split characters. For example, splitting a string on commas when the string can contain functions with comma-separated arguments.

Most of the strings I work with get parsed by CSS, which has a "split by top-level comma" algo already, so I don't have a concrete use for this in Infra just yet, but I use that algorithm commonly enough that I'd bet other people would benefit from having something like it available, at least for parens.

You'd need a list of start/end string pairs, and keep a stack of start strings seen that gets popped when the topmost end string is seen, and only trigger splitting when the stack is empty.
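
A minimal Python sketch of the stack-based approach described above. The function name `split_top_level` and its parameters are hypothetical, not part of Infra or any existing spec. It assumes a non-empty separator, treats an end string with no matching start as ordinary text, and never splits again after an unclosed start string.

```python
def split_top_level(text, separator=",", pairs=(("(", ")"),)):
    """Split text on separator, skipping separators nested inside pairs.

    Hypothetical helper: names and defaults are illustrative only.
    """
    if not separator:
        raise ValueError("separator must be non-empty")
    closers = dict(pairs)  # start string -> matching end string
    stack = []             # end strings we are still waiting for
    pieces = []
    current = []
    i = 0
    while i < len(text):
        # Pop the stack when the topmost expected end string appears.
        if stack and text.startswith(stack[-1], i):
            end = stack.pop()
            current.append(end)
            i += len(end)
            continue
        # Push the matching end string when any start string appears.
        start = next((s for s in closers if text.startswith(s, i)), None)
        if start is not None:
            stack.append(closers[start])
            current.append(start)
            i += len(start)
            continue
        # Only split while the stack is empty, i.e. at the top level.
        if not stack and text.startswith(separator, i):
            pieces.append("".join(current))
            current = []
            i += len(separator)
            continue
        current.append(text[i])
        i += 1
    pieces.append("".join(current))
    return pieces


print(split_top_level("a, f(b, c), d"))
```

With the default arguments, only the two top-level commas split the string, so the comma inside `f(b, c)` is left alone. Whitespace is kept as-is; trimming would be a separate step.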

@annevk
Member

annevk commented Mar 27, 2017

It seems your comment got cut off. Is this used outside CSS parsers? Because inside CSS, you need to handle all kinds of other CSS rarities too, such as escapes, and then you might as well invoke the CSS parser to be sure.

@tabatkins
Contributor Author

The comment was cut off and I finished editing it immediately, but apparently not before you checked the thread. Check again. ^_^

@zcorpan
Member

zcorpan commented Mar 28, 2017

srcset parser has something like this, but without a stack (it uses a dumb state machine).

@tabatkins
Contributor Author

Looks like srcset only looks for parens to account for possible future CSS functions? Right now the only valid descriptors are 1w, 1x, and 1h.

If that's the case, then the algo is broken - it'll misparse at times once that starts being allowed. It needs to track nesting, and you're making my point for me. ^_^
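
To illustrate the kind of misparse meant here, the sketch below compares a splitter that only remembers "inside parens or not" with one that tracks nesting depth. This is not the srcset algorithm from the HTML Standard. Both functions are simplified, hypothetical stand-ins, and the input string is made up.

```python
def split_with_flag(text):
    """Comma splitter with a single in-parens flag and no nesting depth."""
    pieces, current, in_parens = [], [], False
    for ch in text:
        if ch == "(":
            in_parens = True
        elif ch == ")":
            in_parens = False  # any ")" leaves the parenthesized state
        if ch == "," and not in_parens:
            pieces.append("".join(current))
            current = []
        else:
            current.append(ch)
    pieces.append("".join(current))
    return pieces


def split_with_depth(text):
    """Comma splitter that counts open parens, so nested groups stay whole."""
    pieces, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        if ch == "," and depth == 0:
            pieces.append("".join(current))
            current = []
        else:
            current.append(ch)
    pieces.append("".join(current))
    return pieces


nested = "a(b(c),d),e"
print(split_with_flag(nested))   # ['a(b(c)', 'd)', 'e']
print(split_with_depth(nested))  # ['a(b(c),d)', 'e']
```

On input without nested parens, such as `a(b,c),d`, the two agree. They diverge only once a parenthesized group contains another one.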

@annevk
Member

annevk commented Mar 29, 2017

If srcset needs to be compatible with CSS it also needs to handle escapes and should just be defined by the CSS parser, I think.

@zcorpan
Member

zcorpan commented Mar 29, 2017

It's not for CSS but for future descriptors like integrity(). The algorithm is intentionally "simple" and CSS compat is not a goal.

@annevk
Member

annevk commented Apr 19, 2018

I think I'll need this at least for Content-Type and possibly other HTTP headers, but I'm not sure yet whether I want those to operate on strings or bytes.
