Implementing a streaming API on top of a RegExp-based tokenizer is fraught with difficulty, as discussed in #36. It's not clear what to do when a token overlaps with a chunk boundary; should we:
1. Require users to feed chunks which are already split on token boundaries.
2. Buffer input until we get a regex match that doesn't extend to the end of the buffer, which introduces unfortunate restrictions on the lexer definition.
3. Re-lex the entire input when we receive new data, and somehow notify the consumer to discard some (or all!) of the old tokens (ouch).
4. Throw everything away and write our own RegExp implementation from scratch; then we can directly query the FSM to see if the node we're on can transition past the next character!
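
To make the boundary problem concrete, here's a minimal sketch with a toy word/space lexer (the rules and input are invented for illustration): a token that straddles a chunk boundary gets silently split if each chunk is lexed on its own.

```js
const moo = require('moo')

// Toy lexer for this demo: lowercase words and runs of spaces
// (rule names invented for illustration).
const lexer = moo.compile({
  word: /[a-z]+/,
  space: / +/,
})

// Collect token values until the lexer runs out of input.
function values(lx) {
  const out = []
  let tok
  while ((tok = lx.next())) out.push(tok.value)
  return out
}

// "world" straddles the chunk boundary, so lexing each chunk
// independently splits it into two word tokens:
lexer.reset('hello wo')
console.log(values(lexer))  // [ 'hello', ' ', 'wo' ]
lexer.reset('rld')
console.log(values(lexer))  // [ 'rld' ]
```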
Only (1) seems like a workable solution. Excepting (4) (!), the rest you can re-implement yourself on top of `reset()`. So I propose removing `feed()` and re-introducing `remaining()` [which would return `buffer.slice(index)`]. Thus Moo is a fast lexer core; people can extend it with hacky strategies as necessary. :-)
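
For instance, strategy (2) can be layered on top by a consumer who accepts its restrictions. A rough sketch, not part of Moo: `StreamingLexer` and its `feed()`/`end()` are invented names, `remaining()` is assumed to return `buffer.slice(index)` as proposed above, and the lexer is assumed to throw when it hits input it can't match (Moo's default when no error rule is defined).

```js
// StreamingLexer is a hypothetical user-level wrapper, not part of Moo.
class StreamingLexer {
  constructor(lexer) {
    this.lexer = lexer
    this.pending = ''   // input carried over from the previous chunk
  }

  // Lex everything buffered so far and return only the tokens that
  // cannot be affected by future input (strategy (2) above).
  feed(chunk) {
    const buffer = this.pending + chunk
    this.lexer.reset(buffer)
    const toks = []
    let tok
    try {
      while ((tok = this.lexer.next())) toks.push(tok)
      this.pending = ''
    } catch (e) {
      // The tail of the buffer matched no rule -- probably the start of a
      // token that the next chunk will complete -- so carry it over.
      // remaining() is the proposed buffer.slice(index).
      this.pending = this.lexer.remaining()
    }
    // A token that ends exactly at the buffer boundary might still grow
    // (e.g. an identifier or number), so hold it back too.
    const last = toks[toks.length - 1]
    if (last && last.offset + last.text.length === buffer.length) {
      toks.pop()
      this.pending = buffer.slice(last.offset) + this.pending
    }
    return toks
  }

  // Flush whatever was held back once the input is complete.
  end() {
    this.lexer.reset(this.pending)
    this.pending = ''
    const toks = []
    let tok
    while ((tok = this.lexer.next())) toks.push(tok)
    return toks
  }
}
```

This inherits exactly the restriction noted in (2): it only works for lexer definitions where further input can extend the token in progress but never change tokens that ended before the buffer boundary.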