Parse grammar without regexes #1827
}

fn parse_charset(&mut self) -> Result<ExpressionKind> {
    self.expect("[", "expected opening [")?;
    let mut characters = Vec::new();
    loop {
        self.space0();
-       let Some(ch) = self.parse_characters() else {
+       let Some(ch) = self.parse_characters()? else {
We now notice and report errors when parsing the elements of a character class.
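As a rough illustration of the `?`-inside-`let ... else` pattern, here is a minimal sketch, assuming an element parser that returns `Result<Option<_>, _>`. The `Item` type and the `parse_item` and `parse_charset` free functions are hypothetical stand-ins and not the code in this PR; whitespace handling is omitted.

```rust
// Hypothetical stand-in for the real expression type.
#[derive(Debug, PartialEq)]
enum Item {
    Single(char),
    Range(char, char),
}

/// Parses one element starting at `*index`.
/// `Ok(None)` means "no more elements", `Err(_)` means "malformed element".
fn parse_item(input: &[u8], index: &mut usize) -> Result<Option<Item>, String> {
    match input.get(*index) {
        None | Some(b']') => Ok(None),
        Some(&a) => {
            *index += 1;
            if input.get(*index) == Some(&b'-') {
                // A range needs an end character after the dash.
                match input.get(*index + 1) {
                    Some(&b) if b != b']' => {
                        *index += 2;
                        Ok(Some(Item::Range(a as char, b as char)))
                    }
                    _ => Err(format!("expected range end at byte {}", *index + 1)),
                }
            } else {
                Ok(Some(Item::Single(a as char)))
            }
        }
    }
}

fn parse_charset(input: &str) -> Result<Vec<Item>, String> {
    let bytes = input.as_bytes();
    if bytes.first() != Some(&b'[') {
        return Err("expected opening [".to_string());
    }
    let mut index = 1;
    let mut items = Vec::new();
    loop {
        // `?` propagates a malformed element as an error;
        // the `else` branch only handles the "no more elements" case.
        let Some(item) = parse_item(bytes, &mut index)? else {
            break;
        };
        items.push(item);
    }
    if bytes.get(index) != Some(&b']') {
        return Err("expected closing ]".to_string());
    }
    Ok(items)
}
```

Without the `?`, the element parser would have to fold "malformed" and "no more elements" into the same `None`, and the loop could not tell them apart.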
self.index = recov + 1;
bail!(self, "invalid start terminal in range");
We recover the parser index before reporting errors so as to put the arrow in the correct place.
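The idea can be sketched as follows, assuming the error captures the parser index at the moment it is created. The `Parser`, `ParseError`, `parse_number`, and `render` names are hypothetical and only illustrate the rewind-then-report order; they are not taken from this PR.

```rust
struct Parser<'a> {
    input: &'a str,
    index: usize,
}

#[derive(Debug, PartialEq)]
struct ParseError {
    message: String,
    index: usize, // where the arrow will be drawn
}

impl<'a> Parser<'a> {
    /// Builds an error that remembers the current index.
    fn error(&self, message: &str) -> ParseError {
        ParseError { message: message.to_string(), index: self.index }
    }

    /// Scans an alphanumeric token and requires it to be a decimal number.
    fn parse_number(&mut self) -> Result<u32, ParseError> {
        let start = self.index; // saved so we can rewind on failure
        let bytes = self.input.as_bytes();
        while self.index < bytes.len() && bytes[self.index].is_ascii_alphanumeric() {
            self.index += 1;
        }
        let token = &self.input[start..self.index];
        match token.parse::<u32>() {
            Ok(n) => Ok(n),
            Err(_) => {
                // Rewind first, so the error points at the start of the
                // bad token rather than at wherever scanning stopped.
                self.index = start;
                Err(self.error("invalid number"))
            }
        }
    }
}

/// Renders the input with an arrow under the error position.
fn render(input: &str, err: &ParseError) -> String {
    format!("{}\n{}^ {}", input, " ".repeat(err.index), err.message)
}
```

For the input `x = 12ab` with the parser positioned at byte 4, the arrow lands under the `1`. If the index were not restored, it would land past the end of the line.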
- static UNICODE_RE: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"^[A-Z0-9]{4}").unwrap());
-
- match self.take_re(&UNICODE_RE) {
-     Some(s) => Ok(ExpressionKind::Unicode(s[0].to_string())),
-     None => bail!(self, "expected 4 hexadecimal uppercase digits after U+"),
+ let mut xs = Vec::with_capacity(4);
+ for _ in 0..4 {
+     match self.peek() {
+         Some(x @ (b'0'..=b'9' | b'A'..=b'F')) => {
The old code was accepting `A-Z` in a hex digit. We narrow this down to `A-F`.
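A self-contained sketch of the narrowed check, assuming the input is a plain string slice that starts with `U+`. The `parse_unicode` function is a hypothetical free-standing version and not the method in this PR; the error text follows the message shown in the diff.

```rust
/// Parses `U+` followed by exactly four uppercase hexadecimal digits
/// and returns the four digits.
fn parse_unicode(input: &str) -> Result<String, String> {
    let rest = input
        .strip_prefix("U+")
        .ok_or_else(|| "expected U+".to_string())?;
    let mut digits = String::with_capacity(4);
    let mut bytes = rest.bytes();
    for _ in 0..4 {
        match bytes.next() {
            // Only 0-9 and A-F count as hex digits; G-Z are rejected.
            Some(x @ (b'0'..=b'9' | b'A'..=b'F')) => digits.push(x as char),
            _ => {
                return Err(
                    "expected 4 hexadecimal uppercase digits after U+".to_string(),
                )
            }
        }
    }
    Ok(digits)
}
```

With this check, `U+00A0` is accepted while `U+00G0`, which the pattern `[A-Z0-9]{4}` would have matched, is rejected.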
- match self.take_re(&FOOTNOTE_RE) {
-     Some(cap) => Ok(Some(cap[1].to_string())),
-     None => bail!(self, "unterminated footnote, expected closing `]`"),
+ let id = self.take_while(&|x| !['\n', ']'].contains(&x)).to_string();
Really I'd prefer this `to_string` call to happen at the bottom, but alas, this is one of those borrow checker limitations.
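To show the kind of limitation involved, here is a minimal sketch under an assumed signature: `take_while` takes `&mut self` and returns a `&str` whose lifetime is tied to that mutable borrow. The `Parser` type and the signatures of `take_while`, `expect`, and `parse_footnote_id` are hypothetical; the real ones in this PR may differ.

```rust
struct Parser<'a> {
    input: &'a str,
    index: usize,
}

impl<'a> Parser<'a> {
    /// Advances while `pred` holds and returns the consumed slice.
    /// The elided return lifetime ties the slice to the `&mut self` borrow.
    fn take_while(&mut self, pred: &dyn Fn(char) -> bool) -> &str {
        let input = self.input;
        let start = self.index;
        for ch in input[start..].chars() {
            if !pred(ch) {
                break;
            }
            self.index += ch.len_utf8();
        }
        &input[start..self.index]
    }

    /// Consumes `s` or fails with `err`.
    fn expect(&mut self, s: &str, err: &str) -> Result<(), String> {
        if self.input[self.index..].starts_with(s) {
            self.index += s.len();
            Ok(())
        } else {
            Err(err.to_string())
        }
    }

    fn parse_footnote_id(&mut self) -> Result<String, String> {
        // Copying here ends the mutable borrow held by the returned slice.
        // Keeping the `&str` alive across `self.expect` below would not
        // compile in this sketch, because `self` would still be borrowed.
        let id = self.take_while(&|x| !['\n', ']'].contains(&x)).to_string();
        self.expect("]", "unterminated footnote, expected closing `]`")?;
        Ok(id)
    }
}
```

In this sketch, declaring the return type of `take_while` as `&'a str` would decouple the slice from the `&mut self` borrow and allow the copy to be deferred; whether that is practical in the real parser is not something the source states.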
- static PROSE_RE: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"^<([^>\n]+)>").unwrap());
- match self.take_re(&PROSE_RE) {
-     Some(cap) => Ok(ExpressionKind::Prose(cap[1].to_string())),
-     None => bail!(self, "unterminated prose, expected closing `>`"),
Here and elsewhere, we were incorrectly reporting things being unterminated when they were actually empty. This is now fixed.
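The regex `^<([^>\n]+)>` requires at least one character between the brackets, so `<>` fails to match and falls into the same "unterminated" branch as a genuinely unclosed `<abc`. A hand-written check can separate the two cases. The following is a minimal sketch; the `parse_prose` function and the "empty prose" message are assumptions for illustration and not the wording used in this PR.

```rust
/// Parses `<...>` on a single line and returns the text between the brackets.
fn parse_prose(input: &str) -> Result<&str, &'static str> {
    let rest = input.strip_prefix('<').ok_or("expected opening `<`")?;
    // The body runs up to the first `>` or newline, or to the end of input.
    let end = rest
        .find(|c: char| c == '>' || c == '\n')
        .unwrap_or(rest.len());
    let body = &rest[..end];
    if !rest[end..].starts_with('>') {
        // We hit a newline or the end of input before any `>`.
        return Err("unterminated prose, expected closing `>`");
    }
    if body.is_empty() {
        // The closing `>` is present, but nothing sits between the brackets.
        return Err("empty prose");
    }
    Ok(body)
}
```

Here `<>` reports the empty case and `<abc` reports the unterminated case, where the single regex match could only report one message for both.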
We'd been parsing the grammar with a combination of recursive descent and regular expression matchers. This combination has its merits, and it's done tastefully here, but it seems maybe more straightforward to do the parsing entirely with recursive descent. Among other things, doing it this way allows us to provide more precise error reporting on malformed inputs.

The cost, in terms of lines of code, of doing this is rather modest, and the result seems at least as clear -- there's some mental cost to code switching between the two worlds. So let's make the switch and parse the grammar without regular expressions.

We verified that the rendered output of the Reference is byte identical before and after this change.
cc @ehuss