Parse grammar without regexes #1827

Merged

ehuss merged 1 commit into master from TC/parse-without-regex

May 17, 2025

Contributor

traviscross commented May 16, 2025

We'd been parsing the grammar with a combination of recursive descent and regular expression matchers. This combination has its merits, and it's done tastefully here, but it seems more straightforward to do the parsing entirely with recursive descent. Among other things, doing it this way lets us report errors on malformed inputs more precisely.

The cost of doing this, in lines of code, is rather modest, and the result seems at least as clear, since there's some mental cost to switching between the two styles. So let's make the switch and parse the grammar without regular expressions.

We verified that the rendered output of the Reference is byte identical before and after this change.

traviscross force-pushed the TC/parse-without-regex branch 2 times, most recently from 9525e43 to 2786fb4 Compare

May 16, 2025 07:15

traviscross commented

mdbook-spec/src/grammar/parser.rs

    
      loop {
          self.space0();
    -     let Some(ch) = self.parse_characters() else {
    +     let Some(ch) = self.parse_characters()? else {

Contributor Author

traviscross May 16, 2025

We now notice and report errors when parsing the elements of a character class.
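The one-character change above (adding `?`) is easier to see in isolation. What follows is a minimal sketch, not the actual `mdbook-spec` code: the `Parser` struct, `parse_element`, and the error strings are hypothetical, and the element syntax is reduced to a single byte between backticks. The point it illustrates is that once the element parser returns `Result<Option<_>, _>`, a `let ... else` with `?` can tell "no more elements" (`Ok(None)`) apart from "a malformed element" (`Err`).

```rust
// Hypothetical, simplified parser; names are illustrative only.
struct Parser<'a> {
    input: &'a [u8],
    index: usize,
}

impl<'a> Parser<'a> {
    fn new(input: &'a str) -> Self {
        Parser { input: input.as_bytes(), index: 0 }
    }

    /// Parses one element of a character class.
    /// `Ok(None)` means no element starts here (end of input or `]`).
    /// `Err(..)` means an element started but is malformed.
    fn parse_element(&mut self) -> Result<Option<char>, String> {
        match self.input.get(self.index).copied() {
            None | Some(b']') => Ok(None),
            Some(b'`') => {
                let ch = self.input.get(self.index + 1).copied();
                let close = self.input.get(self.index + 2).copied();
                match (ch, close) {
                    (Some(c), Some(b'`')) if c != b'`' => {
                        self.index += 3;
                        Ok(Some(c as char))
                    }
                    _ => Err(format!("malformed terminal at byte {}", self.index)),
                }
            }
            Some(_) => Err(format!("unexpected character at byte {}", self.index)),
        }
    }

    /// Collects elements until none is left, propagating any error.
    fn parse_class(&mut self) -> Result<Vec<char>, String> {
        let mut elements = Vec::new();
        loop {
            // Skip spaces between elements.
            while self.input.get(self.index) == Some(&b' ') {
                self.index += 1;
            }
            // `?` surfaces a malformed element; `else` handles "no element".
            let Some(ch) = self.parse_element()? else {
                break;
            };
            elements.push(ch);
        }
        Ok(elements)
    }
}

/// Convenience wrapper for trying the sketch out.
fn parse_class(src: &str) -> Result<Vec<char>, String> {
    Parser::new(src).parse_class()
}
```

With this sketch, an input whose last element lacks its closing backtick yields an `Err` instead of silently ending the loop.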

mdbook-spec/src/grammar/parser.rs

    
    self.index = recov + 1;
    bail!(self, "invalid start terminal in range");

Contributor Author

traviscross May 16, 2025

We recover the parser index before reporting errors so as to put the arrow in the correct place.
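To illustrate why the index is restored before the error is raised, here is a small sketch under stated assumptions. `ParseError`, `take_word`, `eat`, `single_char`, and `render` are all hypothetical helpers, and the range syntax is reduced to `x..y` over ASCII words. The sketch restores the index to the exact start of the offending terminal; the real code uses its own offset (`recov + 1`), which is not reproduced here.

```rust
#[derive(Debug, PartialEq)]
struct ParseError {
    message: String,
    index: usize, // byte offset the error arrow should point at
}

struct Parser<'a> {
    input: &'a str,
    index: usize,
}

impl<'a> Parser<'a> {
    fn new(input: &'a str) -> Self {
        Parser { input, index: 0 }
    }

    /// Builds an error located at the parser's current index.
    fn error(&self, message: &str) -> ParseError {
        ParseError { message: message.to_string(), index: self.index }
    }

    /// Consumes a run of ASCII alphanumerics and returns it.
    fn take_word(&mut self) -> &'a str {
        let input: &'a str = self.input;
        let rest = &input[self.index..];
        let len = rest.bytes().take_while(|b| b.is_ascii_alphanumeric()).count();
        self.index += len;
        &rest[..len]
    }

    fn eat(&mut self, token: &str) -> bool {
        if self.input[self.index..].starts_with(token) {
            self.index += token.len();
            true
        } else {
            false
        }
    }

    /// Parses `x..y`, validating only after the whole range was consumed.
    fn parse_range(&mut self) -> Result<(char, char), ParseError> {
        let start_at = self.index;
        let start = self.take_word();
        if !self.eat("..") {
            return Err(self.error("expected `..`"));
        }
        let end_at = self.index;
        let end = self.take_word();
        let Some(start) = single_char(start) else {
            // Rewind so the error points at the start terminal,
            // not at the end of the range we already consumed.
            self.index = start_at;
            return Err(self.error("invalid start terminal in range"));
        };
        let Some(end) = single_char(end) else {
            self.index = end_at;
            return Err(self.error("invalid end terminal in range"));
        };
        Ok((start, end))
    }
}

/// Returns the character if `s` is exactly one character long.
fn single_char(s: &str) -> Option<char> {
    let mut chars = s.chars();
    match (chars.next(), chars.next()) {
        (Some(c), None) => Some(c),
        _ => None,
    }
}

/// Renders the input with an arrow under the error position.
fn render(input: &str, err: &ParseError) -> String {
    format!("{}\n{}^ {}", input, " ".repeat(err.index), err.message)
}
```

Without the rewind, the error for `ab..c` would carry index 5 (the end of the input) and the arrow would land past the range rather than under `ab`.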

mdbook-spec/src/grammar/parser.rs

Comment on lines -348 to +365

    
    - static UNICODE_RE: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"^[A-Z0-9]{4}").unwrap());
    - match self.take_re(&UNICODE_RE) {
    -     Some(s) => Ok(ExpressionKind::Unicode(s[0].to_string())),
    -     None => bail!(self, "expected 4 hexadecimal uppercase digits after U+"),
    + let mut xs = Vec::with_capacity(4);
    + for _ in 0..4 {
    +     match self.peek() {
    +         Some(x @ (b'0'..=b'9' | b'A'..=b'F')) => {

Contributor Author

traviscross May 16, 2025

The old regex, `[A-Z0-9]{4}`, accepted any uppercase letter A-Z as a hex digit. We narrow this down to A-F.
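The hunk above is truncated, so here is a self-contained sketch of the same idea: read exactly four bytes, each of which must be `0-9` or `A-F`. The `Parser` struct, the `String` error type, and the free function `parse_unicode` are assumptions made for illustration; only the match pattern is taken from the diff.

```rust
// Hypothetical, simplified parser; only the byte pattern mirrors the diff.
struct Parser<'a> {
    input: &'a [u8],
    index: usize,
}

impl<'a> Parser<'a> {
    fn peek(&self) -> Option<u8> {
        self.input.get(self.index).copied()
    }

    /// Parses the four uppercase hex digits that follow `U+`.
    fn parse_unicode(&mut self) -> Result<String, String> {
        let mut xs = Vec::with_capacity(4);
        for _ in 0..4 {
            match self.peek() {
                // Lowercase digits and G-Z fall through to the error arm.
                Some(x @ (b'0'..=b'9' | b'A'..=b'F')) => {
                    xs.push(x);
                    self.index += 1;
                }
                _ => {
                    return Err(format!(
                        "expected 4 uppercase hexadecimal digits after U+ (at byte {})",
                        self.index
                    ));
                }
            }
        }
        // Every pushed byte is ASCII, so this conversion cannot fail.
        Ok(String::from_utf8(xs).expect("ASCII hex digits are valid UTF-8"))
    }
}

/// Convenience wrapper: `src` is the text immediately after `U+`.
fn parse_unicode(src: &str) -> Result<String, String> {
    Parser { input: src.as_bytes(), index: 0 }.parse_unicode()
}
```

Because the loop stops at the first bad byte, the reported position is that of the offending digit rather than the start of the sequence.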

mdbook-spec/src/grammar/parser.rs

    
    - match self.take_re(&FOOTNOTE_RE) {
    -     Some(cap) => Ok(Some(cap[1].to_string())),
    -     None => bail!(self, "unterminated footnote, expected closing `]`"),
    + let id = self.take_while(&|x| !['\n', ']'].contains(&x)).to_string();

Contributor Author

traviscross May 16, 2025

Really I'd prefer this `to_string` call to happen at the bottom, but alas, this is one of those borrow checker limitations.
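One plausible reading of that limitation, sketched under assumptions: if `take_while` takes `&mut self` and returns a `&str` with an elided lifetime, the returned slice keeps `self` mutably borrowed for as long as it is alive, so no further parser method can be called until the slice is dropped or copied. The signatures below are guesses for illustration, not the real `mdbook-spec` API, and `eat` and `parse_footnote` are hypothetical.

```rust
// Hypothetical, simplified parser; signatures are assumptions.
struct Parser<'a> {
    input: &'a str,
    index: usize,
}

impl<'a> Parser<'a> {
    fn new(input: &'a str) -> Self {
        Parser { input, index: 0 }
    }

    /// Consumes characters while `f` holds. By lifetime elision, the
    /// returned slice is tied to the `&mut self` borrow.
    fn take_while(&mut self, f: &dyn Fn(char) -> bool) -> &str {
        let rest = &self.input[self.index..];
        let len = rest.find(|c: char| !f(c)).unwrap_or(rest.len());
        self.index += len;
        &rest[..len]
    }

    fn eat(&mut self, token: &str) -> bool {
        if self.input[self.index..].starts_with(token) {
            self.index += token.len();
            true
        } else {
            false
        }
    }

    /// Parses `[^id]` and returns the id.
    fn parse_footnote(&mut self) -> Result<String, String> {
        if !self.eat("[^") {
            return Err("expected `[^`".to_string());
        }
        // Converting to an owned String here ends the borrow of `self`.
        // Keeping the `&str` and calling `self.eat` below would be
        // rejected, since `self` would still be mutably borrowed.
        let id = self.take_while(&|c: char| c != '\n' && c != ']').to_string();
        if !self.eat("]") {
            return Err("unterminated footnote, expected closing `]`".to_string());
        }
        Ok(id)
    }
}
```

In this sketch, declaring the return type as `&'a str` would decouple the slice from the `&mut self` borrow and let the `to_string` call move to the end; whether that option is available in the real parser depends on how it stores its input.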

mdbook-spec/src/grammar/parser.rs

    
    - static PROSE_RE: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"^<([^>\n]+)>").unwrap());
    - match self.take_re(&PROSE_RE) {
    -     Some(cap) => Ok(ExpressionKind::Prose(cap[1].to_string())),
    -     None => bail!(self, "unterminated prose, expected closing `>`"),

Contributor Author

traviscross May 16, 2025

Here and elsewhere, we were incorrectly reporting things being unterminated when they were actually empty. This is now fixed.
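The old regex `^<([^>\n]+)>` requires at least one character between the angle brackets, so `<>` failed to match and fell into the "unterminated" arm. A hand-written parser can separate the two cases. The sketch below is an illustration only: `ProseError` and the free function `parse_prose` are hypothetical names, not the replacement code from this PR.

```rust
#[derive(Debug, PartialEq)]
enum ProseError {
    /// The input does not start with `<`.
    Missing,
    /// The brackets are closed but contain nothing: `<>`.
    Empty,
    /// No closing `>` before the end of the line or input.
    Unterminated,
}

/// Parses `<text>` at the start of `src`.
/// Returns the text between the brackets and the remaining input.
fn parse_prose(src: &str) -> Result<(&str, &str), ProseError> {
    let rest = src.strip_prefix('<').ok_or(ProseError::Missing)?;
    // Stop at the closing bracket or at a newline, whichever comes first.
    let len = rest
        .find(|c: char| c == '>' || c == '\n')
        .unwrap_or(rest.len());
    let (body, tail) = rest.split_at(len);
    if !tail.starts_with('>') {
        return Err(ProseError::Unterminated);
    }
    if body.is_empty() {
        return Err(ProseError::Empty);
    }
    // Skip the one-byte closing `>`.
    Ok((body, &tail[1..]))
}
```

Checking for the closing bracket first means a lone `<` is still reported as unterminated, while `<>` gets its own, more accurate message.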

traviscross marked this pull request as ready for review

May 16, 2025 07:22

rustbot added the S-waiting-on-review label


          Parse grammar without regexes

1a7304b


traviscross force-pushed the TC/parse-without-regex branch from 2786fb4 to 1a7304b Compare

May 16, 2025 07:28

ehuss approved these changes

Contributor

ehuss left a comment

Thanks!

Yeah, the intent of using regexes was to reduce LOC at the cost of worse error messages, but this seems fine to me.

ehuss added this pull request to the merge queue

Merged via the queue into master with commit 83fec9a

5 checks passed

rustbot mentioned this pull request

Update books rust-lang/rust#141259

Merged

Zalathar added a commit to Zalathar/rust that referenced this pull request


          Rollup merge of rust-lang#141259 - rustbot:docs-update, r=ehuss

3aec6fa

Update books

## rust-lang/book

4 commits in d33916341d480caede1d0ae57cbeae23aab23e88..230c68bc1e08f5f3228384a28cc228c81dfbd10d
2025-05-19 14:25:14 UTC to 2025-05-08 21:28:56 UTC

- Chapter 6 from tech review (rust-lang/book#4370)
- Chapter 5 from tech review (rust-lang/book#4359)
- Chapter 4 from tech review (rust-lang/book#4358)
- Chapter 3 from tech review (rust-lang/book#4353)

## rust-lang/reference

12 commits in 387392674d74656f7cb437c05a96f0c52ea8e601..acd0231ebc74849f6a8907b5e646ce86721aad76
2025-05-19 15:41:22 UTC to 2025-05-06 21:36:01 UTC

- Add doc for avx512 target features (rust-lang/reference#1778)
- Parse grammar without regexes (rust-lang/reference#1827)
- Parse optionals and repeats without regexes (rust-lang/reference#1826)
- Fix grammar for `RangePatternBound` regarding literals (rust-lang/reference#1825)
- Fix grammar for `LiteralPattern` regarding `-` (rust-lang/reference#1824)
- Doc: Add the LoongArch stabilized target features (rust-lang/reference#1707)
- Fix naked em-dash (rust-lang/reference#1820)
- Add missing attribute for statement macros (rust-lang/reference#1819)
- Make linked rules are clicked, highlight the color (rust-lang/reference#1817)
- Use the reference grammar for inline assembly (rust-lang/reference#1807)
- Fix typo in introduction (rust-lang/reference#1810)
- Add an example admonition (rust-lang/reference#1812)

## rust-lang/rust-by-example

2 commits in 8a8918c698534547fa8a1a693cb3e7277f0bfb2f..c9d151f9147c4808c77f0375ba3fa5d54443cb9e
2025-05-13 17:49:05 UTC to 2025-05-13 17:48:43 UTC

- fix(docs): standardize on `no_run` attribute for documentation examples (rust-lang/rust-by-example#1929)
- Fix typo in Japanese translation (rust-lang/rust-by-example#1928)

rust-timer added a commit to rust-lang-ci/rust that referenced this pull request


          Unrolled build for rust-lang#141259

4e69f24


Labels

S-waiting-on-review