Bad regex in space delimited parser causes Lute to not start! #520

Open
jzohrab opened this issue Nov 23, 2024 · 0 comments
Labels
bug Something isn't working

jzohrab commented Nov 23, 2024

From Discord, a user (Shijui) configured the space delimited parser like this:

8|Japanese|´='|`='|’='|‘='|...=…|..=‥|.!?。?!|Mr.|Mrs.|Dr.|[A-Z].|Vd.|Vds.|\p{Han}\p{Katakana}\p{Hiragana}|0|1|spacedel

which caused Lute to fail at startup:

  File "/Users/jeff/Documents/Projects/lute-v3/lute/db/data_cleanup.py", line 39, in clean_data
    _set_texts_word_count(session)
  File "/Users/jeff/Documents/Projects/lute-v3/lute/db/data_cleanup.py", line 30, in _set_texts_word_count
    pt = t.book.language.get_parsed_tokens(t.text)
  File "/Users/jeff/Documents/Projects/lute-v3/lute/models/language.py", line 127, in get_parsed_tokens
    return self.parser.get_parsed_tokens(s, self)
  File "/Users/jeff/Documents/Projects/lute-v3/lute/parse/space_delimited_parser.py", line 169, in get_parsed_tokens
    return self._parse_to_tokens(clean_text, language)
  File "/Users/jeff/Documents/Projects/lute-v3/lute/parse/space_delimited_parser.py", line 203, in _parse_to_tokens
    self.parse_para(para, lang, tokens)
  File "/Users/jeff/Documents/Projects/lute-v3/lute/parse/space_delimited_parser.py", line 222, in parse_para
    m = self.preg_match_capture(pattern, text)
..... etc etc
    raise source.error('bad escape %s' % escape, len(escape))
re.error: bad escape \p at position 37

I added a print statement to the space-delimited parser:

$ git diff
diff --git a/lute/parse/space_delimited_parser.py b/lute/parse/space_delimited_parser.py
index 0ef26979..e9bb0ba8 100644
--- a/lute/parse/space_delimited_parser.py
+++ b/lute/parse/space_delimited_parser.py
@@ -34,6 +34,7 @@ class SpaceDelimitedParser(AbstractParser):
     @functools.lru_cache
     def compile_re_pattern(pattern: str, *args, **kwargs) -> re.Pattern:
         """Compile regular expression pattern, cache result for fast re-use."""
+        print(pattern, flush=True)
         return re.compile(pattern, *args, **kwargs)

and got

(Mr\.|Mrs\.|Dr\.|[A-Z]\.|Vd\.|Vds\.|[\p{Han}\p{Katakana}\p{Hiragana}]*)
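Compiling that printed pattern on its own with Python's built-in re module should reproduce the failure outside Lute. This is only a minimal sketch: `try_compile` is a hypothetical helper written for this illustration, not something in the Lute codebase.

```python
import re

# Pattern exactly as printed by the debug statement above.
PATTERN = r"(Mr\.|Mrs\.|Dr\.|[A-Z]\.|Vd\.|Vds\.|[\p{Han}\p{Katakana}\p{Hiragana}]*)"


def try_compile(pattern: str):
    """Return (compiled, None) on success, or (None, message) on re.error.

    Hypothetical helper for this reproduction only.
    """
    try:
        return re.compile(pattern), None
    except re.error as e:
        return None, str(e)


compiled, err = try_compile(PATTERN)
print(compiled)  # None, because the pattern does not compile
print(err)       # the same "bad escape \p" message as in the traceback
```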

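Python's built-in re module doesn't support "\p{...}" Unicode property escapes, so compiling this pattern raises re.error. A bad regex in a language definition shouldn't break the entire parser, though, and it shouldn't cause Lute to die at startup.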
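One possible direction, sketched under assumptions: check the user-supplied word characters before building the full pattern, and fall back to something safe if they don't compile. The names `usable_word_chars` and `FALLBACK_WORD_CHARS` are hypothetical, and the fallback value is just a placeholder, not Lute's actual default. Where the check would live in Lute is left open.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Hypothetical fallback for illustration only; not Lute's real default.
FALLBACK_WORD_CHARS = "a-zA-Z"


def usable_word_chars(word_chars: str) -> str:
    """Return word_chars if it compiles inside a character class,
    otherwise log a warning and return the fallback."""
    try:
        # Same shape as the failing part of the pattern: [<word chars>]*
        re.compile(f"[{word_chars}]*")
        return word_chars
    except re.error as e:
        logger.warning("Ignoring invalid word characters %r: %s", word_chars, e)
        return FALLBACK_WORD_CHARS


# The setting from this issue is rejected and replaced by the fallback.
print(usable_word_chars(r"\p{Han}\p{Katakana}\p{Hiragana}"))
# A setting that re can compile is returned unchanged.
print(usable_word_chars("a-zA-ZÀ-ÖØ-öø-ȳ"))
```

With a guard like this, a bad setting would degrade parsing for one language instead of raising during startup. The same try/except could also run when the language is saved, so the user sees the error right away.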

@jzohrab jzohrab added the bug Something isn't working label Nov 23, 2024
@jzohrab jzohrab added this to Lute-v3 Nov 23, 2024