Add fix for hyperscan tokenizer #235
base: main
Conversation
Limit the punctuation regex to 3 or fewer punctuation characters in a row; there is no need to match anything longer. Also reduce the text matched around the section regex: when the input contains gibberish, the old pattern would cause lots of bad matches and unknown citations.
Added an `if not m: continue` check to `extract_tokens()` to prevent processing invalid matches that fail when re-running the regex on extracted text. Previously, Hyperscan detected matches based on byte offsets, but some of these did not align properly when converted to Unicode string offsets. This caused `.match(text[start:end])` to return `None`, potentially leading to errors when calling `get_token(m, offset=start)`. Now, we explicitly skip such cases to ensure only valid tokens are processed. Also, this puts some rational constraints on the section and punctuation regexes so they do not match indefinitely on gibberish.
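For readers unfamiliar with the offset mismatch described above, here is a small standalone illustration (not eyecite code) of the general phenomenon: byte offsets reported by a byte-oriented matcher such as Hyperscan drift away from Python `str` offsets as soon as the text contains multi-byte UTF-8 characters.

```python
# Standalone illustration, not from this PR: byte offsets and str offsets
# diverge once the text contains multi-byte UTF-8 characters, which is why a
# match found on bytes may not line up with text[start:end] after conversion.
text = "Voß v. Bar, 1 F.Supp. 1"      # "ß" is one str character but two UTF-8 bytes
data = text.encode("utf-8")

byte_start = data.find(b"1 F.Supp.")   # offset in bytes
str_start = text.find("1 F.Supp.")     # offset in characters
print(byte_start, str_start)           # 13 12 -- off by one after the "ß"
# When such a conversion goes wrong, re-running the regex on the sliced text
# fails, which is exactly the case the new `if not m: continue` guard skips.
```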
Feels like the kind of PR that should really have a test?
Update `all_whitespace` to handle other whitespace characters not captured by `\s`: non-breaking space, em-space, en-space, thin space. Add a test.
The Eyecite Report 👁️
Gains and Losses: there were 95 gains and 0 losses.
@@ -78,7 +78,10 @@ def all_whitespace(text: str) -> str:
     Returns:
         Text with collapsed whitespace characters.
     """
-    return re.sub(r"\s+", " ", text)
+    WHITESPACE_REGEX = (
+        r"[ \t\n\r\f\v\u00A0\u2002\u2003\u2009\u200B\u202F\u205F]+"
The only character in the list that is not included in `r"\s"` is `\u200b`, so I would suggest doing `WHITESPACE_REGEX = r"[\u200b\s]+"`:
In [18]: [i for i in list("\t\n\r\f\v\u00A0\u2002\u2003\u2009\u200B\u202F\u205F") if not i.isspace()]
Out[18]: ['\u200b']
or
In [26]: re.sub(r"\s+", "", "\t\n\r\f\v\u00A0\u2002\u2003\u2009\u200B\u202F\u205F")
Out[26]: '\u200b'
This would be:
- more clear;
- and maybe `\s` contains more characters than the ones you are listing?
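If it helps, a minimal sketch of what that suggestion could look like, assuming the goal is only to add the zero-width space to Python's `\s` class (the names mirror the diff above, but this is not the merged code):

```python
import re

# Sketch of the reviewer's suggestion, not the merged code: in Python 3 str
# patterns `\s` already covers \t \n \r \f \v \u00A0 \u2002 \u2003 \u2009
# \u202F \u205F, so only the zero-width space (U+200B) needs adding.
WHITESPACE_REGEX = r"[\u200b\s]+"

def all_whitespace(text: str) -> str:
    """Collapse runs of whitespace (including zero-width space) to one space."""
    return re.sub(WHITESPACE_REGEX, " ", text)

print(repr(all_whitespace("Shady Grove Farms \xa0v\u200bGoldsmith Seeds")))
# 'Shady Grove Farms v Goldsmith Seeds'
```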
@@ -466,6 +466,9 @@ def on_match(index, start, end, flags, context):
     start = byte_to_str_offset[start]
     end = byte_to_str_offset[end]
     m = extractor.compiled_regex.match(text[start:end])
+    if not m:
+        # skip if re-run regex fails to detect match
This is not expected to be common, right? Why don't we put a `logger.error` here, so we can analyze what's going on in the edge cases?
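A self-contained sketch of that idea, using a hypothetical helper name (`report_failed_rematch` is not part of eyecite) to stand in for whatever would be called from the `if not m:` branch before `continue`:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("eyecite.tokenizers")

def report_failed_rematch(text: str, start: int, end: int) -> None:
    """Hypothetical helper sketching the reviewer's logger.error idea: call it
    from the `if not m:` branch so the edge cases can be analyzed later."""
    logger.error(
        "hyperscan match did not re-match on str offsets: %r at [%d:%d]",
        text[start:end], start, end,
    )

report_failed_rematch("Foo v. Bar, 1 F.Supp. 1 (SC 1967)", 12, 21)
```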
@@ -630,6 +630,12 @@ def test_find_citations(self):
     metadata={'plaintiff': 'Commonwealth', 'defendant': 'Muniz',
               'court': 'pa'})]),
     ('Foo v. Bar, 1 F.Supp. 1 (SC 1967)', [case_citation(volume='1', reporter='F.Supp.', year=1967, page='1', metadata={'plaintiff': 'Foo', 'defendant': 'Bar', 'court': 'sc'})]),
+    ('Shady Grove Farms \xa0v Goldsmith Seeds 1 U.S. 1 (1981)', [
This is an example for the whitespace change; however, this case was already covered by the plain `r"\s"` regex, as pointed out in the other comment. I think you should try to cover the error cases we have seen. For example, this would still be broken:
In [32]: clean_text(' \x08*\x07\x07\u038bþİ\u038b\u202cڋ\u202a-\x14V\u202c\u202c', ["all_whitespace"])
Out[32]: ' \x08*\x07\x07\u038bþİ\u038b\u202cڋ\u202a-\x14V\u202c\u202c'
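To make that failure more concrete, a quick probe (not part of the PR) of which characters in the sample survive a `\s`-based cleanup and why:

```python
import re
import unicodedata

# Quick probe, not part of the PR: list the characters in the reviewer's sample
# that a `\s`-based cleanup leaves untouched, with their Unicode categories.
sample = ' \x08*\x07\x07\u038b\u00fe\u0130\u038b\u202c\u068b\u202a-\x14V\u202c\u202c'
leftover = re.sub(r"\s+", "", sample)
print([(hex(ord(c)), unicodedata.category(c)) for c in leftover])
# Control characters (Cc) and bidi formatting marks (Cf) are not whitespace,
# so neither `\s` nor an explicit whitespace list will collapse them.
```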
@@ -52,7 +52,7 @@ def short_cite_re(regex):

 # Regex to match punctuation around volume numbers and stopwords.
 # This could potentially be more precise.
-PUNCTUATION_REGEX = r"[^\sa-zA-Z0-9]*"
+PUNCTUATION_REGEX = r"[^\sa-zA-Z0-9]{,3}"
Why make these changes? Do you have examples or comments on why?
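For context, a quick standalone comparison (not from the PR) of what the `{,3}` bound changes when the input is a run of gibberish punctuation:

```python
import re

# Standalone comparison, not eyecite code: an unbounded quantifier swallows an
# arbitrarily long run of punctuation, while {,3} (i.e. {0,3}) stops at three.
gibberish = "!@#$%^&*()" * 20

unbounded = re.match(r"[^\sa-zA-Z0-9]*", gibberish)
bounded = re.match(r"[^\sa-zA-Z0-9]{,3}", gibberish)
print(len(unbounded.group()), len(bounded.group()))   # 200 3
```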
@@ -79,7 +79,7 @@ def short_cite_re(regex):
 )

 # Regex for SectionToken
-SECTION_REGEX = r"(\S*§\S*)"
+SECTION_REGEX = space_boundaries_re(r"([\w\.\,\-]*§[\w\.\,\-]*)")
Escaping with `\` is not needed inside `[]`.
From the benchmark, it seems we are losing these?
- §15.50(a)
- [§]52-249a
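A rough check with plain `re` (ignoring eyecite's `space_boundaries_re` wrapper, so only approximate) of why those two strings regress under the new pattern:

```python
import re

# Rough check with plain `re`: the old pattern keeps the whole token, while the
# new character class stops at "(" and cannot cross "[", matching the losses
# reported by the benchmark.
old = re.compile(r"(\S*§\S*)")
new = re.compile(r"([\w\.\,\-]*§[\w\.\,\-]*)")

for s in ["§15.50(a)", "[§]52-249a"]:
    print(s, "->", old.search(s).group(), "vs", new.search(s).group())
# §15.50(a) -> §15.50(a) vs §15.50
# [§]52-249a -> [§]52-249a vs §
```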