Add fix for hyperscan tokenizer #235

Open · wants to merge 8 commits into main

Conversation

@flooie (Contributor) commented Mar 4, 2025

Added an if not m: continue check to extract_tokens()
to prevent processing invalid matches that fail when
re-running the regex on extracted text.

Previously, Hyperscan detected matches based on byte
offsets, but some of these did not align properly
when converted to Unicode string offsets. This caused
.match(text[start:end]) to return None,
potentially leading to errors when calling
get_token(m, offset=start).

Now, we explicitly skip such cases to ensure
only valid tokens are processed.

Also, put some reasonable constraints on the section
and punctuation regexes to avoid matching indefinitely on gibberish.
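
To illustrate the offset mismatch, here is a minimal sketch (illustrative only, not eyecite code) of how UTF-8 byte offsets drift away from Python str offsets as soon as non-ASCII text appears:

```python
# Hyperscan reports byte offsets into the UTF-8 encoded text, while
# Python strings are indexed by code point. Any non-ASCII character,
# e.g. "§" (two bytes in UTF-8), shifts the two numbering schemes apart.
text = "See § 123"
data = text.encode("utf-8")

print(data.index(b"123"))  # 7 -- byte offset ("§" occupies two bytes)
print(text.index("123"))   # 6 -- str offset
```

If a shifted offset is used to slice the str, re-running the regex on the slice can fail, which is exactly the case the new `if not m: continue` guard skips.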

flooie added 5 commits March 4, 2025 12:48
Limit punctuation regex to 3 or fewer punctuation
characters in a row; no need to match on anything more

Also reduce the text matched around the section regex;
with gibberish it would cause lots of bad matches
and unknown citations
Added an `if not m: continue` check to `extract_tokens()`
to prevent processing invalid matches that fail when
re-running the regex on extracted text.

Previously, Hyperscan detected matches based on byte
offsets, but some of these did not align properly
when converted to Unicode string offsets. This caused
`.match(text[start:end])` to return `None`,
potentially leading to errors when calling
`get_token(m, offset=start)`.

Now, we explicitly skip such cases to ensure
only valid tokens are processed.
@flooie flooie changed the title Add fix for hyper scan tokenizer Add fix for hyperscan tokenizer Mar 4, 2025
@flooie flooie requested a review from grossir March 4, 2025 17:59
@mlissner (Member) commented Mar 4, 2025

Feels like the kind of PR that should really have a test?

flooie added 3 commits March 4, 2025 13:47
update all_whitespace to handle
other whitespace characters not captured
in \s

non-breaking space
em-space
en-space
thin space

Add test
github-actions bot commented Mar 4, 2025

The Eyecite Report 👁️

Gains and Losses

There were 95 gains and 0 losses.

| id | Gain | Loss |
| --- | --- | --- |
| 4627877 | §§ | |
| 4639543 | §§ | |
| 4678352 | §§ | |
| 4690516 | [§] | |
| 4709603 | §§ | |
| 4776531 | §§ | |
| 4799679 | §§ | |
| 5082333 | §§ | |
| 5167616 | §§ | |
| 5441809 | (§) | |
| 5562033 | §§3033, | |
| 5935198 | §§ | |
| 4536776 | §§ | |
| 6026339 | | |
| 6026449 | §§ | |
| 6052697 | §§ | |
| 6308108 | | |
| 6185261 | | |
| 3073392 | §§ | |
| 2014564 | | |
| 2060699 | §§ | |
| 6322259 | [§] | |
| 6322259 | §§ | |
| 2303811 | §§ | |
| 755368 | | |
| 3149869 | §§ | |
| 3149869 | | |
| 1744543 | §§ | |
| 2257892 | (§§ | |
| 2257892 | §§ | |
| 2257892 | | |
| 1897124 | §§ | |
| 1137818 | | |
| 1537257 | §§ | |
| 1546016 | "[§]52-249a | |
| 1546016 | §§ | |
| 1929026 | | |
| 2143980 | §§ | |
| 2042257 | §§ | |
| 2357843 | §§ | |
| 2496102 | §§ | |
| 2427861 | §§ | |
| 1431414 | §§ | |
| 1431414 | | |
| 2829354 | §15.50(a). | |
| 2829354 | §§ | |
| 1613 | Ҥ | |
| 775078 | violate§§ | |
| 775078 | §§ | |
| 1308185 | §§ | |
| 2803607 | §§ | |
| 6593646 | §§ | |

Time Chart

[time chart image]

Generated Files

Branch 1 Output
Branch 2 Output
Full Output CSV

@@ -78,7 +78,10 @@ def all_whitespace(text: str) -> str:
     Returns:
         Text with collapsed whitespace characters.
     """
-    return re.sub(r"\s+", " ", text)
+    WHITESPACE_REGEX = (
+        r"[ \t\n\r\f\v\u00A0\u2002\u2003\u2009\u200B\u202F\u205F]+"
Contributor commented:

The only character in the list that is not included in r"\s" is \u200b, so I would suggest doing

WHITESPACE_REGEX = r"[\u200b\s]+"

In [18]: [i for i in list("\t\n\r\f\v\u00A0\u2002\u2003\u2009\u200B\u202F\u205F") if not i.isspace()]
Out[18]: ['\u200b']

or

In [26]: re.sub(r"\s+", "", "\t\n\r\f\v\u00A0\u2002\u2003\u2009\u200B\u202F\u205F")
Out[26]: '\u200b'

This:

  • to be clearer;
  • because maybe "\s" covers more characters than the ones you are listing.
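
A quick check of that suggestion (a sketch; the `WHITESPACE_REGEX` name is taken from the diff above):

```python
import re

# \s already matches \u00A0, \u2002, \u2003, \u2009, \u202F and \u205F;
# only the zero-width space \u200b has to be added explicitly.
WHITESPACE_REGEX = r"[\u200b\s]+"

print(re.sub(WHITESPACE_REGEX, " ", "a\u00a0b\u200bc"))  # "a b c"
```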

@@ -466,6 +466,9 @@ def on_match(index, start, end, flags, context):
     start = byte_to_str_offset[start]
     end = byte_to_str_offset[end]
     m = extractor.compiled_regex.match(text[start:end])
+    if not m:
+        # skip if re-run regex fails to detect match
+        continue
Contributor commented:

This is not expected to be common, right?
Why don't we put a logger.error here, so we can analyze what's going on in the edge cases?
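
A sketch of what that could look like (the helper name and log message are hypothetical, not from the PR):

```python
import logging
import re
from typing import Optional

logger = logging.getLogger(__name__)


def rematch_or_log(regex: re.Pattern, text: str, start: int, end: int) -> Optional[re.Match]:
    """Re-run the compiled regex on the hyperscan span and log the
    edge case instead of skipping it silently."""
    m = regex.match(text[start:end])
    if not m:
        logger.error("hyperscan span did not re-match: %r", text[start:end])
    return m
```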

@@ -630,6 +630,12 @@ def test_find_citations(self):
         metadata={'plaintiff': 'Commonwealth', 'defendant': 'Muniz',
                   'court': 'pa'})]),
     ('Foo v. Bar, 1 F.Supp. 1 (SC 1967)', [case_citation(volume='1', reporter='F.Supp.', year=1967, page='1', metadata={'plaintiff': 'Foo', 'defendant': 'Bar', 'court': 'sc'})]),
+    ('Shady Grove Farms \xa0v Goldsmith Seeds 1 U.S. 1 (1981)', [
Contributor commented:

This is an example for the whitespace change; however, this case was already covered by the plain r"\s" regex, as pointed out in the other comment. I think you should try to cover the error cases we have seen. For example, this would still be broken:

In [32]: clean_text(' \x08*\x07\x07\u038bþİ\u038b\u202cڋ\u202a-\x14V\u202c\u202c', ["all_whitespace"])
Out[32]: ' \x08*\x07\x07\u038bþİ\u038b\u202cڋ\u202a-\x14V\u202c\u202c'
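
One way to cover such cases (a sketch, not part of the PR): strip control (Cc) and format (Cf) characters, which is what \x08, \x14, \u202a and \u202c are:

```python
import unicodedata


def strip_invisible(text: str) -> str:
    r"""Hypothetical cleanup step: drop control (Cc) and format (Cf)
    characters such as \x08, \x14, \u202a and \u202c, none of which
    are matched by \s."""
    return "".join(
        ch for ch in text if unicodedata.category(ch) not in ("Cc", "Cf")
    )
```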

@@ -52,7 +52,7 @@ def short_cite_re(regex):
 
 # Regex to match punctuation around volume numbers and stopwords.
 # This could potentially be more precise.
-PUNCTUATION_REGEX = r"[^\sa-zA-Z0-9]*"
+PUNCTUATION_REGEX = r"[^\sa-zA-Z0-9]{,3}"
Contributor commented:

Why make these changes? Do you have examples or comments on why?
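
For context, the PR description says the bound is meant to stop the regex from matching indefinitely on gibberish; a minimal sketch of the difference (input string assumed):

```python
import re

old = re.compile(r"[^\sa-zA-Z0-9]*")     # unbounded: swallows any run
new = re.compile(r"[^\sa-zA-Z0-9]{,3}")  # bounded: at most 3 characters

junk = "!@#$%^&*()" * 5
print(len(old.match(junk).group()))  # 50
print(len(new.match(junk).group()))  # 3
```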

@@ -79,7 +79,7 @@ def short_cite_re(regex):
 )
 
 # Regex for SectionToken
-SECTION_REGEX = r"(\S*§\S*)"
+SECTION_REGEX = space_boundaries_re(r"([\w\.\,\-]*§[\w\.\,\-]*)")
Contributor commented:

The backslash escapes are not needed inside [].
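
A quick check of the equivalence (example string assumed):

```python
import re

s = "see § 12-345, and §§6.7"
# ".", "," and a trailing "-" need no backslash inside a character class.
assert (re.findall(r"[\w\.\,\-]*§[\w\.\,\-]*", s)
        == re.findall(r"[\w.,-]*§[\w.,-]*", s))
```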

Contributor commented:

From the benchmark, it seems we are losing these?

  • §15.50(a)
  • [§]52-249a

@flooie flooie self-assigned this Mar 5, 2025