Add fix for hyperscan tokenizer #235

Open · wants to merge 8 commits into main
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -10,7 +10,7 @@ on:

jobs:
  build:
-    runs-on: ubuntu-20.04
+    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
5 changes: 3 additions & 2 deletions CHANGES.md
@@ -8,10 +8,11 @@ Features:
- None

Changes:
-- None
+- Modifies section regex and punctuation regex

Fixes:
-- None
+- Fixes GitHub action
+- Fixed an issue in `extract_tokens()` where regex re-run could fail on certain Hyperscan matches.


## Current
5 changes: 4 additions & 1 deletion eyecite/clean.py
@@ -78,7 +78,10 @@ def all_whitespace(text: str) -> str:
    Returns:
        Text with collapsed whitespace characters.
    """
-    return re.sub(r"\s+", " ", text)
+    WHITESPACE_REGEX = (
+        r"[ \t\n\r\f\v\u00A0\u2002\u2003\u2009\u200B\u202F\u205F]+"
+    )
+    return re.sub(WHITESPACE_REGEX, " ", text)

Contributor comment:
The only character in the list that is not included in r"\s" is \u200b, so I would suggest:

    WHITESPACE_REGEX = r"[\u200b\s]+"

In [18]: [i for i in list("\t\n\r\f\v\u00A0\u2002\u2003\u2009\u200B\u202F\u205F") if not i.isspace()]
Out[18]: ['\u200b']

or

In [26]: re.sub(r"\s+", "", "\t\n\r\f\v\u00A0\u2002\u2003\u2009\u200B\u202F\u205F")
Out[26]: '\u200b'

This would be clearer, and it covers the possibility that \s matches more characters than those listed.


def underscores(text: str) -> str:
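A quick sketch of the reviewer's point, runnable in plain Python; the sample string is invented for illustration. Every character in the PR's explicit list except \u200b is already matched by \s, so a pattern that just adds \u200b to \s behaves the same as the explicit list:

    import re

    # Sample mixing a non-breaking space (\u00A0, matched by \s) with a
    # zero-width space (\u200b, Unicode category Cf, NOT matched by \s).
    text = "1\u00A0U.S.\u200b 1"

    print(re.sub(r"\s+", " ", text))          # '1 U.S.\u200b 1' -- \u200b survives
    print(re.sub(r"[\u200b\s]+", " ", text))  # '1 U.S. 1'       -- fully collapsed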
4 changes: 2 additions & 2 deletions eyecite/regexes.py
@@ -52,7 +52,7 @@ def short_cite_re(regex):

# Regex to match punctuation around volume numbers and stopwords.
# This could potentially be more precise.
-PUNCTUATION_REGEX = r"[^\sa-zA-Z0-9]*"
+PUNCTUATION_REGEX = r"[^\sa-zA-Z0-9]{,3}"
Contributor comment:
Why make these changes? Do you have examples or comments on why?
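For reference, a minimal sketch of what the quantifier change does (the sample string is made up): * matches an unbounded run of punctuation, while {,3} caps the match at three characters:

    import re

    OLD = r"[^\sa-zA-Z0-9]*"     # unbounded punctuation run
    NEW = r"[^\sa-zA-Z0-9]{,3}"  # at most three punctuation characters

    sample = "!!!!!§ 123"
    print(re.match(OLD, sample).group())  # '!!!!!§'
    print(re.match(NEW, sample).group())  # '!!!'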


# Regex for IdToken
ID_REGEX = space_boundaries_re(r"id\.,?|ibid\.")
@@ -79,7 +79,7 @@ def short_cite_re(regex):
)

# Regex for SectionToken
-SECTION_REGEX = r"(\S*§\S*)"
+SECTION_REGEX = space_boundaries_re(r"([\w\.\,\-]*§[\w\.\,\-]*)")
Contributor comment:
Escaping with \ is not needed inside [].

Contributor comment:
From the benchmark, it seems we are losing these?

  • §15.50(a)
  • [§]52-249a
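A sketch of that regression using the bare patterns (note the PR additionally wraps the new pattern in space_boundaries_re, which makes the [§]52-249a case fail outright, since the § there is not whitespace-bounded):

    import re

    OLD_SECTION = r"(\S*§\S*)"
    NEW_CORE = r"([\w.,-]*§[\w.,-]*)"  # new pattern minus the space_boundaries_re wrapper

    for s in ["§15.50(a)", "[§]52-249a"]:
        print(s, "->", re.search(OLD_SECTION, s).group(), "vs", re.search(NEW_CORE, s).group())
    # §15.50(a)  -> §15.50(a)  vs §15.50  (parenthetical dropped)
    # [§]52-249a -> [§]52-249a vs §       (brackets stop the match)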


# Regex for ParagraphToken
PARAGRAPH_REGEX = r"(\n)"
3 changes: 3 additions & 0 deletions eyecite/tokenizers.py
@@ -466,6 +466,9 @@ def on_match(index, start, end, flags, context):
            start = byte_to_str_offset[start]
            end = byte_to_str_offset[end]
            m = extractor.compiled_regex.match(text[start:end])
+            if not m:
+                # skip if re-run regex fails to detect match
+                continue
            yield extractor.get_token(m, offset=start)

Contributor comment:
This is not expected to be common, right? Why don't we put a logger.error here, so we can analyze what's going on in the edge cases?

@property
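One shape the reviewer's logger.error suggestion could take, as a self-contained sketch; the function and logger names are illustrative, not eyecite's actual API:

    import logging
    import re

    logger = logging.getLogger(__name__)

    def rematch_spans(text, spans, compiled_regex):
        """Re-run Python's re over candidate (start, end) spans reported by
        Hyperscan, logging the rare spans where the two engines disagree."""
        for start, end in spans:
            m = compiled_regex.match(text[start:end])
            if not m:
                # Should be uncommon; log it so the edge cases can be analyzed.
                logger.error("Hyperscan span %r failed re-match against %s",
                             text[start:end], compiled_regex.pattern)
                continue
            yield m, start

    # Illustrative usage with a made-up pattern and span:
    list(rematch_spans("1 U.S. 2", [(2, 6)], re.compile(r"\d+ U\.S\.")))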
6 changes: 6 additions & 0 deletions tests/test_FindTest.py
@@ -630,6 +630,12 @@ def test_find_citations(self):
            metadata={'plaintiff': 'Commonwealth', 'defendant': 'Muniz',
                      'court': 'pa'})]),
        ('Foo v. Bar, 1 F.Supp. 1 (SC 1967)', [case_citation(volume='1', reporter='F.Supp.', year=1967, page='1', metadata={'plaintiff': 'Foo', 'defendant': 'Bar', 'court': 'sc'})]),
+        ('Shady Grove Farms \xa0v Goldsmith Seeds 1 U.S. 1 (1981)', [
+            case_citation(year=1981,
+                          metadata={'defendant': 'Goldsmith Seeds',
+                                    'plaintiff': 'Farms',
+                                    'court': 'scotus'})],
+            {'clean': ['all_whitespace']}),
        )
        # fmt: on
        self.run_test_pairs(test_pairs, "Citation extraction")

Contributor comment:
This is an example for the whitespace change; however, this was already covered by the simple r"\s" regex, as pointed out in the other comment. I think you should try to cover the error cases we have seen. For example, this would still be broken:

In [32]: clean_text(' \x08*\x07\x07\u038bþİ\u038b\u202cڋ\u202a-\x14V\u202c\u202c', ["all_whitespace"])
Out[32]: ' \x08*\x07\x07\u038bþİ\u038b\u202cڋ\u202a-\x14V\u202c\u202c'
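The reviewer's still-broken example is made of control (Cc) and bidi format (Cf) characters, which no whitespace character class will catch. A sketch of a category-based cleaner that would cover them; whether eyecite should treat Cc/Cf as whitespace is a design question, not something this PR decides:

    import unicodedata

    def collapse_exotic_whitespace(text: str) -> str:
        """Collapse runs of whitespace plus Unicode control (Cc) and
        format (Cf) characters into a single space."""
        out, in_run = [], False
        for ch in text:
            if ch.isspace() or unicodedata.category(ch) in ("Cc", "Cf"):
                if not in_run:
                    out.append(" ")
                in_run = True
            else:
                out.append(ch)
                in_run = False
        return "".join(out)

    # The reviewer's failing input: \x07, \x08, \x14 are Cc; \u202a/\u202c are Cf.
    print(collapse_exotic_whitespace(' \x08*\x07\x07\u038bþİ\u038b\u202cڋ\u202a-\x14V\u202c\u202c'))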