Citation Lookup API fails: AttributeError: 'NoneType' object has no attribute 'span' #234
Comments
Eduardo, can you analyze this, please, to see how big a deal it is?
I take that back -
Ah ha. Let's put it on your backlog then, instead of Eduardo's. How big an impact do you think this has?
I did a quick look and it looks eyecite-related. I'm not sure it's related to what I just added yesterday, but as it's eyecite I think we should investigate.
Just to complete the thought, I noticed that the citation lookup yesterday didn't convert the reporter to the corrected reporter, which is necessary to do a proper citation lookup.
This looks like a bug in eyecite, potentially due to a reporters-db regex pattern or an old-timey citation. However, the stack trace doesn't include the citation being processed, making it unclear how to fix it. We need a way to capture the failing input for further debugging. In the meantime, I am going to move this issue to eyecite. I've tested a few variants, including nominative reporters with non-standard volumes, but nothing has replicated the bug yet.
Thanks, Bill. It sounds like since it's an eyecite bug, that's your domain, but should we also open a bug for the API not looking things up properly?
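Since the stack trace doesn't include the citation being processed, one stdlib-only way to capture the failing input is a wrapper that logs a snippet of the text before re-raising. This is just a sketch; `capture_failing_input` is a hypothetical name, not existing CourtListener or eyecite code:

```python
import functools
import logging

logger = logging.getLogger(__name__)


def capture_failing_input(func):
    """Decorator sketch: on any exception, log a repr() snippet of the
    first positional argument (the text) so the failing input can be
    reproduced later, then re-raise the original error unchanged."""
    @functools.wraps(func)
    def wrapper(text, *args, **kwargs):
        try:
            return func(text, *args, **kwargs)
        except Exception:
            logger.exception(
                "citation extraction failed; input snippet: %r", text[:200]
            )
            raise
    return wrapper
```

Applied to `get_citations` (or the view that calls it), this would surface the offending document text in Sentry alongside the traceback.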
From this related Sentry issue I got a reproducible example. Seems to be a Hyperscan error due to a corrupted document. Will look for more examples; but maybe the user is introducing some strange characters?

```python
from eyecite import get_citations, clean_text
from eyecite.tokenizers import HyperscanTokenizer
import requests

HYPERSCAN_TOKENIZER = HyperscanTokenizer(cache_dir=".hyperscan")

r = requests.get(
    "https://www.courtlistener.com/api/rest/v4/recap-documents/429621284",
    headers={"Authorization": f"Token {token}"},
)
document = r.json()
text = document['plain_text']
cleaned_text = clean_text(text, ["all_whitespace"])

# this fails with AttributeError: 'NoneType' object has no attribute 'span'
citations = get_citations(cleaned_text, tokenizer=HYPERSCAN_TOKENIZER)

# these don't fail
citations = get_citations(cleaned_text)
citations = get_citations(cleaned_text[:1128312], tokenizer=HYPERSCAN_TOKENIZER)

# the document's text after the failing index has a bunch of binary-like characters.
# if you fish into the exception using %pdb, you can get the offset where this
# is failing: it's 1128312
```

```
In [34]: cleaned_text[1128312:1128312+300]
Out[34]: ' \x08*\x07\x07\u038bþİ\u038b\u202cڋ\u202a-\x14V\u202c\u202c \u202bڋ\u202a-*%\x0f\x10\x04\x05%\x08V\u202cڋ\x06\x0e\x10\u202a\x17\x08%\u202cڋ\u202a,\x04\x053\u202cڋ *\u202a\x17"0.\u202cڋ \x04\x10\x05\x01) \u202a 1*\x07\x07\u038bhǦ\u038b\u202cڋ\u202a- V\u202c\u202c \u202bڋ\u202a4\x08\x01 \x0f\x10\x04\x05V\u202cڋ\x01\x08\x14\u202a-(%?%\u202cڋ\x06\x0e\x10\u202a\x17\x08%\u202cڋ\u202a,\x04\x053\u202cڋ *\u202a\x17"0A\u202cڋ \x04\x10\x05\x01) \u202a 1*\x07\x07\u038bşİ\u038b\u202cڋ\u202a-3V\u202c\u202c \u202bڋ\x06\x0e\x10\u202a\x17\x08%\u202cڋ\u202a,\x04\x053\u202cڋ *\u202a\x17"0A\u202cڋ \x04\x10\x05\x18\u202a \x08*\x07\x07\u038b\x8fĚ\u038b Ȉ\u202cڋ\u202a-\x01V\u202c\u202c \u202bڋ\u202a- \x06\x06\x18 \x013V\u202cڋ\x06\x0e\x10\u202a\x17\x08%\u202cڋ\u202a,\x04\x053\u202cڋ \u202a\x17"0AH\u202cڋ \x04\x10\x05\x18\u202a \x08*\x07\x07\u038bwİ\u038b Ȉ\u202cڋ\u202a-JV\u202c\u202c \u202bڋ\u202a4\x08\x01 \x0f\x10\x04\x05V\u202cڋ\x01\x08\x14\u202a-(%?'
```
Interesting. Could be somebody looking for vulnerabilities by sending us weird stuff. I guess if this only happens with wacky input like this, it'd be nice to put in a little fix if that's possible, but if it's only with bad input and fixing it is hard, maybe we just ignore it completely.
The real issue is that we are failing ourselves by allowing unprintable characters to get combined into a citation in the first place.
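One stdlib-only way to keep unprintable characters out of citation text would be to drop Unicode control (`Cc`) and format (`Cf`) characters before tokenizing. A minimal sketch, not an existing eyecite cleaner; the choice of categories is an assumption based on the corrupted snippet above (C0 controls plus U+202A-U+202C bidi marks, which are `Cf`):

```python
import unicodedata


def strip_unprintable(text: str) -> str:
    """Sketch: remove control (Cc) and format (Cf) characters,
    keeping ordinary whitespace like newlines and tabs."""
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
```

Run over the snippet at offset 1128312 above, this would remove most of the binary-looking noise before it ever reaches the tokenizer.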
Sentry Issue: COURTLISTENER-739 This one comes from RECAP documents; Bill took a look and found the weird characters came from the scanned parts: https://www.courtlistener.com/docket/68197600/1/united-states-v-cellular-telephone-assigned-number-414-629-4401/ All of them have scanned parts that have been extracted as weird characters.
Sentry Issue: COURTLISTENER-8YJ This one comes from a minimal example; it breaks the HyperscanTokenizer, but not the default one:

```python
get_citations("Shady Grove Farms \xa0v Goldsmith Seeds. 1981", tokenizer=HYPERSCAN_TOKENIZER)
```
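For this minimal case, one possible workaround (a sketch only, not verified against the tokenizer internals) is to map non-breaking spaces to plain spaces before tokenizing. The hypothesis that the multi-byte UTF-8 encoding of `\xa0` desyncs Hyperscan's byte offsets is an assumption, not something confirmed in this thread:

```python
def normalize_spaces(text: str) -> str:
    """Sketch: replace U+00A0 (non-breaking space) with a plain space.
    Assumption: Hyperscan matches on bytes, and a 2-byte whitespace
    character may throw off the match offsets eyecite maps back to
    the original string."""
    return text.replace("\xa0", " ")
```

With that applied, the failing string above becomes plain ASCII whitespace, which both tokenizers already handle.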
This PR should fix this |
Sentry Issue: COURTLISTENER-9AB
Six instances so far.
Filed by @mlissner