Citation Lookup API fails: AttributeError: 'NoneType' object has no attribute 'span' #234

Open
sentry-io bot opened this issue Feb 27, 2025 · 13 comments
@sentry-io

sentry-io bot commented Feb 27, 2025

Sentry Issue: COURTLISTENER-9AB

AttributeError: 'NoneType' object has no attribute 'span'
(15 additional frame(s) were not displayed)
...
  File "cl/api/utils.py", line 350, in initial
    super().initial(request, *args, **kwargs)
  File "cl/api/utils.py", line 599, in allow_request
    self.throttle_request_by_citation_count(request, view)
  File "cl/api/utils.py", line 569, in throttle_request_by_citation_count
    self.save_citation_count(request, view)
  File "cl/api/utils.py", line 576, in save_citation_count
    citation_count = self.get_citation_count_from_request(request, view)
  File "cl/api/utils.py", line 521, in get_citation_count_from_request
    eyecite.get_citations(text, tokenizer=HYPERSCAN_TOKENIZER)

Six instances so far.

Filed by @mlissner

@mlissner
Member

Eduardo, can you analyze this, please, to see how big a deal it is?

@flooie
Contributor

flooie commented Feb 27, 2025

I take that back -

@mlissner
Member

Ah ha. Let's put it on your backlog then, instead of Eduardo's. How big an impact do you think this has?

@flooie
Contributor

flooie commented Feb 27, 2025

I took a quick look and it looks eyecite-related. I'm not sure it's related to what I added yesterday, but since it's eyecite I think we should investigate.

@flooie
Contributor

flooie commented Feb 27, 2025

Just to complete the thought: I noticed yesterday that the citation lookup didn't convert the reporter to the corrected reporter, which is necessary for a proper citation lookup.

@flooie
Contributor

flooie commented Feb 27, 2025

This looks like a bug in eyecite, potentially due to a reporters-db regex pattern or an old-timey citation. However, the stack trace doesn't include the citation being processed, so it is unclear how to fix it. We need a way to capture the failing input for further debugging.

In the meantime, I am going to move this issue to eyecite.

I've tested a few variants, including volume nominatives and non-standard volumes, but nothing has replicated the bug yet.
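
A minimal sketch of one way to capture the failing input, under stated assumptions: wrap the extraction call and, on failure, log an escaped preview of the text before re-raising. The names `run_and_capture`, `extract`, and the logger name are hypothetical; in CourtListener, `extract` would stand in for the `eyecite.get_citations` call shown in the traceback.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("citation_debug")  # hypothetical logger name


def run_and_capture(extract: Callable[[str], List], text: str, preview: int = 200) -> List:
    """Call extract(text); if it raises, log an escaped preview of the input and re-raise."""
    try:
        return extract(text)
    except Exception:
        # ascii() escapes control and non-ASCII characters so they stay visible in logs
        logger.exception(
            "citation extraction failed; length=%d preview=%s",
            len(text),
            ascii(text[:preview]),
        )
        raise
```

For long documents a short preview may not reach the offset where the failure occurs, so persisting the whole input somewhere retrievable may be the more useful variant.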

@flooie flooie transferred this issue from freelawproject/courtlistener Feb 27, 2025
@mlissner
Member

Thanks Bill. It sounds like since it's an eyecite bug, that's your domain, but should we also open a bug for the API not looking things up properly?

@grossir
Contributor

grossir commented Feb 28, 2025

From this related Sentry issue I got a reproducible example. Seems to be a Hyperscan error due to a corrupted document. Will look for more examples; but maybe the user is introducing some strange characters?

import requests
from eyecite import clean_text, get_citations
from eyecite.tokenizers import HyperscanTokenizer

HYPERSCAN_TOKENIZER = HyperscanTokenizer(cache_dir=".hyperscan")

# `token` is a CourtListener API token; it must be defined before running this
r = requests.get(
    "https://www.courtlistener.com/api/rest/v4/recap-documents/429621284",
    headers={"Authorization": f"Token {token}"},
)
document = r.json()
text = document["plain_text"]
cleaned_text = clean_text(text, ["all_whitespace"])

# this fails with AttributeError: 'NoneType' object has no attribute 'span'
citations = get_citations(cleaned_text, tokenizer=HYPERSCAN_TOKENIZER)

# these don't fail
citations = get_citations(cleaned_text)
citations = get_citations(cleaned_text[:1128312], tokenizer=HYPERSCAN_TOKENIZER)

# the document's text after the failing index has a bunch of binary-like characters?
# if you fish into the exception using %pdb, you can get the character offset where this is failing
# it's 1128312
In [34]: cleaned_text[1128312:1128312+300]
Out[34]: ' \x08*\x07\x07\u038bþİ\u038b\u202cڋ\u202a-\x14V\u202c\u202c \u202bڋ\u202a-*%\x0f\x10\x04\x05%\x08V\u202cڋ\x06\x0e\x10\u202a\x17\x08%\u202cڋ\u202a,\x04\x053\u202cڋ *\u202a\x17"0.\u202cڋ \x04\x10\x05\x01) \u202a 1*\x07\x07\u038b\u038b\u202cڋ\u202a- V\u202c\u202c \u202bڋ\u202a4\x08\x01 \x0f\x10\x04\x05V\u202cڋ\x01\x08\x14\u202a-(%?%\u202cڋ\x06\x0e\x10\u202a\x17\x08%\u202cڋ\u202a,\x04\x053\u202cڋ *\u202a\x17"0A\u202cڋ \x04\x10\x05\x01) \u202a 1*\x07\x07\u038bşİ\u038b\u202cڋ\u202a-3V\u202c\u202c \u202bڋ\x06\x0e\x10\u202a\x17\x08%\u202cڋ\u202a,\x04\x053\u202cڋ *\u202a\x17"0A\u202cڋ \x04\x10\x05\x18\u202a \x08*\x07\x07\u038b\x8fĚ\u038b Ȉ\u202cڋ\u202a-\x01V\u202c\u202c \u202bڋ\u202a- \x06\x06\x18 \x013V\u202cڋ\x06\x0e\x10\u202a\x17\x08%\u202cڋ\u202a,\x04\x053\u202cڋ \u202a\x17"0AH\u202cڋ \x04\x10\x05\x18\u202a \x08*\x07\x07\u038b\u038b Ȉ\u202cڋ\u202a-JV\u202c\u202c \u202bڋ\u202a4\x08\x01 \x0f\x10\x04\x05V\u202cڋ\x01\x08\x14\u202a-(%?'
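
To look for similar documents without stepping through the debugger, a sketch like the following could locate the first "binary-like" character in a text. It assumes that "binary-like" means Unicode control (`Cc`) or format (`Cf`) characters other than ordinary whitespace; the helper names are hypothetical, and the offset it returns will not necessarily match the offset at which the tokenizer fails.

```python
import unicodedata
from typing import Optional

# Whitespace control characters that are normal in extracted text
ALLOWED_CONTROLS = {"\n", "\r", "\t"}


def is_suspicious(ch: str) -> bool:
    """True for control (Cc) and format (Cf) characters other than common whitespace."""
    if ch in ALLOWED_CONTROLS:
        return False
    return unicodedata.category(ch) in {"Cc", "Cf"}


def first_suspicious_offset(text: str) -> Optional[int]:
    """Return the index of the first suspicious character, or None if there is none."""
    for i, ch in enumerate(text):
        if is_suspicious(ch):
            return i
    return None
```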

@mlissner
Member

Interesting. Could be somebody looking for vulnerabilities by sending us weird stuff. I guess if this only happens with wacky code like this, it'd be nice to put in a little fix if that's possible, but if it's only with bad input and fixing it is hard, maybe we just ignore it completely.

@flooie
Contributor

flooie commented Feb 28, 2025

The real issue is that we are failing ourselves by allowing unprintable characters to get combined into a citation in the first place.
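
One possible shape for keeping unprintable characters out of the text, offered as a sketch and not as what eyecite does: replace them with spaces before tokenizing, so that the string length and every character offset stay the same. The function name `blank_unprintable` is hypothetical, the choice of Unicode categories is an assumption, and whether this avoids the Hyperscan failure is untested here.

```python
import unicodedata

# Common whitespace controls are kept as-is
KEEP = {"\n", "\r", "\t"}
# Control, format, and unassigned code points (an assumed definition of "unprintable")
UNPRINTABLE_CATEGORIES = {"Cc", "Cf", "Cn"}


def blank_unprintable(text: str) -> str:
    """Replace unprintable characters with spaces, leaving the length unchanged."""
    return "".join(
        ch if ch in KEEP or unicodedata.category(ch) not in UNPRINTABLE_CATEGORIES else " "
        for ch in text
    )
```

Replacing instead of deleting means citation spans computed on the cleaned text still point at the same positions in the original text.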

Author

sentry-io bot commented Feb 28, 2025

Sentry Issue: COURTLISTENER-739

This one comes from RecapDocuments; Bill took a look and found that the weird characters came from the scanned parts.

https://www.courtlistener.com/docket/68197600/1/united-states-v-cellular-telephone-assigned-number-414-629-4401/
https://www.courtlistener.com/docket/4328332/10595/39/in-re-terrorist-attacks-on-september-11-2001/

All of them have scanned parts that have been extracted as weird characters.

[
# recap document id, offset
(424646788, 1368218),
(426392057, 12402),
(384413229, 10782)
]

Author

sentry-io bot commented Feb 28, 2025

Sentry Issue: COURTLISTENER-8YJ

This one comes from a minimal example; it breaks the HyperscanTokenizer, but not the default one

get_citations("Shady Grove Farms \xa0v Goldsmith Seeds. 1981", tokenizer=HYPERSCAN_TOKENIZER)
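
Since the default tokenizer handles this input, a stopgap could be to retry with it when the Hyperscan path raises. The sketch below is an assumption about how a caller might do that, not part of eyecite or of the fix in the linked PR; `extract_with_fallback`, `primary`, and `fallback` are hypothetical names.

```python
from typing import Callable, List


def extract_with_fallback(
    text: str,
    primary: Callable[[str], List],
    fallback: Callable[[str], List],
) -> List:
    """Try the primary extractor; on AttributeError, retry with the fallback."""
    try:
        return primary(text)
    except AttributeError:
        # Same exception type as the one reported in this issue
        return fallback(text)
```

With eyecite, `primary` could be `lambda t: get_citations(t, tokenizer=HYPERSCAN_TOKENIZER)` and `fallback` could be plain `get_citations`, at the cost of slower extraction on the inputs that trigger the error.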

@flooie
Copy link
Contributor

flooie commented Mar 4, 2025

This PR should fix this

#235

@flooie flooie moved this from To Do to PR'd Issues 🤞 in Case Law Sprint Mar 4, 2025