Citation Lookup API fails: AttributeError: 'NoneType' object has no attribute 'span' #234
Comments
Eduardo, can you analyze this, please, to see how big a deal it is?
I take that back -
Ah ha. Let's put it on your backlog then, instead of Eduardo's. How big an impact do you think this has?
I did a quick look and it looks eyecite-related. I'm not sure it's related to what I just added yesterday, but as it's eyecite I think we should investigate.
Just to complete the thought, I noticed that the citation lookup yesterday didn't convert the reporter to the corrected reporter, which is necessary to do a proper citation lookup.
This looks like a bug in eyecite, potentially due to a reporters-db regex pattern or an old-timey citation. However, the stack trace doesn't include the citation being processed, making it unclear how to fix it. We need a way to capture the failing input for further debugging. In the meantime, I am going to move this issue to eyecite. I've tested a few variants, including nominative reporters with non-standard volumes, but nothing has replicated the bug yet.
Thanks, Bill. It sounds like since it's an eyecite bug, that's your domain, but should we also open a bug for the API not looking things up properly?
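Since the stack trace doesn't include the citation being processed, one stdlib-only way to capture the failing input is a wrapper that logs a snippet of the text before re-raising. This is just a sketch; `capture_failing_input` is a hypothetical name, not existing CourtListener or eyecite code:

```python
import functools
import logging

logger = logging.getLogger(__name__)


def capture_failing_input(func):
    """Decorator sketch: on any exception, log a repr() snippet of the
    first positional argument (the text) so the failing input can be
    reproduced later, then re-raise the original error unchanged."""
    @functools.wraps(func)
    def wrapper(text, *args, **kwargs):
        try:
            return func(text, *args, **kwargs)
        except Exception:
            logger.exception(
                "citation extraction failed; input snippet: %r", text[:200]
            )
            raise
    return wrapper
```

Applied to `get_citations` (or the view that calls it), this would surface the offending document text in Sentry alongside the traceback.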
From this related Sentry issue I got a reproducible example. Seems to be a Hyperscan error due to a corrupted document. Will look for more examples; but maybe the user is introducing some strange characters?

```python
from eyecite import get_citations, clean_text
from eyecite.tokenizers import HyperscanTokenizer
import requests

HYPERSCAN_TOKENIZER = HyperscanTokenizer(cache_dir=".hyperscan")

r = requests.get(
    "https://www.courtlistener.com/api/rest/v4/recap-documents/429621284",
    headers={"Authorization": f"Token {token}"},
)
document = r.json()
text = document['plain_text']
cleaned_text = clean_text(text, ["all_whitespace"])

# this fails with AttributeError: 'NoneType' object has no attribute 'span'
citations = get_citations(cleaned_text, tokenizer=HYPERSCAN_TOKENIZER)

# these don't fail
citations = get_citations(cleaned_text)
citations = get_citations(cleaned_text[:1128312], tokenizer=HYPERSCAN_TOKENIZER)

# the document's text after the failing index has a bunch of binary-like characters.
# if you fish into the exception using %pdb, you can get the offset where this
# is failing: it's 1128312
```

```
In [34]: cleaned_text[1128312:1128312+300]
Out[34]: ' \x08*\x07\x07\u038bþİ\u038b\u202cڋ\u202a-\x14V\u202c\u202c \u202bڋ\u202a-*%\x0f\x10\x04\x05%\x08V\u202cڋ\x06\x0e\x10\u202a\x17\x08%\u202cڋ\u202a,\x04\x053\u202cڋ *\u202a\x17"0.\u202cڋ \x04\x10\x05\x01) \u202a 1*\x07\x07\u038bhǦ\u038b\u202cڋ\u202a- V\u202c\u202c \u202bڋ\u202a4\x08\x01 \x0f\x10\x04\x05V\u202cڋ\x01\x08\x14\u202a-(%?%\u202cڋ\x06\x0e\x10\u202a\x17\x08%\u202cڋ\u202a,\x04\x053\u202cڋ *\u202a\x17"0A\u202cڋ \x04\x10\x05\x01) \u202a 1*\x07\x07\u038bşİ\u038b\u202cڋ\u202a-3V\u202c\u202c \u202bڋ\x06\x0e\x10\u202a\x17\x08%\u202cڋ\u202a,\x04\x053\u202cڋ *\u202a\x17"0A\u202cڋ \x04\x10\x05\x18\u202a \x08*\x07\x07\u038b\x8fĚ\u038b Ȉ\u202cڋ\u202a-\x01V\u202c\u202c \u202bڋ\u202a- \x06\x06\x18 \x013V\u202cڋ\x06\x0e\x10\u202a\x17\x08%\u202cڋ\u202a,\x04\x053\u202cڋ \u202a\x17"0AH\u202cڋ \x04\x10\x05\x18\u202a \x08*\x07\x07\u038bwİ\u038b Ȉ\u202cڋ\u202a-JV\u202c\u202c \u202bڋ\u202a4\x08\x01 \x0f\x10\x04\x05V\u202cڋ\x01\x08\x14\u202a-(%?'
```
Interesting. Could be somebody looking for vulnerabilities by sending us weird stuff. I guess if this only happens with wacky input like this, it'd be nice to put in a little fix if that's possible, but if it's only with bad input and fixing it is hard, maybe we just ignore it completely.
The real issue is that we are failing ourselves by allowing unprintable characters to get combined into a citation in the first place.
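One stdlib-only way to keep unprintable characters out of citation text would be to drop Unicode control (`Cc`) and format (`Cf`) characters before tokenizing. A minimal sketch, not an existing eyecite cleaner; the choice of categories is an assumption based on the corrupted snippet above (C0 controls plus U+202A-U+202C bidi marks, which are `Cf`):

```python
import unicodedata


def strip_unprintable(text: str) -> str:
    """Sketch: remove control (Cc) and format (Cf) characters,
    keeping ordinary whitespace like newlines and tabs."""
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
```

Run over the snippet at offset 1128312 above, this would remove most of the binary-looking noise before it ever reaches the tokenizer.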
Sentry Issue: COURTLISTENER-739 This one comes from RECAP documents; Bill took a look and found the weird characters came from the scanned parts: https://www.courtlistener.com/docket/68197600/1/united-states-v-cellular-telephone-assigned-number-414-629-4401/ All of them have scanned parts that have been extracted as weird characters.
Sentry Issue: COURTLISTENER-8YJ This one comes from a minimal example; it breaks the HyperscanTokenizer, but not the default one:

```python
get_citations("Shady Grove Farms \xa0v Goldsmith Seeds. 1981", tokenizer=HYPERSCAN_TOKENIZER)
```
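For this minimal case, one possible workaround (a sketch only, not verified against the tokenizer internals) is to map non-breaking spaces to plain spaces before tokenizing. The hypothesis that the multi-byte UTF-8 encoding of `\xa0` desyncs Hyperscan's byte offsets is an assumption, not something confirmed in this thread:

```python
def normalize_spaces(text: str) -> str:
    """Sketch: replace U+00A0 (non-breaking space) with a plain space.
    Assumption: Hyperscan matches on bytes, and a 2-byte whitespace
    character may throw off the match offsets eyecite maps back to
    the original string."""
    return text.replace("\xa0", " ")
```

With that applied, the failing string above becomes plain ASCII whitespace, which both tokenizers already handle.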
This PR should fix this |
Sentry Issue: COURTLISTENER-9AB
Six instances so far.
Filed by @mlissner