Open
Labels
eventually: This is something we'll do eventually, but we don't know when, and it isn't on a critical path.
Description
In #4 and FreeAndFair/TuskMobileVoting#60, we discussed how rough the current output of the NLP tool is: the raw output includes pieces of LaTeX equations, footnote markers, etc. I addressed this manually in #4 by running the combined histograms through an LLM with some manual cleanup stages ("eliminate everything that starts with a symbol", "eliminate everything that doesn't have at least one word in it", etc.). For the verb phrases, I also had it coalesce phrases with the same primary verb. For the future, we should consider some extensions to the NLP tool to:
- automatically do the kind of cleanup I did manually, either via an LLM API or programmatically where that is straightforward
- perform better OCR on PDF files to ensure that odd kerning and LaTeX artifacts don't cause misreadings (this is much harder than it sounds, and is likely far too much effort for us to attempt any time soon)
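As a starting point for the programmatic half of the first item, here is a rough sketch of the two cleanup rules quoted above plus a naive version of the verb-phrase coalescing. It assumes the histograms are plain phrase-to-count mappings, that a "word" means at least two consecutive ASCII letters, and that the primary verb is simply the first token of a phrase (no lemmatization). The function names are hypothetical and are not part of the existing NLP tool.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of two or more ASCII letters.
WORD_RE = re.compile(r"[A-Za-z]{2,}")


def is_clean(phrase: str) -> bool:
    """Apply the two manual rules: reject phrases that start with a
    symbol, and reject phrases that contain no word at all."""
    stripped = phrase.strip()
    if not stripped:
        return False
    if not stripped[0].isalnum():  # starts with a symbol, e.g. "\\frac" or "^1"
        return False
    return WORD_RE.search(stripped) is not None


def clean_histogram(hist):
    """Drop noisy phrases; merge entries that differ only in case or spacing."""
    cleaned = Counter()
    for phrase, count in hist.items():
        if is_clean(phrase):
            normalized = " ".join(phrase.split()).lower()
            cleaned[normalized] += count
    return cleaned


def coalesce_by_first_word(hist):
    """Naive coalescing: treat the first token as the primary verb.
    A real implementation would likely need lemmatization or a parser."""
    grouped = Counter()
    for phrase, count in hist.items():
        tokens = phrase.split()
        if tokens:
            grouped[tokens[0]] += count
    return grouped


if __name__ == "__main__":
    raw = {
        "verify the ballot": 3,
        "Verify  the ballot": 2,
        "\\frac{a}{b}": 5,
        "^1": 4,
        "12": 1,
        "verify signatures": 1,
        "cast a vote": 2,
    }
    cleaned = clean_histogram(raw)
    print(cleaned)
    print(coalesce_by_first_word(cleaned))
```

Anything these rules cannot decide (for example, inflected verbs such as "verifies" vs. "verify") would still need the LLM pass or a proper lemmatizer.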