Improve NLP Tool Output #8

@dmzimmerman

In #4 and FreeAndFair/TuskMobileVoting#60, we discussed the fact that the current output of the NLP tool is pretty rough: the raw output includes things like pieces of LaTeX equations and footnote markers. I addressed this manually in #4 by running the combined histograms through an LLM with some manual cleanup stages ("eliminate everything that starts with a symbol", "eliminate everything that doesn't have at least one word in it", etc.); for the verb phrases, I also had it coalesce phrases with the same primary verb. For the future, we should consider some extensions to the NLP tool to:

  • automatically do the kind of cleanup I did manually, either via an LLM API or programmatically where that is straightforward
  • perform better OCR on PDF files to ensure that odd kerning and LaTeX artifacts don't cause misreadings (this is much harder than it sounds, and is likely far too much effort for us to attempt any time soon)
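
For the "programmatically where that is straightforward" part, here is a rough sketch of what the two quoted cleanup rules and the verb-phrase coalescing might look like. It is not the tool's actual code. It assumes the histograms are plain phrase-to-count mappings, that a "word" is a run of two or more ASCII letters, that anything not starting with an ASCII letter or digit counts as "starting with a symbol", and that the primary verb is simply the first token of a verb phrase. The names `clean_histogram` and `coalesce_by_verb` are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Assumption: a "word" is a run of two or more ASCII letters.
WORD_RE = re.compile(r"[A-Za-z]{2,}")


def clean_histogram(hist):
    """Drop entries that start with a symbol or contain no word."""
    cleaned = Counter()
    for phrase, count in hist.items():
        text = " ".join(phrase.split())  # collapse runs of whitespace
        # "eliminate everything that starts with a symbol"
        # Assumption: any leading non-ASCII character is treated as a symbol,
        # which would also drop legitimate non-English words.
        if not text or not (text[0].isascii() and text[0].isalnum()):
            continue
        # "eliminate everything that doesn't have at least one word in it"
        if not WORD_RE.search(text):
            continue
        cleaned[text.lower()] += count  # merge case/spacing variants
    return cleaned


def coalesce_by_verb(hist):
    """Group verb phrases under their primary verb.

    Assumption: the primary verb is the first token of each phrase.
    """
    grouped = defaultdict(Counter)
    for phrase, count in hist.items():
        verb = phrase.split()[0]
        grouped[verb][phrase] += count
    return {verb: dict(phrases) for verb, phrases in grouped.items()}
```

Running `coalesce_by_verb(clean_histogram(raw))` on a raw verb-phrase histogram would give one bucket per leading verb. Anything that needs real judgment (for example, deciding that two different verbs mean the same thing) is probably still a job for the LLM step.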


Labels: eventually (This is something we'll do eventually, but we don't know when, and it isn't on a critical path.)
