trafilatura-1.11.0

@adbar released this 27 Jun 14:04
60647e5

Breaking change:

  • metadata is now skipped by default (#613); to include it in all output formats, use:
    • with_metadata=True (Python)
    • --with-metadata (CLI)
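
A minimal Python sketch of the opt-in described above, assuming trafilatura 1.11.0 or later is installed. The sample page and the helper name `extract_page` are hypothetical and only serve as illustration; the `with_metadata` argument is the one named in this release.

```python
try:
    from trafilatura import extract  # third-party: pip install trafilatura
except ImportError:
    extract = None  # lets the sketch load even without the library

# Hypothetical sample page, for illustration only.
SAMPLE_HTML = """<html><head><title>Sample title</title></head>
<body><article><h1>Sample title</h1>
<p>This is a sample paragraph that stands in for the main text of a page.</p>
</article></body></html>"""


def extract_page(html, include_metadata=False):
    """Return the extracted text, or None if nothing could be extracted
    (or if trafilatura is not installed)."""
    if extract is None:
        return None
    # Since 1.11.0 metadata is left out unless with_metadata=True is passed.
    return extract(html, with_metadata=include_metadata)


if __name__ == "__main__":
    print(extract_page(SAMPLE_HTML))                         # content only
    print(extract_page(SAMPLE_HTML, include_metadata=True))  # content plus metadata
```

Code that relied on metadata appearing in the output before this release would need the flag added explicitly.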

Extraction:

  • add HTML as output format (#614)
  • better and faster baseline extraction (#619)
  • better handling of HTML/XML elements (#628)
  • XPath rules added with @felipehertzer (#540)
  • fix: avoid faulty readability_lxml content (#635)
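
As a rough illustration of the new HTML output format (#614), the sketch below passes `output_format="html"` to `extract`. It assumes trafilatura 1.11.0 or later; the sample page and the helper name `extract_as_html` are hypothetical.

```python
try:
    from trafilatura import extract  # third-party: pip install trafilatura
except ImportError:
    extract = None  # lets the sketch load even without the library

# Hypothetical sample page, for illustration only.
SAMPLE_PAGE = """<html><body><article><h1>Sample heading</h1>
<p>This is a sample paragraph that stands in for the main text of a page.</p>
</article></body></html>"""


def extract_as_html(html):
    """Return the main content serialized as HTML, or None if nothing
    could be extracted (or if trafilatura is not installed)."""
    if extract is None:
        return None
    return extract(html, output_format="html")


if __name__ == "__main__":
    print(extract_as_html(SAMPLE_PAGE))
```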

Evaluation:

Maintenance:

  • docs extended and updated, added page on deduplication (#618)
  • code reviewed, tests and types added in some of the submodules (#620, #623, #624, #625)