Breaking changes:
- Python 3.6 and 3.7 deprecated (#709)
- bare_extraction():
  - now returns an instance of the Document class by default
  - as_dict deprecation warning → use .as_dict() method on return value (#730)
- bare_extraction() and extract(): no_fallback deprecation warning → use fast instead (#730)
- downloads: remove decode argument in fetch_url() → use fetch_response instead (#724)
- deprecated graphical user interface now removed (#713)
- extraction: move max_tree_size parameter to settings.cfg (#742)
- use type hinting (#721, #723, #748)
- see Python and CLI deprecations in the docs
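For code that has to run against both older and newer releases, the return-type and keyword changes listed above could be absorbed by a small shim. The following is only a sketch using the standard library: the helper names result_to_dict and migrate_kwargs and the FakeDocument stand-in are hypothetical and not part of the package, and the sketch assumes that no_fallback and fast carry the same boolean meaning.

```python
from typing import Any, Dict


def result_to_dict(result: Any) -> Dict[str, Any]:
    """Return extraction output as a plain dict, whichever form it arrives in."""
    if result is None:
        # Assumption: an unsuccessful extraction is represented by None.
        return {}
    if isinstance(result, dict):
        # Older behaviour: the result is already a dict.
        return result
    if hasattr(result, "as_dict"):
        # Newer behaviour: a Document-like object exposing .as_dict().
        return result.as_dict()
    raise TypeError(f"unsupported result type: {type(result).__name__}")


def migrate_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Rename the deprecated no_fallback keyword to fast and drop as_dict."""
    updated = dict(kwargs)
    if "no_fallback" in updated:
        # An explicit fast value wins over the deprecated keyword.
        updated.setdefault("fast", updated.pop("no_fallback"))
    # Conversion to a dict is handled by result_to_dict() instead.
    updated.pop("as_dict", None)
    return updated


class FakeDocument:
    """Hypothetical stand-in for the Document class, for demonstration only."""

    def __init__(self, title: str) -> None:
        self.title = title

    def as_dict(self) -> Dict[str, Any]:
        return {"title": self.title}


if __name__ == "__main__":
    print(result_to_dict(FakeDocument("example")))
    print(migrate_kwargs({"no_fallback": True, "as_dict": True}))
```

With the library installed, a call would presumably look like result_to_dict(bare_extraction(html, **migrate_kwargs(old_kwargs))), where html and old_kwargs are placeholders for existing inputs.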
Fixes:
- set options.source before raising error on empty doc tree by @dmoklaf (#707)
- robust encoding in options.source (#717)
- more robust mapping for conversion to HTML (#721)
- CLI downloads: use all information in settings file (#734)
- downloads: cleaner urllib3 code (#736)
- refine table markdown output by @unsleepy22 (#752)
- extraction fix: images in text nodes by @unsleepy22 (#757)
Metadata:
- more robust URL extraction (#710)
Command-line interface:
- CLI: print URLs early for feeds and sitemaps with --list with @gremid (#744)
- CLI: add 126 exit code for high error ratio (#747)
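A calling script can tell the new 126 exit code apart from other failures. The sketch below uses only the standard library; classify_run and its labels are hypothetical names, and the demo command is a stand-in that simply exits with 126 rather than a real CLI invocation.

```python
import subprocess
import sys
from typing import List

HIGH_ERROR_RATIO = 126  # exit code named in the notes above


def classify_run(cmd: List[str]) -> str:
    """Run a command and label its outcome by exit code."""
    completed = subprocess.run(cmd, capture_output=True, text=True)
    if completed.returncode == 0:
        return "ok"
    if completed.returncode == HIGH_ERROR_RATIO:
        return "too many errors"
    return "failed"


if __name__ == "__main__":
    # Stand-in command that exits with 126; swap in the real CLI call.
    demo = [sys.executable, "-c", "import sys; sys.exit(126)"]
    print(classify_run(demo))
```

In a batch job, the "too many errors" label could be used to trigger a retry or an alert instead of treating the run as an ordinary failure.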
Maintenance:
- remove already deprecated functions and args (#716)
- add type hints (#723, #728)
- setup: use pyproject.toml file (#715)
- simplify code (#708, #709, #727)
- better debug messages in main_extractor (#714)
- evaluation: review data, update packages, add magic_html (#731)
- setup: explicit exports through __all__ (#740)
- tests: extend coverage (#753)
Documentation: