From b528514d9e6c5f69f927a85b2d8b550de50a9d58 Mon Sep 17 00:00:00 2001
From: Adrien Barbaresi
Date: Tue, 19 Nov 2024 16:51:44 +0100
Subject: [PATCH] update changelog and corresponding doc pages

---
 HISTORY.md            | 36 +++++++++++++++++++++++++++++++++---
 docs/index.rst        |  6 ++++++
 docs/installation.rst |  5 ++++-
 docs/usage-cli.rst    |  4 ++--
 docs/usage-gui.rst    |  4 ++++
 docs/usage-python.rst |  9 +++++++--
 6 files changed, 56 insertions(+), 8 deletions(-)

diff --git a/HISTORY.md b/HISTORY.md
index 49c7bce7..6dba38be 100644
--- a/HISTORY.md
+++ b/HISTORY.md
@@ -4,9 +4,39 @@
 ## future v2.0.0
 
 Breaking changes:
-- `bare_extraction()`: the function now returns an instance of the Document class by default
-- `bare_extraction()`: `as_dict` deprecation warning → use `.as_dict()` method on return value
-- `bare_extraction()` and `extract()`: `no_fallback` deprecation warning → use `fast` instead
+- Python 3.6 and 3.7 deprecated (#709)
+- `bare_extraction()`:
+  - now returns an instance of the `Document` class by default
+  - `as_dict` deprecation warning → use `.as_dict()` method on return value (#730)
+- `bare_extraction()` and `extract()`: `no_fallback` deprecation warning → use `fast` instead (#730)
+- downloads: remove `decode` argument in `fetch_url()` → use `fetch_response` instead (#724)
+- deprecated graphical user interface now removed (#713)
+- extraction: move `max_tree_size` parameter to `settings.cfg` (#742)
+- see [Python](https://trafilatura.readthedocs.io/en/latest/usage-python.html#deprecations) and [CLI](https://trafilatura.readthedocs.io/en/latest/usage-cli.html#deprecations) deprecations in the docs
+
+Fixes:
+- set `options.source` before raising error on empty doc tree by @dmoklaf (#707)
+- robust encoding in `options.source` (#717)
+- more robust mapping for conversion to HTML (#721)
+- CLI downloads: use all information in settings file (#734)
+- downloads: cleaner urllib3 code (#736)
+- CLI: print URLs early for feeds and sitemaps with `--list` with @gremid (#744)
+
+Metadata:
+- more robust URL extraction (#710)
+
+Maintenance:
+- remove already deprecated functions and args (#716)
+- add type hints (#723, #728)
+- setup: use `pyproject.toml` file (#715)
+- simplify code (#708, #709, #727)
+- better debug messages in `main_extractor` (#714)
+- evaluation: review data, update packages, add magic_html (#731)
+- setup: explicit exports through `__all__` (#740)
+
+Documentation:
+- fix link in `docs/index.html` by @nzw0301 (#711)
+- remove docs from published packages (#743)
 
 
 ## 1.12.2
diff --git a/docs/index.rst b/docs/index.rst
index 4c04b363..02969c85 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -127,6 +127,12 @@ Contributions of all kinds are welcome. Visit the `Contributing page `_ who extended the docs or submitted bug reports, features and bugfixes!
 
+Changes
+-------
+
+For version history and changes see the `changelog `_.
+
+
 Context
 -------
 
diff --git a/docs/installation.rst b/docs/installation.rst
index 80e7c055..e45bf4f7 100644
--- a/docs/installation.rst
+++ b/docs/installation.rst
@@ -72,6 +72,7 @@ In case this does not happen automatically, specify the version number:
 
 ``pip install trafilatura==number``
 
+- Last version for Python 3.6 and 3.7: ``1.12.2``
 - Last version for Python 3.5: ``0.9.3``
 - Last version for Python 3.4: ``0.8.2``
 
@@ -123,6 +124,8 @@ htmldate[all] / htmldate[speed]
 py3langid
     Language detection on extracted main text
 pycurl
-    Faster downloads, possibly less robust though
+    Faster downloads, useful where urllib3 fails
+urllib3[socks]
+    Downloads through SOCKS proxy with urllib3
 zstandard
     Additional compression algorithm for downloads
 
diff --git a/docs/usage-cli.rst b/docs/usage-cli.rst
index 9f416878..e5f75304 100644
--- a/docs/usage-cli.rst
+++ b/docs/usage-cli.rst
@@ -289,10 +289,10 @@ For all usage instructions see ``trafilatura -h``:
                       [--parallel PARALLEL] [-b BLACKLIST] [--list]
                       [-o OUTPUTDIR] [--backup-dir BACKUP_DIR] [--keep-dirs]
                       [--feed [FEED] | --sitemap [SITEMAP] | --crawl [CRAWL] |
-                      --explore [EXPLORE]] [--archived]
+                      --explore [EXPLORE] | --probe [PROBE]] [--archived]
                       [--url-filter URL_FILTER [URL_FILTER ...]] [-f]
                       [--formatting] [--links] [--images] [--no-comments]
-                      [--no-tables] [--only-with-metadata]
+                      [--no-tables] [--only-with-metadata] [--with-metadata]
                       [--target-language TARGET_LANGUAGE] [--deduplicate]
                       [--config-file CONFIG_FILE] [--precision] [--recall]
                       [--output-format {csv,json,html,markdown,txt,xml,xmltei} |
diff --git a/docs/usage-gui.rst b/docs/usage-gui.rst
index 4096ebcd..8b99896c 100644
--- a/docs/usage-gui.rst
+++ b/docs/usage-gui.rst
@@ -8,6 +8,10 @@ Note that the GUI was a feature in Trafilatura until version 1.8.1, but it is cu
 
 If installation fails, usage on the command-line is recommended.
 
+.. hint::
+    This interface is removed until further notice starting from Trafilatura version 2, mostly due to issues with cross-platform tests and maintenance.
+
+
 Installation
 ~~~~~~~~~~~~
 
diff --git a/docs/usage-python.rst b/docs/usage-python.rst
index 0022e5bf..f2a2216f 100644
--- a/docs/usage-python.rst
+++ b/docs/usage-python.rst
@@ -481,7 +481,12 @@ Deprecations
 
 The following functions and arguments are deprecated:
 
-- extraction: ``process_record()`` function → use ``extract()`` instead
+- extraction:
+    - ``process_record()`` function → use ``extract()`` instead
+    - ``csv_output``, ``json_output``, ``tei_output``, ``xml_output`` → use ``output_format`` parameter instead
+    - ``bare_extraction(as_dict=True)`` → the function returns a ``Document`` object, use ``.as_dict()`` method on it
+    - ``bare_extraction()`` and ``extract()``: ``no_fallback`` → use ``fast`` instead
+    - ``max_tree_size`` parameter moved to ``settings.cfg`` file
+- downloads: ``decode`` argument in ``fetch_url()`` → use ``fetch_response`` instead
 - utils: ``decode_response()`` function → use ``decode_file()`` instead
-- extraction: ``csv_output``, ``json_output``, ``tei_output``, ``xml_output`` → use ``output_format`` parameter instead
 - metadata: ``with_metadata`` (include metadata) had once the effect of today's ``only_with_metadata`` (only documents with necessary metadata)
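
Reviewer note: the extraction-argument deprecations listed in this patch can be read as a small translation table. The sketch below is one way to express it, not Trafilatura code: `migrate_kwargs` and `OUTPUT_FLAGS` are hypothetical names introduced for illustration only. It assumes that the boolean value of `no_fallback` carries over unchanged to `fast`, and that `output_format` accepts the same format names as the CLI usage text in `docs/usage-cli.rst`.

```python
# Hypothetical helper for illustration only; it is NOT part of Trafilatura.
# It rewrites deprecated keyword arguments following the deprecation notes above.

# Deprecated boolean output flags and the output_format value replacing them.
# Assumption: the Python API accepts the same format names as the CLI usage text.
OUTPUT_FLAGS = {
    "csv_output": "csv",
    "json_output": "json",
    "tei_output": "xmltei",
    "xml_output": "xml",
}


def migrate_kwargs(kwargs):
    """Return (new_kwargs, notes) with deprecated arguments translated."""
    new = dict(kwargs)  # work on a copy, leave the caller's dict untouched
    notes = []

    # no_fallback -> fast (an explicitly passed `fast` value wins)
    if "no_fallback" in new:
        value = new.pop("no_fallback")
        new.setdefault("fast", value)

    # csv_output / json_output / tei_output / xml_output -> output_format
    for flag, fmt in OUTPUT_FLAGS.items():
        if new.pop(flag, False):
            new.setdefault("output_format", fmt)

    # as_dict has no replacement argument: the caller converts the result instead
    if new.pop("as_dict", None) is not None:
        notes.append("call .as_dict() on the return value of bare_extraction()")

    # max_tree_size is no longer an argument, it is configured in settings.cfg
    if "max_tree_size" in new:
        notes.append(f"set max_tree_size = {new.pop('max_tree_size')} in settings.cfg")

    return new, notes


if __name__ == "__main__":
    old_call = {"no_fallback": True, "json_output": True, "as_dict": True}
    print(migrate_kwargs(old_call))
```

The helper returns notes rather than raising, since two of the changes (`as_dict` and `max_tree_size`) have no replacement keyword and require a change at the call site or in the settings file.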