From 83b69259e09a45acf7571bc19d6a02ee13968c8e Mon Sep 17 00:00:00 2001
From: Adrien Barbaresi
Date: Fri, 3 Nov 2023 17:08:59 +0100
Subject: [PATCH] update docs

---
 docs/crawls.rst        | 2 +-
 docs/installation.rst  | 3 +++
 docs/tutorial-dwds.rst | 2 +-
 docs/tutorial0.rst     | 6 +++---
 docs/tutorial1.rst     | 4 ++--
 5 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/docs/crawls.rst b/docs/crawls.rst
index 01cd949e..165fab52 100644
--- a/docs/crawls.rst
+++ b/docs/crawls.rst
@@ -114,7 +114,7 @@ On the CLI the crawler automatically works its way through a website, stopping a
 
     $ trafilatura --crawl "https://www.example.org" > links.txt
 
-It can also crawl websites in parallel by reading a list of target sites from a list (``-i``/``--inputfile`` option).
+It can also crawl websites in parallel by reading a list of target sites from a list (``-i``/``--input-file`` option).
 
 .. note::
     The ``--list`` option does not apply here. Unlike with the ``--sitemap`` or ``--feed`` options, the URLs are simply returned as a list instead of being retrieved and processed. This happens in order to give a chance to examine the collected URLs prior to further downloads.
diff --git a/docs/installation.rst b/docs/installation.rst
index ae4b46f7..a83acc8f 100644
--- a/docs/installation.rst
+++ b/docs/installation.rst
@@ -61,6 +61,9 @@ This project is under active development, please make sure you keep it up-to-dat
 
 On **Mac OS** it can be necessary to install certificates by hand if you get errors like ``[SSL: CERTIFICATE_VERIFY_FAILED]`` while downloading webpages: execute ``pip install certifi`` and perform the post-installation step by clicking on ``/Applications/Python 3.X/Install Certificates.command``. For more information see this `help page on SSL errors `_.
 
+.. hint::
+    Installation on MacOS is generally easier with `brew `_.
+
 Older Python versions
 ~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/docs/tutorial-dwds.rst b/docs/tutorial-dwds.rst
index c9e871c8..a45aeb44 100644
--- a/docs/tutorial-dwds.rst
+++ b/docs/tutorial-dwds.rst
@@ -114,7 +114,7 @@ Diese Linkliste kann zunächst gefiltert werden, um deutschsprachige, inhaltsrei
 
 Die Ausgabe von *Trafilatura* erfolgt auf zweierlei Weise: die extrahierten Texte (TXT-Format) im Verzeichnis ``ausgabe`` und eine Kopie der heruntergeladenen Webseiten unter ``html-quellen`` (zur Archivierung und ggf. erneuten Verarbeitung):
 
-``trafilatura --inputfile linkliste.txt --outputdir ausgabe/ --backup-dir html-quellen/``
+``trafilatura --input-file linkliste.txt --outputdir ausgabe/ --backup-dir html-quellen/``
 
 So werden TXT-Dateien ohne Metadaten ausgegeben. Wenn Sie ``--csv``, ``--json``, ``--xml`` oder ``--xmltei`` hinzufügen, werden Metadaten einbezogen und das entsprechende Format für die Ausgabe bestimmt. Zusätzliche Optionen sind verfügbar, siehe die passenden Dokumentationsseiten.
 
diff --git a/docs/tutorial0.rst b/docs/tutorial0.rst
index 504509dc..19e42e1c 100644
--- a/docs/tutorial0.rst
+++ b/docs/tutorial0.rst
@@ -171,8 +171,8 @@ Seamless download and processing
 
 Two major command line arguments are necessary here:
 
-- ``-i`` or ``--inputfile`` to select an input list to read links from
-- ``-o`` or ``--outputdir`` to define a directory to eventually store the results
+- ``-i`` or ``--input-file`` to select an input list to read links from
+- ``-o`` or ``--output-dir`` to define a directory to eventually store the results
 
 An additional argument can be useful in this context:
 
@@ -213,6 +213,6 @@ Alternatively, you can download a series of web documents with generic command-l
     # download if necessary
     $ wget --directory-prefix=download/ --wait 5 --input-file=mylist.txt
     # process a directory with archived HTML files
-    $ trafilatura --inputdir download/ --outputdir corpus/ --xmltei --nocomments
+    $ trafilatura --input-dir download/ --output-dir corpus/ --xmltei --no-comments
 
 
diff --git a/docs/tutorial1.rst b/docs/tutorial1.rst
index f28041da..372cca86 100644
--- a/docs/tutorial1.rst
+++ b/docs/tutorial1.rst
@@ -26,8 +26,8 @@ For the collection and filtering of links see `this tutorial `_
 
 Two major options are necessary here:
 
-- ``-i`` or ``--inputfile`` to select an input list to read links from
-- ``-o`` or ``--outputdir`` to define a directory to eventually store the results
+- ``-i`` or ``--input-file`` to select an input list to read links from
+- ``-o`` or ``--output-dir`` to define a directory to eventually store the results
 
 The input list will be read sequentially, and only lines beginning with a valid URL will be read; any other information contained in the file will be discarded.
 