update docs (#481)
* update docs

* fix docs generation and add references

* add API
adbar authored Jan 24, 2024
1 parent 5c2761e commit 379ddeb
Showing 12 changed files with 107 additions and 53 deletions.
23 changes: 11 additions & 12 deletions README.rst
Expand Up @@ -2,7 +2,7 @@ Trafilatura: Discover and Extract Text Data on the Web
======================================================


.. image:: docs/trafilatura-logo.png
.. image:: https://raw.githubusercontent.com/adbar/htmldate/master/docs/trafilatura-logo.png
:alt: Trafilatura Logo
:align: center
:width: 60%
Expand Down Expand Up @@ -35,7 +35,7 @@ Trafilatura: Discover and Extract Text Data on the Web

|
.. image:: docs/trafilatura-demo.gif
.. image:: https://raw.githubusercontent.com/adbar/htmldate/master/docs/trafilatura-demo.gif
:alt: Demo as GIF image
:align: center
:width: 85%
Expand All @@ -48,10 +48,9 @@ Introduction

Trafilatura is a cutting-edge **Python package and command-line tool** designed to **gather text on the Web and simplify the process of turning raw HTML into structured, meaningful data**. It includes all necessary discovery and text processing components to perform **web crawling, downloads, scraping, and extraction** of main texts, metadata and comments. It aims at staying **handy and modular**: no database is required, the output can be converted to multiple commonly used formats.

Smart navigation and going from HTML bulk to essential parts can alleviate many problems related to text quality, first by **focusing on the actual content**, second by **avoiding the noise** caused by recurring elements (headers, footers etc.), and third by **making sense of the data** with information such as author and publication date. The extractor tries to strike a balance between limiting noise and including all valid parts. It also has to be **robust and reasonably fast** as it runs in production on millions of documents.

The tool's versatility makes it useful for a wide range of applications leveraging web content for knowledge discovery such as **quantitative and data-driven approaches**. It is relevant to anyone interested in language modeling, data mining, information extraction. Scraping-intensive use cases include search engine optimization, business analytics and information security. Trafilatura is used in the academic domain, chiefly for data acquisition in corpus linguistics, natural language processing, and computational social science.
Smart navigation and going from HTML bulk to essential parts can alleviate many problems related to text quality, by **focusing on the actual content**, **avoiding the noise** caused by recurring elements (headers, footers etc.), and **making sense of the data** with selected information. The extractor is designed to be **robust and reasonably fast**: it runs in production on millions of documents.

The tool's versatility makes it useful for a wide range of applications leveraging web content for knowledge discovery such as **quantitative and data-driven approaches**. Trafilatura is used in the academic domain and beyond (e.g. in NLP, SEO, business analytics).

Features
~~~~~~~~
Expand Down Expand Up @@ -86,9 +85,9 @@ Features
Evaluation and alternatives
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Trafilatura consistently outperforms other open-source libraries in text extraction benchmarks, showcasing its efficiency and accuracy in extracting web content.
Trafilatura consistently outperforms other open-source libraries in text extraction benchmarks, showcasing its efficiency and accuracy in extracting web content. The extractor tries to strike a balance between limiting noise and including all valid parts.

For more detailed results see the `benchmark <https://trafilatura.readthedocs.io/en/latest/evaluation.html>`_ and `evaluation script <https://github.com/adbar/trafilatura/blob/master/tests/comparison.py>`_. To reproduce the tests just clone the repository, install all necessary packages and run the evaluation script with the data provided in the *tests* directory.
For more detailed results see the `benchmark <https://trafilatura.readthedocs.io/en/latest/evaluation.html>`_. The results can be reproduced; see the `evaluation readme <https://github.com/adbar/trafilatura/blob/master/tests/README.rst>`_ for instructions.

=============================== ========= ========== ========= ========= ======
750 documents, 2236 text & 2250 boilerplate segments (2022-05-18), Python 3.8
Expand All @@ -112,7 +111,8 @@ Other evaluations:
^^^^^^^^^^^^^^^^^^

- Most efficient open-source library in *ScrapingHub*'s `article extraction benchmark <https://github.com/scrapinghub/article-extraction-benchmark>`_
- Best overall tool according to Gaël Lejeune & Adrien Barbaresi, `Bien choisir son outil d'extraction de contenu à partir du Web <https://hal.archives-ouvertes.fr/hal-02768510v3/document>`_ (2020, PDF, French)
- Best overall tool according to `Bien choisir son outil d'extraction de contenu à partir du Web <https://hal.archives-ouvertes.fr/hal-02768510v3/document>`_ (Lejeune & Barbaresi 2020)
- Best single tool by ROUGE-LSum Mean F1 Page Scores in `An Empirical Comparison of Web Content Extraction Algorithms <https://webis.de/downloads/publications/papers/bevendorff_2023b.pdf>`_ (Bevendorff et al. 2023)


Usage and documentation
Expand All @@ -139,7 +139,7 @@ License

*Trafilatura* is distributed under the `GNU General Public License v3.0 <https://github.com/adbar/trafilatura/blob/master/LICENSE>`_. This license promotes collaboration in software development and ensures that Trafilatura's code remains publicly accessible.

If you wish to redistribute this library but are concerned about the license conditions, consider interacting `at arm's length <https://www.gnu.org/licenses/gpl-faq.html#GPLInProprietarySystem>`_, multi-licensing with `compatible licenses <https://en.wikipedia.org/wiki/GNU_General_Public_License#Compatibility_and_multi-licensing>`_, or `contacting the author <#author>`_ for more options.
If you wish to redistribute this library but are concerned about the license conditions, consider interacting `at arm's length <https://www.gnu.org/licenses/gpl-faq.html#GPLInProprietarySystem>`_, combining with `compatible licenses <https://www.gnu.org/licenses/license-list.html#GPLCompatibleLicenses>`_, or `contacting the author <#author>`_ for more options.

For insights into GPL and free software licensing with emphasis on a business context, see `GPL and Free Software Licensing: What's in it for Business? <https://web.archive.org/web/20230127221311/https://www.techrepublic.com/article/gpl-and-free-software-licensing-whats-in-it-for-business/>`_

Expand Down Expand Up @@ -175,8 +175,7 @@ This work started as a PhD project at the crossroads of linguistics and NLP, thi
Citing Trafilatura
~~~~~~~~~~~~~~~~~~


If you use Trafilatura in your research or projects, we kindly ask you to cite this work, here is how:
Trafilatura is used in the academic domain, chiefly for data acquisition in corpus linguistics, natural language processing, and computational social science. Here is how to cite it:

.. image:: https://img.shields.io/badge/DOI-10.18653%2Fv1%2F2021.acl--demo.15-blue
:target: https://aclanthology.org/2021.acl-demo.15/
Expand Down Expand Up @@ -207,7 +206,7 @@ This software is part of a larger ecosystem. It is employed in a variety of acad
Jointly developed plugins and additional packages also contribute to the field of web data extraction and analysis:


.. image:: docs/software-ecosystem.png
.. image:: https://raw.githubusercontent.com/adbar/htmldate/master/docs/software-ecosystem.png
:alt: Software ecosystem
:align: center
:width: 65%
11 changes: 8 additions & 3 deletions docs/corefunctions.rst
Expand Up @@ -77,10 +77,15 @@ Helpers

.. autofunction:: trafilatura.fetch_url

``decode_response()``
~~~~~~~~~~~~~~~~~~~~~
``fetch_response()``
~~~~~~~~~~~~~~~~~~~~

.. autofunction:: trafilatura.fetch_response

``decode_file()``
~~~~~~~~~~~~~~~~~

.. autofunction:: trafilatura.utils.decode_response
.. autofunction:: trafilatura.utils.decode_file

``load_html()``
~~~~~~~~~~~~~~~
22 changes: 11 additions & 11 deletions docs/downloads.rst
Expand Up @@ -13,9 +13,6 @@ This documentation page shows how to run simple downloads and how to configure a
A main objective of data collection over the Internet such as web crawling is to efficiently gather as many useful web pages as possible. In order to retrieve multiple web pages at once it makes sense to retrieve as many domains as possible in parallel. However, particular rules apply then.


*New in version 0.9: Functions exposed and made usable for convenience.*


With Python
-----------

Expand Down Expand Up @@ -43,25 +40,28 @@ Running simple downloads is straightforward with the ``fetch_url()`` function. T
For efficiency reasons the function makes use of a connection pool where connections are kept open (unless too many websites are retrieved at once). You may see warnings in logs about it which you can safely ignore.


``RawResponse`` object
~~~~~~~~~~~~~~~~~~~~~~

The content (stored here in the variable ``downloaded``) is seamlessly decoded to a Unicode string.
``Response`` object
~~~~~~~~~~~~~~~~~~~

This default setting can be overridden using the ``decode=False`` parameter. ``fetch_url()`` then returns a `urllib3-like response object <https://urllib3.readthedocs.io/en/latest/user-guide.html#response-content>`_ providing additional information.
The content retrieved by ``fetch_url()`` (stored here in the variable ``downloaded``) is seamlessly decoded to a Unicode string.

This ``RawResponse`` object comprises the attributes ``data``, ``status``, and ``url`` which can be accessed as follows:
Using the ``fetch_response()`` function instead provides access to more information stored in a ``Response`` object which comprises the attributes ``data`` (bytestring), ``headers`` (optional dict), ``html`` (optional str), ``status``, and ``url``:

.. code-block:: python
# RawResponse object instead of Unicode string
>>> response = fetch_url('https://www.example.org', decode=False)
# Response object instead of Unicode string
>>> response = fetch_response('https://www.example.org')
>>> response.status
200
>>> response.url
'https://www.example.org'
>>> response.data
# raw HTML in binary format
>>> response = fetch_response('https://www.example.org', decode=True, with_headers=True)
# headers and html attributes used
.. note::
New in version 1.7.0.


Trafilatura-backed parallel threads
7 changes: 4 additions & 3 deletions docs/evaluation.rst
Expand Up @@ -87,9 +87,10 @@ trafilatura 1.2.2 (standard) 0.914 0.904 **0.910** **0.909** 7.1x
External evaluations
--------------------

- Trafilatura is the most efficient open-source library in *ScrapingHub*'s `article extraction benchmark <https://github.com/scrapinghub/article-extraction-benchmark>`_.
- Best overall tool according to Gaël Lejeune & Adrien Barbaresi, `Bien choisir son outil d'extraction de contenu à partir du Web <https://hal.archives-ouvertes.fr/hal-02768510v3/document>`_ (2020, PDF, in French).
- Comparison on a small `sample of Polish news texts and forums <https://github.com/tsolewski/Text_extraction_comparison_PL>`_.
- Most efficient open-source library in *ScrapingHub*'s `article extraction benchmark <https://github.com/scrapinghub/article-extraction-benchmark>`_
- Best overall tool according to `Bien choisir son outil d'extraction de contenu à partir du Web <https://hal.archives-ouvertes.fr/hal-02768510v3/document>`_ (Lejeune & Barbaresi 2020)
- Comparison on a small `sample of Polish news texts and forums <https://github.com/tsolewski/Text_extraction_comparison_PL>`_ (now integrated in the internal benchmark; Trafilatura has improved since)
- Best single tool by ROUGE-LSum Mean F1 Page Scores in `An Empirical Comparison of Web Content Extraction Algorithms <https://webis.de/downloads/publications/papers/bevendorff_2023b.pdf>`_ (Bevendorff et al. 2023)


Older results
28 changes: 15 additions & 13 deletions docs/index.rst
Expand Up @@ -75,7 +75,10 @@ Features
Evaluation and alternatives
~~~~~~~~~~~~~~~~~~~~~~~~~~~

For detailed results see the `benchmark <evaluation.html>`_ and `evaluation script <https://github.com/adbar/trafilatura/blob/master/tests/comparison.py>`_. To reproduce the tests just clone the repository, install all necessary packages and run the evaluation script with the data provided in the *tests* directory.
Trafilatura consistently outperforms other open-source libraries in text extraction benchmarks, showcasing its efficiency and accuracy in extracting web content. The extractor tries to strike a balance between limiting noise and including all valid parts.

For detailed results see the `benchmark <evaluation.html>`_. The results can be reproduced; see the `evaluation readme <https://github.com/adbar/trafilatura/blob/master/tests/README.rst>`_ for instructions.


Other evaluations:
^^^^^^^^^^^^^^^^^^
Expand Down Expand Up @@ -117,17 +120,18 @@ For more information please refer to `usage documentation <usage.html>`_ and `tu
License
-------

*Trafilatura* is distributed under the `GNU General Public License v3.0 <https://github.com/adbar/trafilatura/blob/master/LICENSE>`_. If you wish to redistribute this library but feel bounded by the license conditions please try interacting `at arms length <https://www.gnu.org/licenses/gpl-faq.html#GPLInProprietarySystem>`_, `multi-licensing <https://en.wikipedia.org/wiki/Multi-licensing>`_ with `compatible licenses <https://en.wikipedia.org/wiki/GNU_General_Public_License#Compatibility_and_multi-licensing>`_, or `contacting me <https://github.com/adbar/trafilatura#author>`_.
*Trafilatura* is distributed under the `GNU General Public License v3.0 <https://github.com/adbar/trafilatura/blob/master/LICENSE>`_. This license promotes collaboration in software development and ensures that Trafilatura's code remains publicly accessible.

If you wish to redistribute this library but are concerned about the license conditions, consider interacting `at arm's length <https://www.gnu.org/licenses/gpl-faq.html#GPLInProprietarySystem>`_, combining with `compatible licenses <https://www.gnu.org/licenses/license-list.html#GPLCompatibleLicenses>`_, or `contacting the author <https://adrien.barbaresi.eu>`_ for more options.

See also `GPL and free software licensing: What's in it for business? <https://web.archive.org/web/20230127221311/https://www.techrepublic.com/article/gpl-and-free-software-licensing-whats-in-it-for-business/>`_
For insights into GPL and free software licensing with emphasis on a business context, see `GPL and Free Software Licensing: What's in it for Business? <https://web.archive.org/web/20230127221311/https://www.techrepublic.com/article/gpl-and-free-software-licensing-whats-in-it-for-business/>`_



Context
-------


These documentation pages also provide information on `concepts behind data collection <background.html>`_ as well as practical tips on how to gather web texts (see `tutorials <tutorials.html>`_).
Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge. These documentation pages also provide information on `concepts behind data collection <background.html>`_ as well as practical tips on how to gather web texts (see `tutorials <tutorials.html>`_).



Expand All @@ -136,8 +140,6 @@ Contributing

Contributions are welcome! See `CONTRIBUTING.md <https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md>`_ for more information. Bug reports can be filed on the `dedicated page <https://github.com/adbar/trafilatura/issues>`_.

Many thanks to the `contributors <https://github.com/adbar/trafilatura/graphs/contributors>`_ who submitted features and bugfixes!


Roadmap
~~~~~~~
Expand All @@ -148,7 +150,9 @@ For planned enhancements and relevant milestones see `issues page <https://githu
Author
~~~~~~

This effort is part of methods to derive information from web documents in order to build `text databases for research <https://www.dwds.de/d/k-web>`_ (chiefly linguistic analysis and natural language processing). Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge for those who conduct such research. Web corpus construction involves numerous design decisions, and this software package can help facilitate text data collection and enhance corpus quality.
Reach out via the `contact page <https://adrien.barbaresi.eu/>`_ for inquiries, collaborations, or feedback. See also `Twitter/X <https://x.com/adbarbaresi>`_ for the latest updates.

This work started as a PhD project at the crossroads of linguistics and NLP; this expertise has been instrumental in shaping Trafilatura over the years. It was first released in its current form in 2019, and its development is referenced in the following publications:


- Barbaresi, A. `Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction <https://aclanthology.org/2021.acl-demo.15/>`_, Proceedings of ACL/IJCNLP 2021: System Demonstrations, 2021, p. 122-131.
Expand Down Expand Up @@ -184,17 +188,15 @@ You can contact me via my `contact page <https://adrien.barbaresi.eu/>`_ or on `
Software ecosystem
~~~~~~~~~~~~~~~~~~

This software is part of a larger ecosystem. It is employed in a variety of academic and development projects, demonstrating its versatility and effectiveness. Case studies and publications are listed on the `Used By documentation page <used-by.html>`_.

Jointly developed plugins and additional packages also contribute to the field of web data extraction and analysis:

.. image:: software-ecosystem.png
:alt: Software ecosystem
:align: center
:width: 65%


*Trafilatura*: `Italian word <https://en.wiktionary.org/wiki/trafilatura>`_ for `wire drawing <https://en.wikipedia.org/wiki/Wire_drawing>`_.

`Known uses of the software <used-by.html>`_.

Corresponding posts on `Bits of Language <https://adrien.barbaresi.eu/blog/tag/trafilatura.html>`_ (blog).


2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -1,6 +1,6 @@
# with version specifier
sphinx>=7.2.6
pydata-sphinx-theme>=0.15.1
pydata-sphinx-theme>=0.15.2
docutils>=0.20.1
# without version specifier
trafilatura
5 changes: 4 additions & 1 deletion docs/troubleshooting.rst
Expand Up @@ -43,10 +43,13 @@ Downloads
HTTP library
^^^^^^^^^^^^

Using another download utility (see ``pycurl`` with Python and ``wget`` or ``curl`` on the command-line).
In the default settings Trafilatura identifies itself in the `User-Agent header <https://en.wikipedia.org/wiki/User-Agent_header>`_. This default identifier may have been misused by others on certain websites and blocked as a result; see `this discussion <https://www.webmasterworld.com/search_engine_spiders/5090863.htm>`_.

For various reasons, it is also possible that the standard download utility fails. Using another one is then an option (see ``pycurl`` with Python and ``wget`` or ``curl`` on the command-line).

- Installing the additional download utility ``pycurl`` manually or using ``pip3 install trafilatura[all]`` can alleviate the problem: another download library is used, leading to different results.
- Several alternatives are available on the command-line, e.g. ``wget -O - "my_url" | trafilatura`` instead of ``trafilatura -u "my_url"``.
- Emulating a browser is also possible, see the information on headless browsing above.

.. note::
Downloads may fail because your IP or user agent are blocked. Trafilatura's crawling and download capacities do not bypass such restrictions.
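
As an illustration of using another download utility, the sketch below relies only on Python's standard library (``urllib.request``) and sends a custom User-Agent string. The function names and the user agent value are hypothetical, chosen for this example only; this is not how Trafilatura performs downloads internally, and it does not bypass any blocking.

```python
import urllib.request


def build_request(url, user_agent):
    """Prepare a GET request announcing a custom User-Agent (illustrative helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download_html(url, user_agent, timeout=30):
    """Fetch a page with the standard library and return it as a string."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, or assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Building the request does not touch the network.
request = build_request("https://www.example.org", "my-research-crawler/0.1")
print(request.get_header("User-agent"))
```

The string returned by ``download_html()`` could then be handed to Trafilatura's ``extract()`` function, in the same spirit as the ``wget`` pipeline mentioned above.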
32 changes: 32 additions & 0 deletions docs/usage-api.rst
@@ -0,0 +1,32 @@
API
===

.. meta::
:description lang=en:
See how to use the official Trafilatura API to download or extract data for free or for larger volumes.


Introduction
------------

Use the latest version of the software straight from the application programming interface. This is especially useful if you want to try out Trafilatura without installing it or if you want to support the project while saving time.

- Fast URL download, or use HTML file as input
- Configurable output


Endpoints
---------

The official API comes in two versions, available from two different gateways:

- `Free for demonstration purposes <https://trafilatura.mooo.com>`_ (including documentation page)
- `For a larger volume of requests <https://rapidapi.com/trafapi/api/trafilatura>`_ (documentation and plans)



Further information
-------------------

The API is still an early-stage product and the code is currently not available under an open-source license.

1 change: 1 addition & 0 deletions docs/usage.rst
Expand Up @@ -9,6 +9,7 @@ Usage
usage-python
usage-cli
usage-r
usage-api
usage-gui
downloads
crawls