Commit 6ef406a

Updates to visually tag "layout mode" explain PyMuPDF Layout more.
1 parent 90cdc13 commit 6ef406a

File tree

7 files changed

+82
-38
lines changed

docs/images/layout-ocr-flow.png

58 KB

docs/installation.rst

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -330,4 +330,9 @@ So for a working OCR functionality, make sure to complete this checklist:
330330
* Windows: `setx TESSDATA_PREFIX "C:/Program Files/Tesseract-OCR/tessdata"`
331331
* Unix systems: `declare -x TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata`
332332

333+
334+
.. note::
335+
336+
Find out more in the `official Tesseract installation documentation <https://tesseract-ocr.github.io/tessdoc/Installation.html>`_.
337+
333338
.. include:: footer.rst
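The checklist above ends with setting ``TESSDATA_PREFIX``. As a minimal sketch (not part of the commit, and not a PyMuPDF API), the following helper checks whether that variable points at a folder that actually contains Tesseract language files. The function name ``find_traineddata`` is hypothetical; the only facts relied on are the environment variable name from the checklist and Tesseract's ``<lang>.traineddata`` file naming.

```python
import os
from pathlib import Path


def find_traineddata(env=None):
    """Return the language codes found in the TESSDATA_PREFIX folder.

    Hypothetical helper for illustration only. Returns an empty list if
    the variable is unset or does not point at an existing directory.
    """
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        return []
    folder = Path(prefix)
    if not folder.is_dir():
        return []
    # Tesseract language models are stored as <lang>.traineddata files.
    return sorted(p.stem for p in folder.glob("*.traineddata"))


if __name__ == "__main__":
    langs = find_traineddata()
    print("Languages found:", langs or "none (check TESSDATA_PREFIX)")
```

An empty result suggests the variable is missing or points at the wrong folder. It does not prove that OCR will fail, only that this particular check found no language data.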

docs/pymupdf-layout/index.rst

Lines changed: 21 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11

22
.. include:: ../header.rst
33

4-
.. _pymupdf-layout
4+
.. _pymupdf-layout:
55

66

77
PyMuPDF Layout
@@ -22,6 +22,8 @@ Install from |PyPI| with::
2222
pip install pymupdf-layout
2323

2424

25+
.. _pymupdf_layout_using:
26+
2527
Using
2628
----------------------------------
2729

@@ -118,16 +120,31 @@ Now we can happily load Office files and convert them as follows::
118120
md = pymupdf4llm.to_markdown("sample.docx")
119121

120122

123+
.. _pymupdf_layout_ocr_support:
124+
121125
OCR support
122126
~~~~~~~~~~~~~~~~~
123127

124-
The new layout-sensitive PyMuPDF4LLM version also evaluates whether a page would benefit from applying OCR to it. If its heuristics come to this conclusion, the built-in Tesseract-OCR module is automatically invoked. Its results are then handled like normal page content.
128+
The new layout-sensitive |PyMuPDF4LLM| version also evaluates whether a page would benefit from applying OCR to it. If its heuristics come to this conclusion, the built-in Tesseract-OCR module is automatically invoked. Its results are then handled like normal page content.
125129

126-
If a page contains (roughly) no text at all, but is covered with images or many character-sized vectors, a check is made using `OpenCV <https://pypi.org/project/opencv-python/>`_ whether text is *probably* detectable on the page at all. This is done to tell apart image-based text from ordinary pictures (like photographies).
130+
If a page contains (roughly) no text at all, but is covered with images or many character-sized vectors, a check is made using `OpenCV <https://pypi.org/project/opencv-python/>`_ whether text is *probably* detectable on the page at all. This is done to tell apart image-based text from ordinary pictures (like photographs).
127131

128132
If the page does contain text but too many characters are unreadable (like "�����"), OCR is also executed, but **for the affected text areas only** -- not the full page. This way, we avoid losing already existing text and other content like images and vectors.
129133

130-
For these heuristics to work we need both, an existing Tesseract installation and the availability of OpenCV in the Python environment. If either is missing, no OCR is attempted at all.
134+
For these heuristics to work, we need both an existing :ref:`Tesseract installation <installation_ocr>` and `OpenCV <https://pypi.org/project/opencv-python/>`_ available in the Python environment. If either is missing, no OCR is attempted at all.
135+
136+
Whether OCR is actually used depends on the following conditions:
137+
138+
1. :ref:`PyMuPDF Layout is imported <pymupdf_layout_using>`
139+
140+
2. In the :ref:`PyMuPDF4LLM API <pymupdf4llm-api>` you have `use_ocr` enabled (this is set to `True` by default)
141+
142+
3. :ref:`Tesseract is correctly installed <installation_ocr>`
143+
144+
4. `OpenCV <https://pypi.org/project/opencv-python/>`_ is available in your Python environment
145+
146+
147+
.. image:: ../images/layout-ocr-flow.png
131148

132149
----
133150
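The four conditions listed in the OCR support section can be approximated with a small pre-flight check. This is a rough sketch under stated assumptions, not how PyMuPDF4LLM decides internally: the module name ``pymupdf.layout`` is assumed, the ``TESSDATA_PREFIX`` test is only a proxy for "Tesseract is correctly installed", and the helper names ``ocr_prerequisites`` and ``ocr_possible`` are hypothetical. ``cv2`` is the import name of the OpenCV Python package.

```python
import importlib.util
import os
import sys


def ocr_prerequisites(use_ocr=True, env=None, modules=None):
    """Report each OCR precondition as a boolean (illustrative only)."""
    env = os.environ if env is None else env
    modules = sys.modules if modules is None else modules
    return {
        # 1. PyMuPDF Layout has been imported (module name is an assumption).
        "layout_imported": "pymupdf.layout" in modules,
        # 2. The caller leaves use_ocr enabled (True by default per the docs).
        "use_ocr": bool(use_ocr),
        # 3. Proxy for a working Tesseract setup: TESSDATA_PREFIX is set.
        "tessdata_configured": bool(env.get("TESSDATA_PREFIX")),
        # 4. OpenCV can be imported in this environment.
        "opencv_available": importlib.util.find_spec("cv2") is not None,
    }


def ocr_possible(checks):
    """OCR can only be attempted when every precondition holds."""
    return all(checks.values())


if __name__ == "__main__":
    checks = ocr_prerequisites()
    for name, ok in checks.items():
        print(f"{name}: {'ok' if ok else 'missing'}")
    print("OCR can be attempted:", ocr_possible(checks))
```

Even when every check passes, OCR is still only applied to pages that the heuristics described above judge likely to benefit from it.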

docs/pymupdf-pro/index.rst

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22
.. include:: ../header.rst
33

44

5-
.. _pymupdf-pro
5+
.. _pymupdf-pro:
66

77
PyMuPDF Pro
88
=============

docs/pymupdf4llm/api.rst

Lines changed: 46 additions & 24 deletions
Large diffs are not rendered by default.

docs/pymupdf4llm/index.rst

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,15 +1,15 @@
11

22
.. include:: ../header.rst
33

4-
.. _pymupdf4llm
4+
.. _pymupdf4llm:
55

66

77
PyMuPDF4LLM
88
===========================================================================
99

1010
|PyMuPDF4LLM| is aimed to make it easier to extract |PDF| content in the format you need for **LLM** & **RAG** environments. It supports :ref:`Markdown extraction <extracting_as_md>` as well as :ref:`LlamaIndex document output <extracting_as_llamaindex>`.
1111

12-
When using |PyMuPDF4LLM| with PyMuPDF-Layout, page layout detection will be greatly improved. This is true for table detection, but also for the detection of page headers and footers, footnotes, list items and text paragraphs. In addition two new methods become available, `to_json()` and `to_text()`.
12+
When using |PyMuPDF4LLM| with PyMuPDF Layout, page layout detection will be greatly improved. This is true for table detection, but also for the detection of page headers and footers, footnotes, list items and text paragraphs. In addition, two new methods become available: `to_json()` and `to_text()`.
1313

1414
.. important::
1515

@@ -22,8 +22,8 @@ Features
2222
- Support for image and vector graphics extraction (and inclusion of references in the MD text)
2323
- Support for page chunking output.
2424
- Direct support for output as :ref:`LlamaIndex Documents <extracting_as_llamaindex>`.
25-
- In "layout mode": Support for plain text output similar to Markdown
26-
- In "layout mode": Support for JSON output
25+
- When used with :ref:`PyMuPDF Layout <pymupdf-layout>` : Support for plain text output similar to Markdown
26+
- When used with :ref:`PyMuPDF Layout <pymupdf-layout>` : Support for JSON output
2727

2828

2929
Functionality

docs/recipes.rst

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,11 @@
1818

1919
----
2020

21+
.. toctree::
22+
23+
recipes-ocr.rst
24+
25+
----
2126

2227
.. toctree::
2328

@@ -61,11 +66,6 @@
6166

6267
----
6368

64-
.. toctree::
65-
66-
recipes-ocr.rst
67-
68-
----
6969

7070
.. toctree::
7171
