docs/pymupdf-layout/index.rst
21 additions & 4 deletions
@@ -1,7 +1,7 @@
 
 .. include:: ../header.rst
 
-.. _pymupdf-layout
+.. _pymupdf-layout:
 
 
 PyMuPDF Layout
@@ -22,6 +22,8 @@ Install from |PyPI| with::
     pip install pymupdf-layout
 
 
+.. _pymupdf_layout_using:
+
 Using
 ----------------------------------
 
@@ -118,16 +120,31 @@ Now we can happily load Office files and convert them as follows::
     md = pymupdf4llm.to_markdown("sample.docx")
 
 
+.. _pymupdf_layout_ocr_support:
+
 OCR support
 ~~~~~~~~~~~~~~~~~
 
-The new layout-sensitive PyMuPDF4LLM version also evaluates whether a page would benefit from applying OCR to it. If its heuristics come to this conclusion, the built-in Tesseract-OCR module is automatically invoked. Its results are then handled like normal page content.
+The new layout-sensitive |PyMuPDF4LLM| version also evaluates whether a page would benefit from applying OCR to it. If its heuristics come to this conclusion, the built-in Tesseract-OCR module is automatically invoked. Its results are then handled like normal page content.
 
-If a page contains (roughly) no text at all, but is covered with images or many character-sized vectors, a check is made using `OpenCV <https://pypi.org/project/opencv-python/>`_ whether text is *probably* detectable on the page at all. This is done to tell apart image-based text from ordinary pictures (like photographies).
+If a page contains (roughly) no text at all, but is covered with images or many character-sized vectors, a check is made using `OpenCV <https://pypi.org/project/opencv-python/>`_ whether text is *probably* detectable on the page at all. This is done to tell apart image-based text from ordinary pictures (like photographs).
 
 If the page does contain text but too many characters are unreadable (like "�����"), OCR is also executed, but **for the affected text areas only** -- not the full page. This way, we avoid losing already existing text and other content like images and vectors.
 
-For these heuristics to work we need both, an existing Tesseract installation and the availability of OpenCV in the Python environment. If either is missing, no OCR is attempted at all.
+For these heuristics to work we need both, an existing :ref:`Tesseract installation <installation_ocr>` and the availability of `OpenCV <https://pypi.org/project/opencv-python/>`_ in the Python environment. If either is missing, no OCR is attempted at all.
+
+The decision tree for whether OCR is actually used or not depends on the following:
+
+1. :ref:`PyMuPDF Layout is imported <pymupdf_layout_using>`
+
+2. In the :ref:`PyMuPDF4LLM API <pymupdf4llm-api>` you have `use_ocr` enabled (this is set to `True` by default)
+
+3. :ref:`Tesseract is correctly installed <installation_ocr>`
+
+4. `OpenCV <https://pypi.org/project/opencv-python/>`_ is available in your Python environment
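
The four preconditions added in this hunk can be sketched as a small pre-flight check. This is only an illustration under stated assumptions, not how PyMuPDF Layout decides internally: the helper names ``opencv_available``, ``tesseract_available`` and ``ocr_may_run`` are hypothetical, and looking for a ``tesseract`` executable on ``PATH`` is an assumed stand-in for "Tesseract is correctly installed" (the library's own detection may differ).

```python
import importlib.util
import shutil
from typing import Optional


def opencv_available() -> bool:
    """Return True if the OpenCV Python bindings (module ``cv2``) can be found."""
    return importlib.util.find_spec("cv2") is not None


def tesseract_available() -> bool:
    """Assumption: treat Tesseract as installed if its executable is on PATH.

    The library's own detection may work differently, so this is an approximation.
    """
    return shutil.which("tesseract") is not None


def ocr_may_run(
    layout_imported: bool,
    use_ocr: bool = True,
    has_tesseract: Optional[bool] = None,
    has_opencv: Optional[bool] = None,
) -> bool:
    """Hypothetical helper: True only if all four listed conditions hold."""
    # Probe the environment only when the caller did not supply a value.
    if has_tesseract is None:
        has_tesseract = tesseract_available()
    if has_opencv is None:
        has_opencv = opencv_available()
    return bool(layout_imported and use_ocr and has_tesseract and has_opencv)
```

Even when such a check passes, the documented behaviour is that OCR is only invoked for pages (or text areas) that the heuristics flag; the check just tells you whether OCR can be attempted at all.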
 |PyMuPDF4LLM| is aimed to make it easier to extract |PDF| content in the format you need for **LLM** & **RAG** environments. It supports :ref:`Markdown extraction <extracting_as_md>` as well as :ref:`LlamaIndex document output <extracting_as_llamaindex>`.
 
-When using |PyMuPDF4LLM| with PyMuPDF-Layout, page layout detection will be greatly improved. This is true for table detection, but also for the detection of page headers and footers, footnotes, list items and text paragraphs. In addition two new methods become available, `to_json()` and `to_text()`.
+When using |PyMuPDF4LLM| with PyMuPDFLayout, page layout detection will be greatly improved. This is true for table detection, but also for the detection of page headers and footers, footnotes, list items and text paragraphs. In addition two new methods become available, `to_json()` and `to_text()`.
 
 .. important::
 
@@ -22,8 +22,8 @@ Features
 - Support for image and vector graphics extraction (and inclusion of references in the MD text)
 - Support for page chunking output.
 - Direct support for output as :ref:`LlamaIndex Documents <extracting_as_llamaindex>`.
-- In "layout mode": Support for plain text output similar to Markdown
-- In "layout mode": Support for JSON output
+- When used with :ref:`PyMuPDF Layout <pymupdf-layout>` : Support for plain text output similar to Markdown
+- When used with :ref:`PyMuPDF Layout <pymupdf-layout>` : Support for JSON output