Binary file modified docs/images/layout-ocr-flow.png
49 changes: 37 additions & 12 deletions docs/pymupdf-layout/index.rst
@@ -138,28 +138,53 @@ Now we can happily load Office files and convert them as follows::
OCR support
~~~~~~~~~~~~~~~~~

The new layout-sensitive |PyMuPDF4LLM| version also evaluates whether a page would benefit from applying OCR to it. If its heuristics come to this conclusion, the built-in Tesseract-OCR module is automatically invoked. Its results are then handled like normal page content.

If a page contains (roughly) no text at all, but is covered with images or many character-sized vectors, a check is made using `OpenCV <https://pypi.org/project/opencv-python/>`_ whether text is *probably* detectable on the page at all. This is done to tell apart image-based text from ordinary pictures (like photographs).
**Critical: Import pymupdf.layout FIRST**
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the page does contain text but too many characters are unreadable (like "�����"), OCR is also executed, but **for the affected text areas only** -- not the full page. This way, we avoid losing already existing text and other content like images and vectors.

.. code-block:: python
   :emphasize-lines: 1

   import pymupdf.layout  # REQUIRED FIRST - enables OCR decision tree
   import pymupdf4llm  # Now OCR heuristics are active

   md_text = pymupdf4llm.to_markdown("scanned.pdf")
   # Auto: detects image pages → OCR → markdown

For these heuristics to work, we need both an existing :ref:`Tesseract installation <installation_ocr>` and `OpenCV <https://pypi.org/project/opencv-python/>`_ available in the Python environment. If either is missing, no OCR is attempted at all.

The decision tree for whether OCR is actually used or not depends on the following:

1. :ref:`PyMuPDF Layout is imported <pymupdf_layout_using>`

2. In the :ref:`PyMuPDF4LLM API <pymupdf4llm-api>` you have ``use_ocr`` enabled (this is set to ``True`` by default)

3. :ref:`Tesseract is correctly installed <installation_ocr>`

4. `OpenCV <https://pypi.org/project/opencv-python/>`_ is available in your Python environment

.. warning::
   Without ``import pymupdf.layout``, OCR is **never** attempted,
   even if Tesseract and OpenCV are installed.

**Complete Requirements** (all must be satisfied)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table:: OCR Decision Prerequisites
   :widths: 15 85
   :header-rows: 1

   * - Check
     - Requirement
   * - 1\. Layout
     - :ref:`PyMuPDF Layout is imported <pymupdf_layout_using>`
   * - 2\. OCR API
     - ``use_ocr`` is enabled in the :ref:`PyMuPDF4LLM API <pymupdf4llm-api>` (``True`` by default)
   * - 3\. Tesseract
     - :ref:`Tesseract OCR is correctly installed <installation_ocr>`
   * - 4\. OpenCV
     - Available in the Python environment (``pip install opencv-python``)
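
The requirements listed above can be checked up front before converting a document. The following is a minimal sketch using only the standard library. The helper names ``module_available`` and ``ocr_prerequisites`` are hypothetical, and looking for a ``tesseract`` executable on ``PATH`` or a ``TESSDATA_PREFIX`` environment variable is an assumption about how an installation might be detected, not necessarily the check the library performs itself. It only reports whether the packages can be located; ``import pymupdf.layout`` still has to come first in the actual program.

.. code-block:: python

   import importlib.util
   import os
   import shutil


   def module_available(name: str) -> bool:
       """Return True if the module can be located without fully importing it.

       For dotted names, find_spec imports the parent package first.
       """
       try:
           return importlib.util.find_spec(name) is not None
       except (ImportError, ValueError):
           return False


   def ocr_prerequisites(use_ocr: bool = True) -> dict:
       """Collect the four requirements as booleans (hypothetical helper)."""
       return {
           "layout": module_available("pymupdf.layout"),
           "use_ocr": use_ocr,
           # Assumption: Tesseract counts as present if its executable is on
           # PATH or TESSDATA_PREFIX is set; the library may detect it differently.
           "tesseract": shutil.which("tesseract") is not None
           or bool(os.environ.get("TESSDATA_PREFIX")),
           "opencv": module_available("cv2"),
       }


   status = ocr_prerequisites()
   missing = [name for name, ok in status.items() if not ok]
   if missing:
       print(f"OCR will be skipped, missing: {missing}")
   else:
       print("All OCR prerequisites found")

Running this in the target environment gives a quick hint as to why scanned pages might come back without text.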

**Smart OCR Heuristics** (Detailed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. image:: ../images/layout-ocr-flow.png
The new layout-sensitive |PyMuPDF4LLM| version also evaluates whether a page would benefit from applying OCR to it. If its heuristics come to this conclusion, the built-in Tesseract-OCR module is automatically invoked. Its results are then handled like normal page content.

----
If a page contains (roughly) **no text at all**, but is covered with **images or many character-sized vectors**, a check is made using `OpenCV <https://pypi.org/project/opencv-python/>`_ whether text is *probably* detectable on the page at all. This is done to tell apart **image-based text** from ordinary pictures (like photographs).

If the page **does contain text** but **too many characters are unreadable** (like "�����"), OCR is also executed, but **for the affected text areas only** – not the full page. This way, we avoid losing already existing text and other content like images and vectors.
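
The two cases above can be summarized as a small decision function. This is an illustrative sketch only, not the library's implementation: ``PageStats``, ``ocr_decision`` and all thresholds are hypothetical, and the OpenCV check is represented by a plain boolean rather than a real image analysis.

.. code-block:: python

   from dataclasses import dataclass

   REPLACEMENT_CHAR = "\ufffd"  # the "�" shown for unreadable characters


   @dataclass
   class PageStats:
       text: str                    # text extracted from the page
       image_coverage: float        # fraction of the page covered by images/vectors (0..1)
       text_likely_in_images: bool  # stand-in for the OpenCV "is text detectable?" check


   def ocr_decision(stats: PageStats, min_chars: int = 10,
                    min_coverage: float = 0.5, max_bad_ratio: float = 0.1) -> str:
       """Return "full_page", "text_areas_only" or "none" (thresholds are assumptions)."""
       chars = [c for c in stats.text if not c.isspace()]
       if len(chars) < min_chars:
           # (Roughly) no text: OCR only if the page is image-heavy and
           # the images probably contain text rather than photographs.
           if stats.image_coverage >= min_coverage and stats.text_likely_in_images:
               return "full_page"
           return "none"
       bad = sum(c == REPLACEMENT_CHAR for c in chars)
       if bad / len(chars) > max_bad_ratio:
           # Text exists but is largely unreadable: OCR the affected areas only.
           return "text_areas_only"
       return "none"


   scanned = PageStats(text="", image_coverage=0.9, text_likely_in_images=True)
   print(ocr_decision(scanned))

Keeping the partial case separate from the full-page case mirrors the point made above: text that is already readable, together with images and vectors, is left untouched.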

**OCR Decision Tree**
^^^^^^^^^^^^^^^^^^^^

.. image:: ../images/layout-ocr-flow.png

.. _pymupdf_layout_and_pymupdf4llm_api:
