Is it possible to get all bounding boxes and their associated text #1319

shahafabileah · 2024-05-24T23:26:10Z


Hey everyone,
I'd like to OCR a PDF document and then get structured info about all the bounding boxes in the document and the relevant text fragments in each one. Is that possible?
Looking at the source code, I see that a PdfContext is instantiated before the pipeline runs. It looks like it should be possible to call something like context.get_page_contexts and then, for each page context, page_context.pageinfo.get_textareas (or something similar). However, this context is not returned when the pipeline runs. Is there another way to get it?

def _run_pipeline(
    options: argparse.Namespace,
    plugin_manager: OcrmypdfPluginManager,
) -> ExitCode:
    with (...):
        ...
        context = PdfContext(options, work_folder, origin_pdf, pdfinfo, plugin_manager)

        # Validate options are okay for this pdf
        validate_pdfinfo_options(context)

        # Execute the pipeline
        optimize_messages = exec_concurrent(context, executor)

        exitcode = report_output_pdf(options, start_input_file, optimize_messages)
        return exitcode

For my reference:

OCRmyPDF/src/ocrmypdf/_pipelines/ocr.py

Line 184 in cb2f090

context = PdfContext(options, work_folder, origin_pdf, pdfinfo, plugin_manager)

OCRmyPDF/src/ocrmypdf/_jobcontext.py

Line 63 in cb2f090

class PageContext:

OCRmyPDF/src/ocrmypdf/pdfinfo/info.py

Line 1004 in cb2f090

    
    def get_textareas(self, visible: bool | None = None, corrupt: bool | None = None):

jbarlow83 · 2024-05-25T07:59:12Z

Maintainer

Those APIs are more geared toward OCRmyPDF's internal use.

The easiest way to get bounding boxes would be to first run OCRmyPDF on any PDFs you have that are missing text, and then use a library like pdfminer.six or pdfplumber (which provides a higher-level interface to the former). These libraries detect text in the PDF, whether it was produced by OCR or was already present in the original PDF.
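
As a rough illustration of that second step, here is a minimal sketch using pdfplumber. It assumes the PDF already has a text layer (for example after running `ocrmypdf input.pdf output.pdf`). The `TextBox` record and the helper names `words_to_boxes` and `extract_boxes` are made up for this example and are not part of OCRmyPDF or pdfplumber; the only pdfplumber calls used are `pdfplumber.open`, `pdf.pages`, `page.page_number`, and `page.extract_words()`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBox:
    """One word and its bounding box (hypothetical record type for this sketch)."""
    page: int      # 1-based page number
    text: str
    x0: float      # left edge, in PDF points
    top: float     # distance from the top of the page to the top of the word
    x1: float      # right edge
    bottom: float  # distance from the top of the page to the bottom of the word


def words_to_boxes(page_number, words):
    """Convert the word dicts returned by page.extract_words() into TextBox records."""
    return [
        TextBox(page_number, w["text"], w["x0"], w["top"], w["x1"], w["bottom"])
        for w in words
    ]


def extract_boxes(pdf_path):
    """Yield a TextBox for every word on every page of a PDF that has a text layer."""
    import pdfplumber  # third-party dependency: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            yield from words_to_boxes(page.page_number, page.extract_words())
```

Calling `list(extract_boxes("output.pdf"))` (with `output.pdf` standing in for your OCRed file) should then give one record per word. This works at word granularity; if you need lines or paragraphs instead, pdfminer.six's layout analysis may be a better fit.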
