Is it possible to get all bounding boxes and their associated text #1319
Those APIs are more geared toward OCRmyPDF's internal use. The easiest way to get bounding boxes is to first run OCRmyPDF on any PDFs you have that are missing text, and then use a library like pdfminer.six or pdfplumber (which provides a higher-level interface to the former). These libraries detect text in the PDF, whether it was produced by OCR or was already present in the original PDF.
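As a rough sketch of that workflow (not an official OCRmyPDF recipe): the snippet below assumes `ocrmypdf` and `pdfplumber` are installed, and the helper names `words_to_boxes`, `extract_boxes`, and `ocr_then_extract` are made up for illustration. It groups text at the word level using pdfplumber's `extract_words()`; the file names in the usage note are placeholders.

```python
from typing import Any, Dict, List


def words_to_boxes(page_number: int, words: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert pdfplumber word dicts into simple (page, text, bbox) records.

    Hypothetical helper: pdfplumber reports each word with the keys
    "text", "x0", "top", "x1", and "bottom".
    """
    return [
        {
            "page": page_number,
            "text": w["text"],
            # (x0, top, x1, bottom); "top" and "bottom" are measured
            # downward from the top edge of the page.
            "bbox": (w["x0"], w["top"], w["x1"], w["bottom"]),
        }
        for w in words
    ]


def extract_boxes(pdf_path: str) -> List[Dict[str, Any]]:
    """Collect word-level bounding boxes from every page of a PDF."""
    import pdfplumber  # third-party; imported lazily so the helper above has no dependencies

    boxes: List[Dict[str, Any]] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            boxes.extend(words_to_boxes(page.page_number, page.extract_words()))
    return boxes


def ocr_then_extract(input_pdf: str, output_pdf: str) -> List[Dict[str, Any]]:
    """Run OCRmyPDF first, then read the text layer back with pdfplumber."""
    import ocrmypdf  # third-party; imported lazily

    # skip_text=True leaves pages that already contain text untouched
    # and only OCRs the pages that have none.
    ocrmypdf.ocr(input_pdf, output_pdf, skip_text=True)
    return extract_boxes(output_pdf)
```

Calling `ocr_then_extract("scan.pdf", "scan_ocr.pdf")` would then return one record per word. If you need lines or paragraphs rather than words, you would have to group the words yourself or drop down to pdfminer.six's layout objects.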
Hey everyone,
I'd like to OCR a PDF document and then get structured info about all the bounding boxes in the document and the relevant text fragments in each one. Is that possible?
Looking at the source code, I see that there's a PdfContext that is instantiated before the pipeline runs. It looks like it should be possible to do something like context.get_page_contexts, and then call page_context.pageinfo.get_textareas (or something like that) on each one. However, this context is not returned when the pipeline runs. Is there another way to get it?
For my reference:
OCRmyPDF/src/ocrmypdf/_pipelines/ocr.py
Line 184 in cb2f090
OCRmyPDF/src/ocrmypdf/_jobcontext.py
Line 63 in cb2f090
OCRmyPDF/src/ocrmypdf/pdfinfo/info.py
Line 1004 in cb2f090