Determine the "layer" of the text extracted #1814
-
Hello, can you please help me once again? When I parse PDFs, I sometimes get text that is not actually visible. Is there a way to determine whether the extracted text is actually visible or lies somewhere "underneath"?
Thank you for a great project and help!
-
To determine the painting order of any object on the page, you can use page.get_bboxlog(). Given some text bbox, you can then determine in which bbox of the bboxlog it is located. The bbox type also tells you the type of text, e.g. if one of the bboxlog items is …
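The lookup described above could be sketched roughly as follows. This is only an illustration under assumptions: the log is taken to come from PyMuPDF's Page.get_bboxlog(), each item is assumed to start with a type string followed by an (x0, y0, x1, y1) rectangle, and the type names ("fill-text", "ignore-text", "fill-path", ...) are the ones listed in the PyMuPDF documentation. The helper names and the tolerance value are made up for this example.

```python
from typing import List, Optional, Sequence, Tuple

BBox = Sequence[float]  # (x0, y0, x1, y1)


def contains(outer: BBox, inner: BBox, tol: float = 1.0) -> bool:
    """True if `inner` lies inside `outer`, allowing a small tolerance."""
    return (outer[0] - tol <= inner[0] and outer[1] - tol <= inner[1]
            and outer[2] + tol >= inner[2] and outer[3] + tol >= inner[3])


def overlaps(a: BBox, b: BBox) -> bool:
    """True if the two rectangles share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def locate_text(bboxlog: Sequence[Sequence], text_bbox: BBox) -> Optional[Tuple[int, str]]:
    """Return (index, type) of the first text item in the log enclosing text_bbox."""
    for i, item in enumerate(bboxlog):
        kind, bbox = item[0], item[1]
        if kind.endswith("text") and contains(bbox, text_bbox):
            return i, kind
    return None


def painted_later(bboxlog: Sequence[Sequence], index: int, text_bbox: BBox) -> List[Tuple[int, str]]:
    """Non-text items logged after `index` whose rectangle overlaps text_bbox."""
    return [(i, item[0]) for i, item in enumerate(bboxlog)
            if i > index and not item[0].endswith("text") and overlaps(item[1], text_bbox)]


# Hypothetical usage with PyMuPDF (file name is a placeholder):
#   import fitz
#   page = fitz.open("input.pdf")[0]
#   bboxlog = page.get_bboxlog()
#   hit = locate_text(bboxlog, span_bbox)   # span_bbox taken from page.get_text("dict")
#   if hit and hit[1] == "ignore-text": ... # documented type for hidden text (render mode 3)
```

An "ignore-text" hit suggests the text was written in a hidden render mode. A non-empty result from painted_later only says that something was painted later over the same area, not that it is opaque, so it is a hint rather than proof.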
You did not mention why the pixmap idea did not work ...
Detecting hiddenness in general is a difficult thing:
- If text is covered by a drawing (e.g. one made with Page.draw_*) it is practically impossible to find that out by analyzing the output of page.get_drawings(). At best you can determine whether some drawing rectangle overlaps your text and, with some tricks, also whether the text has been painted on the page before or after the drawing.
- If you use page.get_text("dict"...) and request that images are also extracted, then the sequence of the image blocks and the text blocks reflects the sequence in which these el…
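As a rough sketch of the two checks mentioned in this list: the helpers below work on the plain data structures that page.get_drawings() and page.get_text("dict")["blocks"] return. They assume each drawing dict has "rect" and "fill" keys, and each block dict has "type" (1 for image) and "bbox" keys, as described in the PyMuPDF documentation. The function names are invented for this example, and the second helper relies on the assumption stated in the reply that block order mirrors the order of the elements on the page.

```python
from typing import Any, Dict, List, Sequence

BBox = Sequence[float]  # (x0, y0, x1, y1)


def rects_overlap(a: BBox, b: BBox) -> bool:
    """True if the two rectangles share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def filled_drawings_over(drawings: List[Dict[str, Any]], text_bbox: BBox) -> List[Dict[str, Any]]:
    """Filled drawings whose rectangle overlaps text_bbox.

    This reports overlap only; it says nothing about painting order.
    """
    return [d for d in drawings
            if d.get("fill") is not None and rects_overlap(d["rect"], text_bbox)]


def images_after_text(blocks: List[Dict[str, Any]], text_index: int) -> List[Dict[str, Any]]:
    """Image blocks listed after blocks[text_index] that overlap its bbox."""
    text_bbox = blocks[text_index]["bbox"]
    return [b for b in blocks[text_index + 1:]
            if b.get("type") == 1 and rects_overlap(b["bbox"], text_bbox)]


# Hypothetical usage with PyMuPDF (file name is a placeholder):
#   import fitz
#   page = fitz.open("input.pdf")[0]
#   drawings = page.get_drawings()
#   blocks = page.get_text("dict")["blocks"]
#   suspects = filled_drawings_over(drawings, blocks[0]["bbox"])
```

Either result is a candidate list to inspect further, for example by rendering the area to a pixmap, not a verdict on visibility.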