Text behind drawing #1775
Hi there! I know that text visibility is a hard problem, so I'll get straight to the point. This is only an example; I'm looking for a general solution if possible. The blue lines in the table are text that I have selected to show that this text is behind the drawing. I believe I can't redact it, because that would also erase the table text. Is it feasible to rewrite the entire PDF (including drawings, images, etc.) page by page into a new PDF, excluding any text rectangle that lies entirely behind a drawing (detecting that it is behind by rendering order)? If so, what is needed to reconstruct a page and copy its drawings, images and text? Thanks in advance! The page: hidden (not 3 Tr) text selected
Replies: 2 comments 1 reply
I am not sure I understand what you actually want:
If the first aspect: The second aspect is a little trickier, because text extraction will not tell you about visibility. But note that a fully reliable analysis is still not guaranteed: a covering image may be transparent, a drawing in the same place need not make the things underneath invisible, etc.
Thanks! I want to remove the "invisible" text from the page, because when I use tools like camelot-py the table extraction fails or returns the table with cell values contaminated by this "invisible" text (screenshot 1 is the table, and the last screenshot shows that the cell values are polluted with strings that are behind the table drawing). But your second response will probably help me rebuild the page without this invisible text, thanks!
I am not sure I understand what you actually want:
If the first aspect:
This is a no-brainer: text extraction always works if it actually is text (which means: not everything looking like text is text)
The second aspect is a little trickier, because text extraction will not tell you about visibility.
But there are still ways to find things out:
page.get_bboxlog()
returns a list of rectangles of stuff being shown on a page, together with the type of content wrapped by the rect: text (including whether Tr 3), images or drawings. The sequence in the list repr…
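To illustrate how that list might be used for the original question, here is a minimal sketch, not a definitive solution. It assumes that the order of entries reflects the order in which content is drawn, and that entries are `(type, (x0, y0, x1, y1))` tuples with type strings such as `"fill-text"`, `"ignore-text"`, `"fill-path"` and `"fill-image"`. The helper names `contains`, `covered_text_rects` and `covered_text_from_pdf` are made up for this example. As noted above, a covering bbox does not prove invisibility (the image may be transparent, the path may not fill its whole bbox), so treat the result as candidates only.

```python
from typing import List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

# Assumption: only filled content can hide what was drawn before it.
COVERING_TYPES = {"fill-path", "fill-image", "fill-shade"}
# "ignore-text" (Tr 3) is deliberately left out: it is invisible anyway.
TEXT_TYPES = {"fill-text", "stroke-text"}


def contains(outer: Rect, inner: Rect, tol: float = 0.5) -> bool:
    """True if `inner` lies entirely within `outer`, allowing a small tolerance."""
    return (
        outer[0] - tol <= inner[0]
        and outer[1] - tol <= inner[1]
        and outer[2] + tol >= inner[2]
        and outer[3] + tol >= inner[3]
    )


def covered_text_rects(bboxlog: Sequence[Tuple[str, Rect]]) -> List[Rect]:
    """Return text rectangles fully inside the bbox of something drawn later."""
    entries = list(bboxlog)
    hidden: List[Rect] = []
    for i, (kind, rect) in enumerate(entries):
        if kind not in TEXT_TYPES:
            continue
        # Only content that appears later in the log can lie on top of this text.
        for later_kind, later_rect in entries[i + 1:]:
            if later_kind in COVERING_TYPES and contains(later_rect, rect):
                hidden.append(tuple(rect))
                break
    return hidden


def covered_text_from_pdf(path: str, page_number: int = 0) -> List[Rect]:
    """Hypothetical convenience wrapper; requires PyMuPDF to be installed."""
    import fitz  # PyMuPDF

    doc = fitz.open(path)
    page = doc[page_number]
    return covered_text_rects(page.get_bboxlog())
```

The returned rectangles are only candidates for "text behind a drawing"; each one should be checked (for example by rendering the page) before any text is dropped from a rebuilt page.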