Determine the "layer" of the text extracted #1814

maiiabocharova · 2022-07-14T10:13:28Z

Hello, can you please help me once again?

When I parse PDFs, I sometimes get text that is not actually visible. Is there a way to determine whether the extracted text is actually visible or lies somewhere "underneath"?
This didn't give me the expected results:

# make a pixmap of the page
pix = page.get_pixmap(dpi=150)
# make a matrix that transforms to pixmap coordinates
mat = page.rect.torect(pix.irect)
# search for text locations
rlist = page.search_for("ﬁle:///home/cvdesignr/cv/pd")
# check color environment of each occurrence
# we will check for "almost unicolor"
for r in rlist:
    if pix.color_topusage(clip=r * mat)[0] > 0.95:
        print("'pixmap' invisible here:", r)

Thank you for a great project and help!

Answered by JorjMcKie

JorjMcKie · 2022-07-14T13:30:41Z

Maintainer

You did not mention why the pixmap idea did not work ...

Detecting hidden text is difficult in general:

If there are drawings (not images, but vector graphics such as those you would create via the Page.draw_* methods), it is practically impossible to find out whether they hide text by analyzing the output of page.get_drawings(). At best you can determine whether some drawing rectangle overlaps your text and, with some tricks, also whether the text was painted on the page before or after the drawing.
If you have images, the situation is somewhat better: if you call page.get_text("dict", ...) and request that images are extracted as well, then the sequence of the image blocks and the text blocks reflects the sequence in which these elements were painted. But you would still have to find out whether an overlapping image in fact hides the text: for example, it could be transparent!
PDF text may have been written with text rendering mode 3 (command "3 Tr"), which always makes it invisible. This is mostly used for OCRed text in scanned PDFs. Unfortunately, page.get_text() has no access to this text property, as opposed to page.get_texttrace().
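
To illustrate the last point, here is a minimal sketch of filtering the output of page.get_texttrace() for invisible text. It assumes that each span dictionary carries a "type" key where 3 means ignored (hidden) text, and that each item of "chars" starts with the character's Unicode code point; please check both against the documentation of your PyMuPDF version. The helper names are made up for this example.

```python
def invisible_spans(trace):
    """Return (text, bbox) for every span written with render mode 3.

    `trace` is expected to be the list returned by page.get_texttrace().
    Assumption: span["type"] == 3 marks ignored (invisible) text and
    each entry of span["chars"] starts with the Unicode code point.
    """
    hidden = []
    for span in trace:
        if span.get("type") == 3:
            text = "".join(chr(ch[0]) for ch in span["chars"])
            hidden.append((text, tuple(span["bbox"])))
    return hidden


def invisible_spans_on_page(page):
    """Convenience wrapper (hypothetical name) for a PyMuPDF Page object."""
    return invisible_spans(page.get_texttrace())
```

With a real document you would call invisible_spans_on_page(page) and compare the returned bboxes with the rectangles from page.search_for().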

To determine the painting order of any object on the page, you can use page.get_bboxlog(). It returns a list of the bboxes of everything on the page: images, text, drawings, ...
The position in that list reflects the order of the respective painting operations.
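
As a rough sketch of how the painting order could be used: the helper below (a made-up name, not part of PyMuPDF) takes a text bbox and the list from page.get_bboxlog(), locates the text entry that contains the bbox, and reports every image or drawing painted later that overlaps it. It assumes each entry starts with a type string such as "fill-text", "fill-path" or "fill-image", followed by an (x0, y0, x1, y1) tuple. An overlap only makes the text a candidate for being covered, because the later object may still be transparent.

```python
def _intersects(a, b):
    """True if rectangles a and b, given as (x0, y0, x1, y1), share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def _contains(outer, inner, tol=1.0):
    """True if `inner` lies inside `outer`, allowing a small tolerance."""
    return (outer[0] - tol <= inner[0] and outer[1] - tol <= inner[1]
            and inner[2] <= outer[2] + tol and inner[3] <= outer[3] + tol)


def later_overlaps(text_bbox, bboxlog):
    """List (index, type, bbox) of objects painted after the text that overlap it."""
    # Find the first text entry whose bbox contains our text bbox.
    text_pos = next(
        (i for i, entry in enumerate(bboxlog)
         if entry[0].endswith("text") and _contains(entry[1], text_bbox)),
        None,
    )
    if text_pos is None:
        return []
    result = []
    for i, entry in enumerate(bboxlog):
        kind, bbox = entry[0], tuple(entry[1])
        # Only later, actually painted (fill/stroke) non-text objects count.
        painted = kind.startswith(("fill", "stroke")) and not kind.endswith("text")
        if i > text_pos and painted and _intersects(bbox, text_bbox):
            result.append((i, kind, bbox))
    return result
```

For a real page, something like later_overlaps(tuple(r), page.get_bboxlog()) could be called for each rectangle r returned by page.search_for().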

Given some text bbox, you can therefore determine in which bbox of the bboxlog it is located. The entry type also tells you the kind of text: for example, if one of the bboxlog items is ("ignore-text", (x0, y0, x1, y1)) and your text bbox is contained in fitz.Rect(x0, y0, x1, y1), then you know it is text hidden by 3 Tr. The other entry types can be checked in the same way.
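
The containment check just described might look like the following sketch. The function name and the tolerance parameter are assumptions made for illustration; the only PyMuPDF-specific input is the list returned by page.get_bboxlog().

```python
def hidden_by_render_mode(text_bbox, bboxlog, tol=1.0):
    """True if text_bbox lies inside an "ignore-text" entry of the bboxlog.

    text_bbox is (x0, y0, x1, y1); bboxlog is the list from
    page.get_bboxlog(). `tol` (an assumption of this sketch) absorbs
    small rounding differences between the two sources.
    """
    tx0, ty0, tx1, ty1 = text_bbox
    for entry in bboxlog:
        kind, (x0, y0, x1, y1) = entry[0], entry[1]
        if kind != "ignore-text":
            continue
        if (x0 - tol <= tx0 and y0 - tol <= ty0
                and tx1 <= x1 + tol and ty1 <= y1 + tol):
            return True
    return False
```

Applied to the question above, one could loop over page.search_for(...) and call hidden_by_render_mode(tuple(r), page.get_bboxlog()) for each hit r.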

1 reply

maiiabocharova Jul 15, 2022
Author

Thank you a lot! I'll try all of the suggestions, thank you!
