Determine the "layer" of the text extracted #1814

You did not mention why the pixmap idea did not work ...

Detecting hiddenness in general is a difficult thing:

  • If there are drawings (not images, but things you would create via Page.draw_*), it is practically impossible to detect hidden text by analyzing the output of page.get_drawings(). At best you can determine whether some drawing rectangle overlaps your text and, with some tricks, whether the text was painted on the page before or after the drawing.
  • If you have images, the situation is somewhat better: if you call page.get_text("dict"...) and request that images are extracted as well, then the sequence of the image blocks and the text blocks reflects the sequence in which these el…
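
The image case can be sketched roughly as follows. This is only an illustration, not PyMuPDF code: the helper names `rects_overlap` and `text_possibly_covered` are made up here, and the sketch assumes that the order of the blocks mirrors the order in which they were put on the page, so an overlapping image block that comes later in the list may be covering the text. It works on plain dictionaries shaped like the entries of `page.get_text("dict")["blocks"]`, where `"type"` is 0 for text and 1 for images and `"bbox"` is `(x0, y0, x1, y1)`.

```python
def rects_overlap(a, b):
    """True if two (x0, y0, x1, y1) rectangles share a non-empty area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def text_possibly_covered(blocks):
    """Return (text_index, image_index) pairs where an image block overlaps
    a text block and appears later in the block list.

    Assumption: a later block was painted later, so it may hide the text.
    """
    hits = []
    for i, blk in enumerate(blocks):
        if blk["type"] != 0:  # 0 = text block
            continue
        for j in range(i + 1, len(blocks)):
            later = blocks[j]
            # 1 = image block; only overlapping images are of interest
            if later["type"] == 1 and rects_overlap(blk["bbox"], later["bbox"]):
                hits.append((i, j))
    return hits


# Hand-made sample data standing in for page.get_text("dict")["blocks"].
sample = [
    {"type": 0, "bbox": (50, 50, 200, 70)},    # text, followed by an overlapping image
    {"type": 1, "bbox": (40, 40, 300, 300)},   # image painted after the text
    {"type": 0, "bbox": (50, 400, 200, 420)},  # text with nothing after it
]
print(text_possibly_covered(sample))  # [(0, 1)]
```

A hit only says that the text might be hidden: a transparent or partially transparent image would still leave it visible, so treat the result as a list of candidates to inspect further.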

@maiiabocharova
Answer selected by JorjMcKie