
Add a text or special character for images and tables when extracting text from PDF #1827

Answered by JorjMcKie
Da-vid21 asked this question in Q&A

No real problem.
I am not sure if your "images" are true images or PDF drawings - which is a completely different animal.

You can extract text and images together via page.get_text("dict") etc., where both block types come with full position and metadata info.
The sequence in which these things are extracted reflects the sequence in which they are being "painted" on the page.
But you can of course also sort those blocks by their position on the page - e.g. top-left to bottom-right.
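
A minimal sketch of that idea, not taken from the original answer: it sorts the blocks of a "dict" extraction top-left to bottom-right and writes a placeholder wherever an image block occurs. The helper names `blocks_to_text` and `extract_with_markers` and the `[IMAGE]` marker are made up for illustration. It relies on each block carrying a `type` (0 = text, 1 = image) and a `bbox` of the form (x0, y0, x1, y1).

```python
def blocks_to_text(page_dict, image_marker="[IMAGE]"):
    """Join text blocks and image placeholders in top-to-bottom order.

    page_dict is the result of page.get_text("dict").
    """
    # Sort by the top edge (y0) first, then by the left edge (x0).
    blocks = sorted(page_dict["blocks"], key=lambda b: (b["bbox"][1], b["bbox"][0]))
    parts = []
    for block in blocks:
        if block["type"] == 1:  # image block: emit the placeholder only
            parts.append(image_marker)
        elif block["type"] == 0:  # text block: concatenate the spans of each line
            lines = [
                "".join(span["text"] for span in line["spans"])
                for line in block["lines"]
            ]
            parts.append("\n".join(lines))
    return "\n".join(parts)


def extract_with_markers(path):
    """Return one marked-up string per page of the PDF at `path`."""
    import fitz  # PyMuPDF; imported here so blocks_to_text works without it

    with fitz.open(path) as doc:
        return [blocks_to_text(page.get_text("dict")) for page in doc]
```

Sorting by the bbox corner is only a rough approximation of reading order; multi-column pages would need something smarter. PDF drawings (vector graphics) do not show up as image blocks, so they get no marker here.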

Tables in PDF are just normal text. They are not per se identifiable as "table" - just normal text, and it is left to your wits to make sense out of the single text pieces in whatever columns, etc.
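
As one rough way of "making sense" of such pieces (a heuristic sketch, not something the answer prescribes): group the tuples returned by page.get_text("words"), which have the form (x0, y0, x1, y1, word, block_no, line_no, word_no), into rows by their vertical position. The function name `group_words_into_rows` and the tolerance `y_tol` are assumptions for illustration.

```python
def group_words_into_rows(words, y_tol=3.0):
    """Cluster word tuples into rows whose top edges lie within y_tol points.

    `words` is a list like the one returned by page.get_text("words").
    Returns a list of rows, each a list of word strings ordered left to right.
    """
    rows = []  # each entry: [reference_y0, [word tuples in that row]]
    for w in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(w[1] - rows[-1][0]) <= y_tol:
            rows[-1][1].append(w)  # close enough to the current row
        else:
            rows.append([w[1], [w]])  # start a new row
    # Within each row, order the words by their left edge.
    return [[w[4] for w in sorted(row, key=lambda w: w[0])] for _, row in rows]
```

Whether the resulting rows really form a table, and where the column boundaries are, still has to be decided from the x-coordinates of the words in each row.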

Presumably, yo…

Replies: 2 comments 1 reply

Answer selected by JorjMcKie
Category
Q&A
2 participants
Converted from issue

This discussion was converted from issue #1826 on July 20, 2022 17:34.