Add a text or special character for images and tables when extracting text from PDF #1827

Da-vid21 · 2022-07-20T16:09:11Z

Da-vid21
Jul 20, 2022

One problem I have with PyMuPDF when extracting data from my chemistry papers concerns images of chemical compound structures and tables containing data. PyMuPDF skips the images, so after extraction I don't know where they were on the page. It also tries to parse the data in tables, but because the PDFs are so inconsistent, there is no single solution for tables.

If there is any way to add something like "Image 1" and "Table 1" and so on, that would be great. If there is any way I could also incorporate other libraries to fix the issue, that would be great too. Thanks


JorjMcKie · 2022-07-20T17:33:57Z

JorjMcKie
Jul 20, 2022
Maintainer

This is not an issue, but a typical Discussions post.

1 reply

Da-vid21 Jul 20, 2022
Author

Yeah, sorry about that, just got confused.

JorjMcKie · 2022-07-20T18:12:58Z

JorjMcKie
Jul 20, 2022
Maintainer

No real problem.
I am not sure whether your "images" are true images or PDF drawings - which are a completely different animal.

You can extract text and images together via page.get_text("dict") and similar variants, where both block types come with full position and metadata info.
The sequence in which these things are extracted reflects the sequence in which they are "painted" on the page.
But you can of course also sort those blocks by their position on the page - e.g. top-left to bottom-right.
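As a rough illustration of the above, here is a minimal sketch that sorts the blocks top-left to bottom-right and writes an "Image n" marker wherever an image block occurs. The helper names `blocks_to_text` and `page_to_text` are made up for this example; the block keys used ("type", "bbox", "lines", "spans", "text") are those of the "dict" output, where type 0 is text and type 1 is an image.

```python
def blocks_to_text(blocks):
    """Join blocks in reading order, replacing image blocks with "[Image n]"."""
    # bbox is (x0, y0, x1, y1): sort top-to-bottom, then left-to-right.
    ordered = sorted(blocks, key=lambda b: (b["bbox"][1], b["bbox"][0]))
    parts = []
    image_no = 0
    for block in ordered:
        if block["type"] == 1:  # image block
            image_no += 1
            parts.append(f"[Image {image_no}]")
        else:  # type 0: text block made of lines, each made of spans
            lines = [
                "".join(span["text"] for span in line["spans"])
                for line in block["lines"]
            ]
            parts.append("\n".join(lines))
    return "\n\n".join(parts)


def page_to_text(pdf_path, page_number=0):
    """Hypothetical convenience wrapper around blocks_to_text for one page."""
    import fitz  # PyMuPDF; imported here so blocks_to_text works without it

    doc = fitz.open(pdf_path)
    try:
        page = doc[page_number]
        return blocks_to_text(page.get_text("dict")["blocks"])
    finally:
        doc.close()
```

Sorting by the top-left corner alone is a simplification; multi-column pages would probably need a smarter ordering.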

Tables in PDF are just normal text. They are not per se identifiable as "table", and it is left to your wits to make sense of the single text pieces in whatever columns, etc.
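One possible starting point for "making sense" of table text, sketched under assumptions: take the word tuples from page.get_text("words"), which have the form (x0, y0, x1, y1, word, block_no, line_no, word_no), and group them into rows by vertical position. The helper `words_to_rows` and its tolerance `y_tol` are hypothetical, and this is only a heuristic, not a table detector.

```python
def words_to_rows(words, y_tol=3.0):
    """Group word tuples into rows of text, each row sorted left to right.

    Words whose top coordinate (y0) lies within y_tol points of the
    row's first word are treated as belonging to the same row.
    """
    rows = []  # each entry: [row_top, [(x0, word), ...]]
    for x0, y0, x1, y1, text, *_ in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(y0 - rows[-1][0]) <= y_tol:
            rows[-1][1].append((x0, text))
        else:
            rows.append([y0, [(x0, text)]])
    # Order the words of each row by their left edge.
    return [[text for _, text in sorted(cells)] for _, cells in rows]
```

With a real page this would be called as `words_to_rows(page.get_text("words"))`. Cells containing several words still come out as separate entries, so deciding on column boundaries (for example from gaps in x0) remains up to you.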

Presumably, your "images" are really PDF drawings. These are elementary commands for drawing lines, curves or rectangles, typically used for Gantt charts, block diagrams and such. This would explain why you feel that "images" are "skipped".
Drawings must be extracted separately via page.get_drawings().
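To get from single drawing commands to something like "Image 1", a rough sketch could merge the bounding boxes of nearby drawings into regions, each of which may then be treated as one figure. The helpers `merge_rects` and `drawing_regions` and the `gap` value are assumptions for illustration; the only PyMuPDF detail relied on is that each dict returned by page.get_drawings() carries its bounding box under "rect".

```python
def merge_rects(rects, gap=5.0):
    """Merge rectangles (x0, y0, x1, y1) that overlap or lie within gap points."""
    merged = []
    for rect in rects:
        current = tuple(rect)
        changed = True
        while changed:  # repeat until current touches no stored region
            changed = False
            for other in merged:
                if (current[0] <= other[2] + gap and other[0] <= current[2] + gap
                        and current[1] <= other[3] + gap and other[1] <= current[3] + gap):
                    merged.remove(other)
                    current = (min(current[0], other[0]), min(current[1], other[1]),
                               max(current[2], other[2]), max(current[3], other[3]))
                    changed = True
                    break
        merged.append(current)
    # Return regions top-to-bottom, then left-to-right.
    return sorted(merged, key=lambda r: (r[1], r[0]))


def drawing_regions(page, gap=5.0):
    """Candidate figure regions on a fitz.Page, built from its vector drawings."""
    return merge_rects([tuple(d["rect"]) for d in page.get_drawings()], gap)
```

Each region could then be inserted as a placeholder among the text blocks by its position. Note that table rules and underlines are drawings too, so some filtering by size is probably needed.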
