-
One problem I have been having with PyMuPDF on my chemistry papers comes up when I extract data from PDFs that contain images of chemical compound structures and tables of data. PyMuPDF skips the images, so after extraction I don't know where in the text the images would have been. It also tries to parse the data from tables, but because the PDFs are so inconsistent, there is no one solution for tables. If there is any way to add something like "Image 1" and "Table 1" and so on, that would be great. If there is any way I could also incorporate other libraries to fix the issue, that would be great. Thanks
Replies: 2 comments 1 reply
-
This is not an issue, but a typical Discussions post.
-
No real problem.
I am not sure whether your "images" are true images or PDF drawings, which are a completely different animal. Presumably, your "images" are really PDF drawings. These are elementary commands for drawing lines, curves or rectangles, typically used for Gantt charts, block diagrams and such. This would explain why you feel that "images" are "skipped".
You can extract text and images together via page.get_text("dict") and similar calls, where both block types come with full position and metadata info. The sequence in which these things are extracted reflects the sequence in which they are "painted" on the page.
But you can also sort those blocks by their position on the page, of course, e.g. top-left to bottom-right.
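As a rough illustration of the sorting idea, here is a minimal sketch that walks the blocks of a page.get_text("dict") result from top-left to bottom-right and drops an "[Image N]" marker wherever an image block sits. The function names `label_blocks` and `label_pdf` and the marker format are my own choices, not part of PyMuPDF, and this only helps if the structures are true embedded images rather than drawings.

```python
def label_blocks(page_dict, image_no=0):
    """Return (lines, image_no): text lines in reading order with image markers.

    `page_dict` is assumed to have the shape of page.get_text("dict"):
    a "blocks" list whose items carry "type" (0 = text, 1 = image)
    and "bbox" (x0, y0, x1, y1).
    """
    # Sort by the top edge first, then by the left edge.
    blocks = sorted(page_dict["blocks"], key=lambda b: (b["bbox"][1], b["bbox"][0]))
    lines = []
    for block in blocks:
        if block["type"] == 1:  # image block: emit a numbered placeholder
            image_no += 1
            lines.append(f"[Image {image_no}]")
        else:  # text block: join the spans of every line
            for line in block["lines"]:
                lines.append("".join(span["text"] for span in line["spans"]))
    return lines, image_no


def label_pdf(path):
    """Hypothetical helper: label every page, numbering images across the file."""
    import fitz  # PyMuPDF; imported here so label_blocks needs no dependency

    lines, image_no = [], 0
    with fitz.open(path) as doc:
        for page in doc:
            page_lines, image_no = label_blocks(page.get_text("dict"), image_no)
            lines.extend(page_lines)
    return lines
```

A plain top-then-left sort is a simplification: on multi-column pages it will interleave the columns, so the sort key would need adjusting there.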
Tables in PDF are just normal text. They are not identifiable per se as a "table", and it is left to your wits to make sense of the individual text pieces in whatever columns, etc.
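One way to start making sense of those pieces, sketched under assumptions: take word tuples in the layout returned by page.get_text("words"), that is (x0, y0, x1, y1, text, block_no, line_no, word_no), and group them into rows by their vertical position. The function name `words_to_rows` and the `y_tolerance` value are made up for illustration and would need tuning per document. Recent PyMuPDF versions also offer page.find_tables(), which may be worth trying first.

```python
def words_to_rows(words, y_tolerance=3.0):
    """Group word tuples into rows, top to bottom, each row left to right.

    Words whose top edge lies within `y_tolerance` points of the first
    word of the current row are treated as belonging to that row.
    """
    rows = []  # each entry: [reference_y, [(x0, text), ...]]
    for x0, y0, x1, y1, text, *_ in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(y0 - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append((x0, text))
        else:
            rows.append([y0, [(x0, text)]])
    # Order the cells of each row by their left edge and keep only the text.
    return [[text for _, text in sorted(cells)] for _, cells in rows]
```

With a real page this would be called as `words_to_rows(page.get_text("words", clip=table_rect))`, where `table_rect` is a rectangle you have determined yourself. Splitting rows into proper columns (for example by gaps in x0) is left out here.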
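If the structures do turn out to be vector drawings, a possible workaround is to find the area they cover and render that area to a raster image. The sketch below assumes page.get_drawings() returns one dict per path with a "rect" entry; the helper names `union_bbox` and `save_drawings_as_image` are hypothetical.

```python
def union_bbox(rects):
    """Smallest (x0, y0, x1, y1) box that covers every rectangle in `rects`."""
    rects = [tuple(r) for r in rects]
    if not rects:
        return None
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )


def save_drawings_as_image(page, out_path):
    """Render the area covered by a page's vector drawings to an image file."""
    import fitz  # PyMuPDF

    bbox = union_bbox(path["rect"] for path in page.get_drawings())
    if bbox is None:
        return None  # no vector drawings on this page
    page.get_pixmap(clip=fitz.Rect(bbox)).save(out_path)
    return bbox
```

Note that this lumps every drawing on the page into one box, including table rules and underlines. To get one image per structure, the rectangles would first have to be clustered into separate groups.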