Extract Text from a PDF that has alternating colors for each row. #1842
I posted a question on Stack Overflow a month ago about extracting text from a PDF in which the rows have alternating colors. The link to the post is here: Stack Overflow question. I have also attached the PDF here.
I can easily extract the text using other Python packages such as tabula-py. My main struggle is grouping the data correctly: since there are no visible border lines between rows, there is no way to know where one row ends and the next begins. The details are in the Stack Overflow post linked above.
I would appreciate any help with this. Thank you.
Replies: 3 comments 3 replies
With PyMuPDF, this is possible to achieve. The approach is to
I noticed on Stack Overflow that you provided column delimiter coordinates. I wonder where you got them from. Anyway, here is a script that processes your page. Please feel free to ask for explanations.
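For readers without the attachment, here is a minimal sketch of one way to turn the alternating row colors into row boundaries. It is not the script from this reply. It assumes that each shaded row is drawn as a wide filled rectangle, that every gap between two shaded bands is exactly one unshaded row, and that no large background rectangle covers the whole table. The helper names (`shaded_bands_from_page`, `row_intervals`, `group_words`) and the file name are made up for illustration; `page.get_drawings()` and `page.get_text("words")` are existing PyMuPDF calls.

```python
from typing import List, Sequence, Tuple

Band = Tuple[float, float]  # (top, bottom) of one row


def shaded_bands_from_page(page, min_frac: float = 0.5) -> List[Band]:
    """Collect wide filled rectangles from a PyMuPDF page as (top, bottom) bands."""
    bands = []
    for path in page.get_drawings():
        if path.get("fill") is None:  # stroke-only paths are not row shading
            continue
        r = path["rect"]
        if r.width >= min_frac * page.rect.width:
            bands.append((r.y0, r.y1))
    return sorted(bands)


def row_intervals(shaded: Sequence[Band], top: float, bottom: float,
                  tol: float = 1.0) -> List[Band]:
    """Fill the gaps between shaded bands, assuming each gap is one unshaded row."""
    rows = []
    cursor = top
    for y0, y1 in sorted(shaded):
        if y0 - cursor > tol:
            rows.append((cursor, y0))  # unshaded row sitting in the gap
        rows.append((y0, y1))
        cursor = max(cursor, y1)
    if bottom - cursor > tol:
        rows.append((cursor, bottom))
    return rows


def group_words(words: Sequence[tuple], rows: Sequence[Band]) -> List[str]:
    """Assign each word tuple (x0, y0, x1, y1, text, ...) to the row holding its vertical center."""
    grouped = [[] for _ in rows]
    for w in words:
        y_mid = (w[1] + w[3]) / 2
        for i, (y0, y1) in enumerate(rows):
            if y0 <= y_mid < y1:
                grouped[i].append(w)
                break
    # Rough reading order inside a row: top to bottom, then left to right.
    return [" ".join(w[4] for w in sorted(g, key=lambda w: (round(w[1]), w[0])))
            for g in grouped]


# Possible usage with PyMuPDF (file name is hypothetical):
# import fitz
# page = fitz.open("input.pdf")[0]
# shaded = shaded_bands_from_page(page)
# rows = row_intervals(shaded, top=shaded[0][0], bottom=shaded[-1][1])
# for line in group_words(page.get_text("words"), rows):
#     print(line)
```

Using the vertical center of each word, not its top edge, keeps a word in its row even when its bounding box slightly overlaps the neighboring band.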
😎 Don't ever hesitate to ask. Actually, I have tried to document it well ... 😒 As for the column coordinates:
Here is the improved script. Hope it helps.
In the file, table columns are not visibly distinguishable - and would have to be derived somehow.
reformat.zip
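On deriving column boundaries when the file shows none: one possible heuristic, not necessarily what the attached script does, is to project all word boxes onto the x-axis and treat every sufficiently wide empty stretch as a column separator. The sketch below assumes word tuples in the layout returned by PyMuPDF's `page.get_text("words")`, that is `(x0, y0, x1, y1, text, block_no, line_no, word_no)`. The function names and the `min_gap` threshold are made up for illustration and would need tuning per document.

```python
from bisect import bisect_right
from typing import List, Sequence


def column_boundaries(words: Sequence[tuple], min_gap: float = 5.0) -> List[float]:
    """Return x-coordinates that separate columns, taken from empty vertical stretches."""
    spans = sorted((w[0], w[2]) for w in words)
    if not spans:
        return []
    # Merge overlapping horizontal extents of all words on the page.
    merged = [list(spans[0])]
    for x0, x1 in spans[1:]:
        if x0 <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], x1)
        else:
            merged.append([x0, x1])
    # A boundary sits in the middle of every gap that is wide enough.
    return [(a[1] + b[0]) / 2
            for a, b in zip(merged, merged[1:])
            if b[0] - a[1] >= min_gap]


def column_index(word: tuple, boundaries: Sequence[float]) -> int:
    """Return the 0-based column a word falls into, based on its left edge."""
    return bisect_right(boundaries, word[0])
```

This only works when no text crosses a column gap anywhere on the page, so a wide heading above the table would have to be excluded from `words` first.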