Extract Text from a PDF that has alternating colors for each row. #1842

joeanton719 · 2022-07-29T07:42:17Z

joeanton719
Jul 29, 2022

A month ago I posted a question on Stack Overflow about extracting text from a PDF whose rows have alternating background colors. The link to the Stack Overflow post is here: Stack Overflow qtn

I have also attached the PDF here:
page.pdf

I can easily extract the text with other Python packages such as tabula-py. My main struggle is grouping the data correctly: since there are no visible border lines between the rows, there is no obvious way to tell where one row ends and the next begins.

I described the details in the Stack Overflow post linked above. I would appreciate any help with this. Thank you.

Answered by JorjMcKie

Jul 29, 2022

JorjMcKie · 2022-07-29T13:10:45Z

JorjMcKie
Jul 29, 2022
Maintainer

With PyMuPDF, this is possible to achieve. The approach is to

1. extract the coordinates of the visible drawings: the lines and rectangles used as row shaders,
2. extract the text and map each text piece's bounding box to the right row.
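
The two steps above can be sketched roughly as follows. This is not the script from reformat.zip, only a minimal illustration. The helper names (`row_boundaries`, `group_words_by_row`, `load_drawings_and_words`) and the thresholds `min_width` and `tol` are assumptions made for the example, and it assumes the row shading shows up as wide filled rectangles in `page.get_drawings()`.

```python
from bisect import bisect_right
from collections import defaultdict


def row_boundaries(drawings, min_width=100.0, tol=1.0):
    """Collect the top and bottom y values of wide, filled rectangles.

    `drawings` is a list of dicts shaped like the items returned by
    PyMuPDF's page.get_drawings(): each has a "rect" (x0, y0, x1, y1)
    and, for filled paths, a "fill" color. min_width and tol are
    assumed thresholds that will need tuning per document.
    """
    ys = []
    for d in drawings:
        x0, y0, x1, y1 = d["rect"]
        if d.get("fill") is not None and (x1 - x0) >= min_width:
            ys += [y0, y1]
    merged = []
    for y in sorted(ys):
        # Treat edges closer than `tol` as the same boundary.
        if not merged or y - merged[-1] > tol:
            merged.append(y)
    return merged


def group_words_by_row(words, boundaries):
    """Map each word to the row whose y range contains its vertical midpoint.

    `words` are tuples shaped like page.get_text("words") items:
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    Words outside all boundaries are skipped; input order is preserved.
    """
    rows = defaultdict(list)
    for x0, y0, x1, y1, text, *_ in words:
        mid = (y0 + y1) / 2
        i = bisect_right(boundaries, mid) - 1
        if 0 <= i < len(boundaries) - 1:
            rows[i].append(text)
    return [rows[i] for i in sorted(rows)]


def load_drawings_and_words(path, page_number=0):
    """Read the raw inputs from a PDF page with PyMuPDF."""
    import fitz  # PyMuPDF; imported here so the helpers above run without it

    with fitz.open(path) as doc:
        page = doc[page_number]
        return page.get_drawings(), page.get_text("words")
```

With a real file, something like `drawings, words = load_drawings_and_words("page.pdf")` followed by `group_words_by_row(words, row_boundaries(drawings))` would give one list of words per row. Using the band edges as boundaries means an unshaded strip between two shaded bands is also treated as a row.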

I noticed on Stack Overflow that you provided column delimiter coordinates. I wonder where you got them from.
In the file, table columns are not visibly distinguishable, so they would have to be derived somehow.

Anyway, here is a script that processes your page. Please feel free to ask for explanations.
reformat.zip

1 reply

joeanton719 Jul 29, 2022
Author

@JorjMcKie Thank you so much for the response! Right now, I am just going through your code and trying to figure out what each line of code does 😅. Perhaps, I might be able to learn better during the weekend.

To answer your question about the column delimiter coordinates:

What I usually do is open the PDF in Adobe, then go to View > Show/Hide > Cursor Coordinates.

A black box then appears at the top-left corner of the screen showing the current mouse pointer coordinates. I place the mouse cursor between each pair of columns on the PDF page and note the x coordinate of each gap. Luckily, the column dimensions are fixed, so the coordinates stay the same for the rest of the PDF pages.

Note: The coordinates need to be in points. By default, the cursor coordinates are usually shown in inches. To switch the unit to points, go to Edit > Preferences, click "Units & Guides" near the bottom of the list on the left, and then, under the Units section at the top, choose Points from the Page & Ruler Units drop-down list.
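
For illustration, here is a small sketch of how such hand-measured x coordinates could be used to sort words into columns. The function names and the sample delimiter values are made up for the example, and it assumes the x values read in the viewer are measured from the left page edge, the same way as the x coordinates in the extracted words. One inch is 72 points, so values read in inches can be converted instead of changing the viewer preference.

```python
from bisect import bisect_right

POINTS_PER_INCH = 72.0  # 1 point = 1/72 inch


def inches_to_points(value):
    """Convert a coordinate read in inches to points."""
    return value * POINTS_PER_INCH


def split_into_columns(words, delimiters):
    """Place each word in a column based on its left x coordinate.

    `words` are (x0, y0, x1, y1, text, ...) tuples and `delimiters`
    is a sorted list of x coordinates in points. N delimiters give
    N + 1 columns. Words keep their input order within a column.
    """
    cells = [[] for _ in range(len(delimiters) + 1)]
    for x0, y0, x1, y1, text, *_ in words:
        cells[bisect_right(delimiters, x0)].append(text)
    return [" ".join(cell) for cell in cells]


# Hypothetical delimiters and words, for illustration only.
delimiters = [100.0, 200.0]
row_words = [
    (10, 0, 40, 10, "ID"),
    (120, 0, 150, 10, "John"),
    (155, 0, 190, 10, "Smith"),
    (210, 0, 230, 10, "42"),
]
print(split_into_columns(row_words, delimiters))
```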

JorjMcKie · 2022-07-29T16:07:47Z

JorjMcKie
Jul 29, 2022
Maintainer

figure out what each line of code does

😎 don't ever hesitate to ask. Actually I have tried to document well ... 😒

As for the column coordinates:
One could do this programmatically. The approach would be to build clusters of the left-side x coordinates of the text spans. Whenever a cluster contains a big enough number of x coordinates, record it as a column delimiter ...
Something like that.
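
A minimal sketch of that clustering idea, under assumptions: the function name `cluster_left_edges`, the tolerance `tol`, and the threshold `min_count` are invented for the example and would need tuning for a real page. It sorts the x values, starts a new cluster whenever the gap to the previous value exceeds `tol`, and keeps only clusters with enough members.

```python
def cluster_left_edges(xs, tol=3.0, min_count=5):
    """Group nearby x coordinates and return one delimiter per big cluster.

    xs        -- left x coordinates of text spans, in points
    tol       -- maximum gap between neighbours in the same cluster (assumed)
    min_count -- how many members make a cluster "big enough" (assumed)
    """
    clusters = []
    for x in sorted(xs):
        if clusters and x - clusters[-1][-1] <= tol:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    # The leftmost member of each large cluster serves as the column start.
    return [min(c) for c in clusters if len(c) >= min_count]


# Hypothetical left edges: two well-populated columns plus two strays.
sample = [50.0, 50.4, 49.8, 50.1, 50.2,
          180.0, 180.3, 180.1, 179.9, 180.2,
          300.0, 95.0]
print(cluster_left_edges(sample))
```

Left-aligned columns produce tight clusters, so this works best for them. Right-aligned or centered columns would need the right edges or the midpoints instead.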

2 replies

joeanton719 Jul 29, 2022
Author

Could you kindly show me how to get the delimiters for columns with the method you just mentioned above?

JorjMcKie Jul 30, 2022
Maintainer

I will send you an updated version of that script.
But the basic idea is to use the advanced internal heuristics of MuPDF (PyMuPDF's C base library), which identify so-called "text spans": pieces of text with identical font properties on the same baseline.
In tables on a page, you will in most cases find that each span lives within a single table cell.
So you can walk through the text spans, recording their left (and/or right) coordinates. Then apply a good-enough algorithm that figures out the most frequent of these values ...
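
As a rough illustration of walking the spans (not the updated script itself), the sketch below reads the nested structure that `page.get_text("dict")` returns: blocks contain lines, and lines contain spans with a `bbox` and `text`. The function names are hypothetical, and counting rounded values with `Counter` is only the crudest possible stand-in for a "good-enough algorithm".

```python
from collections import Counter


def span_left_edges(page_dict):
    """Return the left x coordinate of every non-blank text span.

    `page_dict` has the shape of page.get_text("dict") in PyMuPDF.
    Image blocks carry no "lines" key and are skipped.
    """
    edges = []
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):
            for span in line["spans"]:
                if span["text"].strip():
                    edges.append(span["bbox"][0])
    return edges


def most_frequent_edges(edges, top_n=3):
    """Round edges to whole points and return the top_n most common, sorted."""
    counts = Counter(round(x, 0) for x in edges)
    return sorted(x for x, _ in counts.most_common(top_n))


# With a real page (requires PyMuPDF):
#   import fitz
#   page = fitz.open("page.pdf")[0]
#   print(most_frequent_edges(span_left_edges(page.get_text("dict"))))
```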

JorjMcKie · 2022-07-30T09:27:55Z

JorjMcKie
Jul 30, 2022
Maintainer

Here is the improved script. Hope it helps.
reformat.zip

1 reply

ColeDrain Aug 19, 2025

Thank you for this script.
I know it has been a while, but could this become a feature?

jamie-lemon · 2025-09-03T14:23:51Z

jamie-lemon
Sep 3, 2025
Maintainer

@ColeDrain If required, please continue this discussion on our forum at https://forum.pymupdf.com. We are trying to move all discussions there! :)

0 replies
