Extract Text from a PDF that has alternating colors for each row. #1842

This is possible with PyMuPDF. The approach is to

  • extract the coordinates of the visible drawings: lines and rectangles used as row shaders,
  • extract the text and map each text piece's bounding box to the right row.
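
To illustrate the two steps, here is a minimal sketch. It is not the attached script, just one way this could look. It assumes every table row is backed by a filled rectangle (if only every second row is shaded, the unshaded rows would have to be derived from the gaps between rectangles). The function names `assign_words_to_rows` and `extract_rows` are made up for this example; `page.get_drawings()` and `page.get_text("words")` are the PyMuPDF calls used.

```python
from collections import defaultdict


def assign_words_to_rows(row_rects, words):
    """Group words by the row rectangle containing their vertical midpoint.

    row_rects: list of (x0, y0, x1, y1) tuples, one per table row (assumption).
    words: tuples starting with (x0, y0, x1, y1, text), the layout
           returned by page.get_text("words").
    Returns one string per non-empty row, top to bottom.
    """
    rows = sorted(row_rects, key=lambda r: r[1])  # order rows by top edge
    grouped = defaultdict(list)
    for w in words:
        x0, y0, x1, y1, text = w[:5]
        y_mid = (y0 + y1) / 2
        for i, r in enumerate(rows):
            if r[1] <= y_mid < r[3]:
                grouped[i].append((x0, text))
                break  # words outside every rectangle are skipped
    # within a row, order the words left to right
    return [" ".join(t for _, t in sorted(grouped[i])) for i in sorted(grouped)]


def extract_rows(pdf_path, page_number=0):
    """Read row shading rectangles and words from one page, then group them."""
    import fitz  # PyMuPDF; imported here so the helper above has no dependency

    doc = fitz.open(pdf_path)
    page = doc[page_number]
    # keep only drawings that have a fill color, assumed to be row shaders
    row_rects = [
        tuple(d["rect"]) for d in page.get_drawings() if d.get("fill") is not None
    ]
    words = page.get_text("words")
    return assign_words_to_rows(row_rects, words)
```

Calling `extract_rows("input.pdf")` (a placeholder file name) would return one text line per shaded row. Splitting each row into columns is a separate problem and is not covered by this sketch.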

I noticed on Stack Overflow that you provided column delimiter coordinates. I wonder where you got them from.
In the file, the table columns are not visibly distinguishable, so they would have to be derived somehow.

Anyway, here is a script that processes your page. Please feel free to ask for explanations.
reformat.zip

Answer selected by joeanton719