Pymupdf grouping same text of different pages in different text_blocks #2899
Unanswered
vignesh0710 asked this question in Q&A
Replies: 1 comment
Did you not try to sort the text blocks before comparing them? If that is still too coarse-grained, you can try the same on a word or line level. Lines can be extracted and sorted like this:

    lines = []
    for b in page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]:
        lines.extend(b["lines"])
    # sort by bottom y coordinate, then by left x coordinate
    lines.sort(key=lambda l: (l["bbox"][3], l["bbox"][0]))
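To make the line-level idea concrete, here is a minimal sketch of how the sorted lines of two pages might be compared so that block boundaries no longer matter. It is not part of the reply above: the helper names (sorted_lines, line_text, lines_only_in_first) are made up for illustration, the use of difflib.SequenceMatcher is my own choice, and it assumes the block/line/span layout that page.get_text("dict") returns. The file names in the usage comments are placeholders.

```python
from difflib import SequenceMatcher


def sorted_lines(blocks):
    """Flatten the lines of all blocks and sort them by bottom y, then left x."""
    lines = []
    for b in blocks:
        # Image blocks carry no "lines" key, so fall back to an empty list.
        lines.extend(b.get("lines", []))
    lines.sort(key=lambda l: (l["bbox"][3], l["bbox"][0]))
    return lines


def line_text(line):
    """Join the span texts of one line into a single string."""
    return "".join(s["text"] for s in line["spans"]).strip()


def lines_only_in_first(blocks1, blocks2):
    """Return the line dicts of page 1 whose text has no match on page 2.

    Block membership is ignored: only the sorted sequence of line texts
    is compared.
    """
    lines1 = sorted_lines(blocks1)
    lines2 = sorted_lines(blocks2)
    texts1 = [line_text(l) for l in lines1]
    texts2 = [line_text(l) for l in lines2]
    matcher = SequenceMatcher(None, texts1, texts2, autojunk=False)
    changed = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "replace" and "delete" mark lines of page 1 without a counterpart.
        if tag in ("replace", "delete"):
            changed.extend(lines1[i1:i2])
    return changed


# Possible usage with PyMuPDF (file names are hypothetical):
#
#   import fitz
#   p1 = fitz.open("old.pdf")[0]
#   p2 = fitz.open("new.pdf")[0]
#   b1 = p1.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]
#   b2 = p2.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]
#   for line in lines_only_in_first(b1, b2):
#       p1.add_highlight_annot(fitz.Rect(line["bbox"]))
```

Because the comparison runs on the flattened, sorted line texts, line1, line2 and line3 match even when one page puts them in a single block and the other splits them across several. If the two pages lay the text out at different positions, the sort order may still differ, and a word-level comparison might be the better fit.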
Description of the bug
Goal: use PyMuPDF (version 1.22.1) to highlight the differences between 2 PDF pages.
I am trying to compare 2 PDF pages, p1 and p2, and highlight the differences in p1.
Algorithm:
Code:
difference pseudo_code:
This works. But in certain cases, although the contents are identical, they get grouped into different text blocks, so the comparison highlights the wrong text.
Example:
p1:
p2:
Though the same 3 consecutive lines (line1, line2, line3) are present in both pages p1 and p2, they get flagged because the blocks are different.
I also tried get_text with a line-by-line comparison, but that is not working.
Any suggestions on how to fix this would be helpful.
How to reproduce the bug
explained above
PyMuPDF version
1.23.5 or earlier
Operating system
Windows
Python version
3.8