Hi. I am very new to programming and PyMuPDF. I was wondering whether it is possible to extract underlined text next to section numbers. For example: 1.1.1 Testing – This sentence is for testing. The bolded word represents underlined text. I found some posts stating it might not be possible directly, but is it possible to search for underlined words, highlight them, and then extract those words?
First of all:
It depends on which PDF mechanism was used for the underlining. There are two alternatives:
Alternative 2 is rare and, although also possible with PyMuPDF, clearly beyond the reach of a beginner. In contrast, detecting and extracting text marked with annotations (alternative 1) is easy. Text marking annotations are of type underline, highlight, strikeout and squiggly (a zigzag underline):

    import fitz  # PyMuPDF

    # "page" is a fitz.Page, e.g. page = doc[0]
    for annot in page.annots(types=(fitz.PDF_ANNOT_HIGHLIGHT,
                                    fitz.PDF_ANNOT_UNDERLINE,
                                    fitz.PDF_ANNOT_SQUIGGLY,
                                    fitz.PDF_ANNOT_STRIKE_OUT)):
        # annot rectangle, enlarged by 2 points in every direction
        clip = annot.rect + (-2, -2, 2, 2)
        text = page.get_textbox(clip)
        print(f"this text is marked: '{text}'.")
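
One caveat with the loop above: `annot.rect` is the bounding box of the whole annotation, so a markup that wraps across two lines would also pick up unmarked text inside that box. Text markup annotations carry their exact areas in `annot.vertices`, four points per marked area. Here is a minimal sketch in plain Python that turns such a point list into one box per area. The helper name `quads_to_boxes` is made up for this example and is not part of PyMuPDF.

```python
def quads_to_boxes(vertices):
    """Group a flat list of (x, y) points into one bounding box per quad.

    vertices: points as delivered by a text markup annotation,
              4 points per marked area (assumption: length is a multiple of 4).
    Returns a list of (x0, y0, x1, y1) tuples.
    """
    if len(vertices) % 4 != 0:
        raise ValueError("expected a multiple of 4 points")
    boxes = []
    for i in range(0, len(vertices), 4):
        quad = vertices[i:i + 4]
        xs = [p[0] for p in quad]
        ys = [p[1] for p in quad]
        # bounding box of the 4 corner points, independent of their order
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```

Each resulting box could then be passed to `page.get_textbox(fitz.Rect(box))` instead of the single enlarged `annot.rect`. Treat this as a starting point, not as tested production code.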
Here is an idea you could at least start with. The example PDF is an export of a Word document made from a German Wikipedia article:

    In [1]: import fitz
    In [2]: doc = fitz.open("test.pdf")
    In [3]: page = doc[0]
    In [4]: paths = page.get_drawings()  # get drawings on the page
    In [5]: len(paths)  # see how many
    Out[5]: 9
    In [6]: # subselect things we may regard as lines
    In [7]: lines = []
       ...: for p in paths:
       ...:     for item in p["items"]:
       ...:         if item[0] == "l":  # an actual line: ("l", p1, p2)
       ...:             p1, p2 = item[1:]
       ...:             if p1.y == p2.y:  # keep horizontal lines only
       ...:                 lines.append((p1, p2))
       ...:         elif item[0] == "re":  # a rectangle: check if height is small
       ...:             r = item[1]
       ...:             if r.width > r.height and r.height <= 2:
       ...:                 lines.append((r.tl, r.tr))  # take top left / right points
    In [8]: len(lines)  # confirm we got everything
    Out[8]: 9
    In [9]: # example:
    In [10]: lines[0]
    Out[10]:
    (Point(336.9100036621094, 98.300048828125),
     Point(373.6300048828125, 98.300048828125))
    In [11]: # make a list of words
    In [12]: words = page.get_text("words", sort=True)
    In [13]: # if underlined, the bottom left / right of a word
    In [14]: # should not be too far away from left / right end of some line:
    In [15]: for w in words:  # w[4] is the actual word string
        ...:     r = fitz.Rect(w[:4])  # first 4 items are the word bbox
        ...:     for p1, p2 in lines:  # check distances for start / end points
        ...:         if abs(r.bl - p1) <= 4 and abs(r.br - p2) <= 4:
        ...:             print(f"Word '{w[4]}' is underlined!")
        ...:             break  # don't check more lines
    Word 'Familie' is underlined!
    Word 'Delfine' is underlined!
    Word 'DNA-Analysen' is underlined!
    Word 'Unterarten' is underlined!
    Word 'Bartenwale' is underlined!
    Word 'Spitzenprädatoren' is underlined!
    Word 'Fressfeinde' is underlined!
    Word 'Walfang' is underlined!
    Word 'Delfinarien.' is underlined!

Eureka! Be a little generous with distance checking here: e.g. the word extracted is 'Delfinarien.' (including the dot), but the underlining does not include the dot ... and more of that sort.
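
If the endpoint distances turn out to be too fragile (trailing punctuation, or one line drawn under several words), the same check can be phrased as "how much of the word does the line cover". The following is only a sketch in plain Python with tuples, not PyMuPDF API: the name `is_underlined` and the thresholds `min_cover` and `max_gap` are assumptions made up for this example and will need tuning per document.

```python
def is_underlined(word_bbox, line, min_cover=0.8, max_gap=4.0):
    """Return True if a horizontal line plausibly underlines a word.

    word_bbox: (x0, y0, x1, y1), y growing downwards as in PyMuPDF.
    line:      ((x0, y), (x1, y)), the end points of a horizontal line.
    min_cover: hypothetical minimum fraction of the word width under the line.
    max_gap:   hypothetical maximum vertical distance to the word's bottom edge.
    """
    wx0, wy0, wx1, wy1 = word_bbox
    (lx0, ly), (lx1, _) = line
    lx0, lx1 = min(lx0, lx1), max(lx0, lx1)  # tolerate right-to-left lines
    width = wx1 - wx0
    if width <= 0:
        return False
    # the line must sit close to the bottom edge of the word box
    if abs(ly - wy1) > max_gap:
        return False
    # length of the word's horizontal extent that lies above the line
    covered = min(wx1, lx1) - max(wx0, lx0)
    return covered / width >= min_cover
```

With the objects from the session above, something like `is_underlined(w[:4], (tuple(p1), tuple(p2)))` should be a way to call it. A word such as 'Delfinarien.' still passes when the line stops before the dot, as long as most of the word is covered.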
Another idea:
If, however, we have regular text and lines:

    drawn_lines = [...]  # your identified lines, as (p1, p2) point pairs
    blocks = page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]
    max_lineheight = 0
    for b in blocks:
        for l in b["lines"]:
            bbox = fitz.Rect(l["bbox"])
            if bbox.height > max_lineheight:
                max_lineheight = bbox.height
    # we now have the max lineheight on this page
    for p1, p2 in drawn_lines:
        # the rectangle "above" a drawn line
        rect = fitz.Rect(p1.x, p1.y - max_lineheight, p2.x, p2.y)
        text = page.get_textbox(rect)
        print(f"Underlined: '{text}'.")
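
To get back to the original question (underlined text next to section numbers such as "1.1.1 Testing"): once the underlined strings are known, they can be matched against the text lines of the page. This is a rough sketch using only the standard library. The name `pair_sections` and the regular expression are assumptions for illustration, and the pattern only recognises dotted numbers at the start of a line.

```python
import re

# a dotted section number at the start of a line, followed by the rest
SECTION_RE = re.compile(r"^\s*(\d+(?:\.\d+)*)\s+(.*)$")

def pair_sections(text_lines, underlined):
    """Map section numbers to the underlined text directly following them.

    text_lines: the page text, one string per line.
    underlined: strings that were detected as underlined.
    """
    result = {}
    for line in text_lines:
        m = SECTION_RE.match(line)
        if not m:
            continue  # line does not start with a section number
        number, rest = m.groups()
        for phrase in underlined:
            phrase = phrase.strip()
            if phrase and rest.startswith(phrase):
                result[number] = phrase
                break  # first matching phrase wins
    return result

lines = ["1.1.1 Testing – This sentence is for testing.", "Some other text"]
print(pair_sections(lines, ["Testing"]))
```

The line list could come from `page.get_text().splitlines()`. Lines that merely start with a number (a year, for instance) are only reported if an underlined phrase follows the number, but expect to adjust the pattern to your numbering scheme.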