Extracting underline texts from PDF #1756

jhines84 · 2022-06-16T17:36:51Z

jhines84
Jun 16, 2022

Hi. I am very new to programming and PyMuPDF. I was wondering if it is possible to extract underlined text next to section numbers. For example:

1.1.1 Testing – This sentence is for testing.

The bolded word ("Testing" in the original post) represents underlined text.

I found some posts stating it might not be possible directly, but is it possible to search for underlined words, highlight them, and then extract those words?

Answered by JorjMcKie

Jun 20, 2022


JorjMcKie · 2022-06-16T17:57:52Z

JorjMcKie
Jun 16, 2022
Maintainer

First of all:
This is not an issue, but a question that should be asked in tab "Discussions".
So let me move this post to there first.

1 reply

jhines84 Jun 16, 2022
Author

Sorry about that. Brand new here lol

JorjMcKie · 2022-06-16T18:11:53Z

JorjMcKie
Jun 16, 2022
Maintainer

It depends on which PDF mechanism was used for the underlining. There are two alternatives:

1. The PDF author created an underline annotation.
2. The PDF author used PDF drawing facilities and directly drew a line underneath that text.

Alternative 2 is rare and, although also possible with PyMuPDF, clearly beyond the reach of a beginner.

In contrast to this, detecting and extracting text marked with annotations (alternative 1) is easy. Text marking annotations are of type underline, highlight, strikeout and squiggle (a zigzag underline):

for annot in page.annots(types=(fitz.PDF_ANNOT_HIGHLIGHT,
          fitz.PDF_ANNOT_UNDERLINE,
          fitz.PDF_ANNOT_SQUIGGLY,
          fitz.PDF_ANNOT_STRIKE_OUT)):
    clip = annot.rect + (-2, -2, 2, 2)  # annot rectangle ... enlarged by 2 points in every direction
    text = page.get_textbox(clip)
    print(f"this text is marked: '{text}'.")

5 replies

jhines84 Jun 16, 2022
Author

Hmm. So I made a test Word file and converted it to a PDF. I ran the code in Python, but there is no output other than the standard extracted text.

doc = fitz.open('Testing.pdf')
for page in doc:
    text = page.get_text("text")
    for annot in page.annots(types=(fitz.PDF_ANNOT_HIGHLIGHT,
          fitz.PDF_ANNOT_UNDERLINE,
          fitz.PDF_ANNOT_SQUIGGLY,
          fitz.PDF_ANNOT_STRIKE_OUT)):
        clip = annot.rect + (-2, -2, 2, 2)  # annot rectangle ... enlarged by 2 points in every direction
        text1 = page.get_textbox(clip)
        print(f"this text is marked: '{text1}'.")
    print(text)

JorjMcKie Jun 16, 2022
Maintainer

To be expected. Underlining with Word yields the type 2 underlining described above.
Forget that case.

jhines84 Jun 16, 2022
Author

Is there a solution to type 2? I am extracting text that has different section titles that are underlined as shown above. I would like to pull those section titles out and put them in a different column. What would be an example of case 1? Like tables? Do you have an alternative for extracting those section titles?

JorjMcKie Jun 16, 2022
Maintainer

I gave you an extraction example for case 1.
To make such a PDF example use page.add_underline_annot(<some rectangle>). Details in the documentation.

If you want to go the tedious path, look up documentation for page.get_drawings().
Word underlining works by drawing thin rectangles. So you would have to inspect the page's drawings, look for those thin rectangles and then walk through the page's text and see if word bboxes are right above such rectangles.
Possible, but tedious and error-prone as I said.

jhines84 Jun 16, 2022
Author

Thanks for the help! Will look into that method. I really don't have another solution to extract those titles from the PDF.

JorjMcKie · 2022-06-17T07:38:54Z

JorjMcKie
Jun 17, 2022
Maintainer

Here is an idea you could at least start with. Example PDF: an export from a Word document made from a Wikipedia article in German:
test.pdf
There are 9 underlined words. They are also in a different color and detectable as links ... but let's ignore this here.

In [1]: import fitz
In [2]: doc = fitz.open("test.pdf")
In [3]: page = doc[0]
In [4]: paths = page.get_drawings()  # get drawings on the page
In [5]: len(paths)  # see how many
Out[5]: 9
In [6]: # subselect things we may regard as lines
In [7]: lines = []
   ...: for p in paths:
   ...:     for item in p["items"]:
   ...:         if item[0] == "l":  # an actual line
   ...:             p1, p2 = item[1:]  # line items are ("l", p1, p2)
   ...:             if p1.y == p2.y:
   ...:                 lines.append((p1, p2))
   ...:         elif item[0] == "re":  # a rectangle: check if height is small
   ...:             r = item[1]
   ...:             if r.width > r.height and r.height <= 2:
   ...:                 lines.append((r.tl, r.tr))  # take top left / right points
In [8]: len(lines)  # confirm we got everything
Out[8]: 9
In [9]: # example:
In [10]: lines[0]
Out[10]:
(Point(336.9100036621094, 98.300048828125),
 Point(373.6300048828125, 98.300048828125))
In [11]: # make a list of words
In [12]: words = page.get_text("words", sort=True)
In [13]: # if underlined, the bottom left / right of a word
In [14]: # should not be too far away from left / right end of some line:
In [15]: for w in words:  # w[4] is the actual word string
    ...:     r = fitz.Rect(w[:4])  # first 4 items are the word bbox
    ...:     for p1, p2 in lines:  # check distances for start / end points
    ...:         if abs(r.bl - p1) <= 4 and abs(r.br - p2) <= 4:
    ...:             print(f"Word '{w[4]}' is underlined!")
    ...:             break  # don't check more lines
Word 'Familie' is underlined!
Word 'Delfine' is underlined!
Word 'DNA-Analysen' is underlined!
Word 'Unterarten' is underlined!
Word 'Bartenwale' is underlined!
Word 'Spitzenprädatoren' is underlined!
Word 'Fressfeinde' is underlined!
Word 'Walfang' is underlined!
Word 'Delfinarien.' is underlined!
In [16]: # heureka!

Be a little generous with distance checking here: E.g. the word extracted is 'Delfinarien.' (including the dot), but the underlining does not include the dot ... and more of that sort.
Underlines of multiple words (spanning the spaces in between) are the next hurdle, then you may also have to deal with text hyphenated across several lines, etc.
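
One way to be "generous" is to stop comparing end points and instead ask how much of the word's width the line covers. This is a swapped-in alternative to the end point distance check above, not the method used in the session. It is only a sketch: the function name and both thresholds (`min_cover`, `max_gap`) are assumptions to be tuned per document, and it works on plain tuples so it can be tried without a PDF.

```python
def is_underlined(word_bbox, line, min_cover=0.8, max_gap=4.0):
    """Return True if a horizontal line runs just below most of the word.

    word_bbox: (x0, y0, x1, y1), e.g. the first four items of a "words" tuple
    line:      ((x0, y), (x1, y)) end points of a horizontal line
    min_cover, max_gap: hypothetical thresholds, not PyMuPDF settings
    """
    wx0, wy0, wx1, wy1 = word_bbox
    (lx0, ly), (lx1, _) = line
    lx0, lx1 = min(lx0, lx1), max(lx0, lx1)
    width = wx1 - wx0
    if width <= 0:
        return False
    # vertical test: the line must sit close to the bottom edge of the word box
    if abs(ly - wy1) > max_gap:
        return False
    # horizontal test: share of the word width that the line covers
    overlap = min(wx1, lx1) - max(wx0, lx0)
    return overlap / width >= min_cover
```

With PyMuPDF objects this could be called as `is_underlined(w[:4], ((p1.x, p1.y), (p2.x, p2.y)))`. A trailing dot that is not underlined only costs a few percent of coverage, so a word like 'Delfinarien.' would still pass.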

6 replies

JorjMcKie Jun 20, 2022
Maintainer

"Is it because there isn't a space between the two words?"

Sure. "words" is defined as strings not interrupted by any space.
If you have spaces in between words that are jointly underlined, look at a sequence of neighboring words (those with the same bottom, w[3]) and check if a line starts at the first word and ends at some other word further to the right.

I mentioned it is not trivial, didn't I?
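
A rough sketch of that neighboring-words idea, under a few assumptions: words are tuples laid out like the items of `page.get_text("words")`, lines are plain `((x0, y), (x1, y))` tuples, words on one text line share the same bottom after rounding, and `tol` is a made-up tolerance in points. The function name is hypothetical.

```python
def underlined_phrases(words, lines, tol=4.0):
    """Join neighboring words on one text line that sit above the same line.

    words: tuples (x0, y0, x1, y1, text, ...) as in page.get_text("words")
    lines: list of ((x0, y), (x1, y)) end points of horizontal lines
    tol:   hypothetical tolerance in points
    """
    # group words by rounded bottom coordinate, i.e. by text line
    rows = {}
    for w in words:
        rows.setdefault(round(w[3]), []).append(w)
    phrases = []
    for bottom in sorted(rows):
        row = sorted(rows[bottom], key=lambda w: w[0])  # left to right
        for (lx0, ly), (lx1, _) in lines:
            if abs(ly - bottom) > tol:
                continue  # this line belongs to another text line
            lx0, lx1 = min(lx0, lx1), max(lx0, lx1)
            # keep the words whose box lies within the line's horizontal extent
            hit = [w[4] for w in row if w[0] >= lx0 - tol and w[2] <= lx1 + tol]
            if hit:
                phrases.append(" ".join(hit))
    return phrases
```

Rounding the bottom is a shortcut; text with mixed font sizes on one line would need a more forgiving grouping, and hyphenation across lines is not handled at all.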

jhines84 Jun 20, 2022
Author

You did! lol This is definitely beyond the scope for a beginner programmer. This is the file I was able to get to work, but it is missing spaces and hyphens.
Testing.pdf

jhines84 Jun 20, 2022
Author

Can you point me to the documentation on how I can do that? Or something similar?

JorjMcKie Jun 20, 2022
Maintainer

Do what?
Did you read my post further down?

jhines84 Jun 21, 2022
Author

"If you have spaces in between words that are jointly underlined, look at a sequence of neighboring words (have the same bottom, w[3]) and check if a line starts at first word and ends at some other further to the right."

JorjMcKie · 2022-06-20T21:04:32Z

JorjMcKie
Jun 20, 2022
Maintainer

Another idea:
If you have the typical or maximum line height on the page, blow each of your line drawings up to a rectangle above it and make temporary underline annots for each of them on the page.
Then extract the text from within these annots ...

5 replies

jhines84 Jun 20, 2022
Author

I'm not sure if that works because I have various PDF files, ranging from Word conversions to tilted scanned images from typewriter days.

JorjMcKie Jun 20, 2022
Maintainer

What has that to do with underlining?
Apart from the fact, that underlines are not even OCRed and simply not there!

jhines84 Jun 20, 2022
Author

I was thinking that the line height will be different due to the scanned images. Would you have to build a different line height case for each type of pdf since they all have different line height?

JorjMcKie Jun 20, 2022
Maintainer

No, you misunderstood completely:
If you OCR an image, the result will not contain any lines, only text.
And what is not there, can not be found 😒.

jhines84 Jun 20, 2022
Author

OMG...LOL. My nightmare never ends lol

JorjMcKie · 2022-06-20T22:32:36Z

JorjMcKie
Jun 20, 2022
Maintainer

If, however, we have regular text and lines:

drawn_lines = [...]  # placeholder: put the list of (p1, p2) lines you identified before here
blocks = page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]
max_lineheight = 0
for b in blocks:
    for l in b["lines"]:
        bbox = fitz.Rect(l["bbox"])
        if bbox.height > max_lineheight:
            max_lineheight = bbox.height
# we now have the max lineheight on this page
for p1, p2 in drawn_lines:
    rect = fitz.Rect(p1.x, p1.y - max_lineheight, p2.x, p2.y)  # the rectangle "above" a drawn line
    text = page.get_textbox(rect)
    print(f"Underlined: '{text}'.")
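
The geometry of that rectangle can be tried out on plain tuples before touching a PDF. The sketch below is only an imitation: both function names are hypothetical, and the rule "a word counts if its centre lies inside the rectangle" is an assumption made here, not a description of how `page.get_textbox()` selects text.

```python
def rect_above(line, height):
    """Rectangle (x0, y0, x1, y1) of the given height sitting on top of a line."""
    (x0, y), (x1, _) = line
    return (min(x0, x1), y - height, max(x0, x1), y)


def words_in_rect(words, rect):
    """Texts of the words whose centre falls inside rect, in reading order."""
    rx0, ry0, rx1, ry1 = rect
    hits = []
    for x0, y0, x1, y1, text, *_ in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2  # centre of the word box
        if rx0 <= cx <= rx1 and ry0 <= cy <= ry1:
            hits.append((y0, x0, text))
    return [text for _, _, text in sorted(hits)]
```

Using the maximum line height of the page keeps the rectangle tall enough for the largest font, at the price of possibly catching text from the line above when smaller fonts are underlined.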

3 replies

jhines84 Jun 21, 2022
Author

getting this error when I tried it on

Testing.pdf

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-17-1833d6c4f862> in <module>
     70             max_lineheight = bbox.height
     71 # we now have the max lineheight on this page
---> 72 for p1, p2 in drawn_lines:
     73     rect = fitz.Rect(p1.x, p1.y - max_lineheight, p2.x, p2.y) # the rectangle "above" a drawn line
     74     text = page.get_textbox(rect)

TypeError: cannot unpack non-iterable ellipsis object

JorjMcKie Jun 21, 2022
Maintainer

Oh dear 😶 this was just to indicate the list of lines you were creating before!

jhines84 Jun 21, 2022
Author

Oh...I had assumed you wanted me to incorporate into your original code

import fitz
doc = fitz.open("Testing.pdf")
page = doc[0]
paths = page.get_drawings()  # get drawings on the page
print(len(paths))  # see how many
#Out[5]: 9

drawn_lines=[...]  # your identified lines
blocks=page.get_text("dict",flags=fitz.TEXTFLAGS_TEXT)["blocks"]
max_lineheight=0
for b in blocks:
    for l in b["lines"]:
        bbox=fitz.Rect(l["bbox"])
        if bbox.height > max_lineheight:
            max_lineheight = bbox.height

# make a list of words
words = page.get_text("words", sort=True)
#print(words)
# if underlined, the bottom left / right of a word
# should not be too far away from left / right end of some line:
for w in words:  # w[4] is the actual word string
    r = fitz.Rect(w[:4])   # first 4 items are the word bbox
    print(r)
    for p1, p2 in drawn_lines:
        rect = fitz.Rect(p1.x, p1.y - max_lineheight, p2.x, p2.y) # the rectangle "above" a drawn line
        text = page.get_textbox(rect)
        print(f"Underlined: '{text}'.")
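
The `[...]` in `drawn_lines=[...]` is still the placeholder here; it stands for the list that the earlier session built from `page.get_drawings()`. A sketch of that step as a function, assuming line items look like `("l", p1, p2)` and rectangle items like `("re", rect, ...)`; the function name and the `max_height` parameter are made up for illustration.

```python
def collect_underlines(paths, max_height=2.0):
    """Return (p1, p2) point pairs for horizontal lines and thin rectangles.

    paths: list of dicts shaped like the result of page.get_drawings()
    max_height: hypothetical limit (in points) for a rectangle to count as a line
    """
    lines = []
    for path in paths:
        for item in path["items"]:
            if item[0] == "l":  # an actual line
                p1, p2 = item[1], item[2]
                if p1.y == p2.y:  # keep horizontal lines only
                    lines.append((p1, p2))
            elif item[0] == "re":  # a rectangle: accept it if wide and flat
                r = item[1]
                if r.width > r.height and r.height <= max_height:
                    lines.append((r.tl, r.tr))  # top-left / top-right points
    return lines
```

With this in place, `drawn_lines = collect_underlines(page.get_drawings())` would replace the placeholder line, and the `for p1, p2 in drawn_lines:` loop would no longer need to sit inside the loop over words.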
