How to remove the image-with-text from the PDF #1393

SurinameClubcard · 2024-09-08T12:48:40Z

Hi,

I'm trying to OCR an old PDF and OCRmyPDF is actually doing a great job.

But the next step in my workflow is to use Google Translate to translate it from English to Dutch. The result looks like this:

The processed image text from the original PDF is not removed, which makes sense (how would Google know?).

Is there an option in OCRmyPDF to remove from the PDF the image-with-text that produced the OCR content? I do not want to remove all images; the PDF also contains pictures that should be kept.

Regards!

0xE1 · 2024-11-02T13:48:25Z

This is something I'm looking for as well. Essentially, I need a way to remove the portion of the background image where some kind of text was recognized.

jbarlow83 · 2024-11-03T22:49:32Z

You can use Ghostscript to regenerate the PDF, suppressing images:

gs -q -sDEVICE=pdfwrite -dFILTERIMAGE -o out.pdf in.pdf

As you say, this removes all images.
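For anyone scripting this step, here is a minimal Python sketch that wraps the Ghostscript call. The function names `build_gs_filter_command` and `strip_images` are made up for illustration and are not part of OCRmyPDF. The sketch assumes a `gs` binary on the PATH and passes `-sDEVICE=pdfwrite` explicitly, since Ghostscript needs the pdfwrite device selected to produce a PDF.

```python
import shutil
import subprocess


def build_gs_filter_command(src, dst, gs_binary="gs"):
    """Build the Ghostscript argument list that rewrites a PDF without images.

    Hypothetical helper for illustration; only the Ghostscript flags are real.
    """
    return [
        gs_binary,
        "-q",                  # suppress startup messages
        "-sDEVICE=pdfwrite",   # write the result as a PDF
        "-dFILTERIMAGE",       # drop every image, including wanted pictures
        "-o", str(dst),        # output file (implies batch mode, no pause)
        str(src),
    ]


def strip_images(src, dst, gs_binary="gs"):
    """Run Ghostscript to regenerate src as dst with all images removed."""
    if shutil.which(gs_binary) is None:
        raise RuntimeError(f"{gs_binary!r} was not found on PATH")
    subprocess.run(build_gs_filter_command(src, dst, gs_binary), check=True)
```

Calling `strip_images("in.pdf", "out.pdf")` should leave the OCR text layer in place while dropping every image, so it does not solve the keep-some-pictures case.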

Correlating the image that produced OCR text to an image in a document is difficult -- we render all content on each page as a whole image and send it for OCR. Intelligently removing the text from that image is even more difficult. I expect it would be easier to use some sort of commercial OCR that can reconstruct the document as, say, a Word document, and then perform the translation there.

ocrmypdf locked and limited conversation to collaborators Nov 3, 2024

jbarlow83 converted this issue into discussion #1418 Nov 3, 2024


This issue was moved to a discussion.
