This issue was moved to a discussion.

How to remove the image-with-text from the PDF #1393

Closed
SurinameClubcard opened this issue Sep 8, 2024 · 2 comments

Comments

@SurinameClubcard

Hi,

I'm trying to OCR an old PDF, and OCRmyPDF is doing a great job.

But the next step in my workflow is to use Google Translate to translate the document from English to Dutch. The result looks like this:

[image: screenshot of the translated result]

The processed image text from the original PDF is not removed, which makes sense (how would Google know?).

Is there an OCRmyPDF option to remove the image-with-text that produced the OCR content from the PDF? I do not want to remove all images; the PDF also contains pictures that should be kept.

Regards!

@0xE1

0xE1 commented Nov 2, 2024

This is something I'm looking for as well. Essentially, I need a way to remove the portion of the background image where text was recognized.

@jbarlow83
Collaborator

You can use Ghostscript to regenerate the PDF, suppressing images:

gs -q -sDEVICE=pdfwrite -dFILTERIMAGE -o out.pdf in.pdf

As you say, this removes all images.
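
For anyone scripting this step, here is a minimal Python sketch of the same Ghostscript call. The helper names `build_gs_command` and `strip_images` are hypothetical, and the sketch assumes Ghostscript is installed as `gs` and that `-sDEVICE=pdfwrite` is passed so the output is written as a PDF. It is not part of OCRmyPDF, and like the command above it drops every image, including the pictures you want to keep.

```python
import shutil
import subprocess
from pathlib import Path


def build_gs_command(src, dst, gs="gs"):
    """Build the Ghostscript argument list that rewrites a PDF without images.

    Hypothetical helper for illustration. -dFILTERIMAGE suppresses all raster
    images; -sDEVICE=pdfwrite is assumed here so that the output is a PDF.
    """
    return [
        gs,
        "-q",                 # quiet: suppress startup messages
        "-sDEVICE=pdfwrite",  # write the result as a PDF
        "-dFILTERIMAGE",      # drop every image on every page
        "-o", str(dst),       # output file (implies batch mode, no pause)
        str(src),
    ]


def strip_images(src, dst):
    """Run Ghostscript to regenerate `src` as `dst` with all images removed."""
    gs = shutil.which("gs")
    if gs is None:
        raise FileNotFoundError("Ghostscript ('gs') was not found on PATH")
    subprocess.run(build_gs_command(src, dst, gs), check=True)
    return Path(dst)
```

Calling `strip_images("in.pdf", "out.pdf")` would run the equivalent of the command line above. Keeping the argument list in its own function makes it easy to inspect or log the exact command before running it.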

Correlating the image that produced OCR text with an image in the document is difficult: we render all content on each page as a single image and send that for OCR. Intelligently removing the text from that image is even more difficult. I expect it would be easier to use a commercial OCR product that can reconstruct the document as, say, a Word document, and then perform the translation there.

@ocrmypdf ocrmypdf locked and limited conversation to collaborators Nov 3, 2024
@jbarlow83 jbarlow83 converted this issue into discussion #1418 Nov 3, 2024
