I'm trying to OCR an old PDF and OCRmyPDF is actually doing a great job.
But the next step in my workflow is to use Google Translate to translate it from English to Dutch. The result looks like this:
The processed image text from the original PDF is not removed, which makes sense (how would Google know?).
Is there an option in OCRmyPDF to actually remove the image-with-text that produced the OCR content from the PDF? I do not want to remove all images; the PDF also contains pictures that should be kept.
Regards!
You can use Ghostscript to regenerate the PDF, suppressing images:
gs -q -sDEVICE=pdfwrite -dFILTERIMAGE -o out.pdf in.pdf
As you say, this removes all images.
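If you want to run that step from a script, something like the following might work. It is only a sketch: the function names `build_gs_command` and `strip_images` are made up for illustration, it assumes `gs` is on your PATH, and it passes `-sDEVICE=pdfwrite` explicitly on the assumption that Ghostscript's default output device is not a PDF writer. It still removes every image, including the pictures you want to keep.

```python
import shutil
import subprocess


def build_gs_command(src, dst, gs_binary="gs"):
    """Return the Ghostscript argument list that rewrites src to dst without images.

    Hypothetical helper: it only assembles the arguments and runs nothing.
    """
    return [
        gs_binary,
        "-q",                 # suppress the startup banner
        "-sDEVICE=pdfwrite",  # write a PDF (assumed not to be the default device)
        "-dFILTERIMAGE",      # drop all images from the output
        "-o", dst,            # output file
        src,                  # input file
    ]


def strip_images(src, dst):
    """Run Ghostscript to produce an image-free copy of src at dst."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript (gs) was not found on PATH")
    subprocess.run(build_gs_command(src, dst), check=True)
```

Calling `strip_images("in.pdf", "out.pdf")` writes a new file and leaves the input alone, so you can compare the two before translating.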
Correlating the image that produced OCR text to an image in a document is difficult: we render all content on each page as a whole image and send it for OCR. Intelligently removing the text from that image is even more difficult. I expect it would be easier to use some sort of commercial OCR that can reconstruct the document as, say, a Word document, and then perform the translation there.