I'm trying to OCR scanned books while preserving the small size of the input file. To do so I use the --output-type pdf option. However, the file size increases by about 40% even without image recompression.
Moreover, the size increases even further after a second pass, despite the --redo-ocr flag.
My current version is 16.1.1, installed on Arch Linux from the AUR.
In a previous version (16.0.4 or so) I did not notice such an increase in the file size.
I observe this problem with various files that are already highly compressed. Part of one such book is attached below as an example.
Steps to reproduce
ocrmypdf --output-type pdf --redo-ocr -v1 Watson1.pdf Watson2.pdf
ocrmypdf --output-type pdf --redo-ocr -v1 Watson2.pdf Watson3.pdf
For the given small part of the book the file sizes are:
251 KB → 349 KB → 447 KB
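To repeat this measurement on other files, the two passes can be scripted. The following is only a minimal sketch, assuming `ocrmypdf` is on the PATH; the helper names `build_pass_command` and `run_passes` are made up for illustration and are not part of OCRmyPDF. The flags are the same ones used in the commands above.

```python
import os
import subprocess


def build_pass_command(src, dst, extra_args=()):
    """Return the ocrmypdf command line for one pass (flags as in the report)."""
    cmd = ["ocrmypdf", "--output-type", "pdf", "--redo-ocr", "-v1"]
    cmd += list(extra_args)  # optional extra flags supplied by the caller
    return cmd + [src, dst]


def run_passes(first_input, outputs, extra_args=()):
    """Run successive passes and return the size in bytes of every file."""
    sizes = [os.path.getsize(first_input)]
    src = first_input
    for dst in outputs:
        # Each pass takes the previous pass's output as its input.
        subprocess.run(build_pass_command(src, dst, extra_args), check=True)
        sizes.append(os.path.getsize(dst))
        src = dst
    return sizes
```

Calling `run_passes("Watson1.pdf", ["Watson2.pdf", "Watson3.pdf"])` would run both passes in order and return the three file sizes.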
It seems to me that this is related to the hocr PDF renderer, which is now enabled by default. It produces better visual quality (see e.g. #1131); however, it makes the OCR layer almost twice as large.
With the option --pdf-renderer sandwich I obtain the following sizes for the same file:
251 KB → 310 KB → 369 KB
So the OCR layer adds 59 KB per pass with sandwich and 98 KB per pass with hocr.
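The per-pass growth can be checked directly from the reported sizes. This short sketch uses only the numbers quoted in this report; the function name `per_pass_growth` is illustrative.

```python
# Sizes in KB as reported: input, after pass 1, after pass 2.
sizes_kb = {
    "hocr": [251, 349, 447],
    "sandwich": [251, 310, 369],
}


def per_pass_growth(sizes):
    """Return the size added by each pass (difference of consecutive sizes)."""
    return [after - before for before, after in zip(sizes, sizes[1:])]


growth = {name: per_pass_growth(s) for name, s in sizes_kb.items()}
for name, deltas in growth.items():
    print(name, deltas)
# hocr [98, 98]
# sandwich [59, 59]
```

Each renderer adds the same amount on the second pass as on the first. That would be consistent with the earlier text layer being kept rather than replaced, although the sizes alone do not prove it.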
So the questions are:
Is it possible to optimize the hocr renderer?
Is it possible to remove a previously added OCR layer without recompressing the images?
(--force-ocr is not suitable for this task.)
I wrote a script that redoes the OCR on PDFs by deleting any original text from the file and then using ocrmypdf to generate new OCR, which I then add to the original file. I use it mainly to replace the often poor OCR in JSTOR files. It uses Ghostscript to remove the text and relies on some other tools that you can see in the code.
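The script itself is not reproduced here, so the following is only a guess at what the text-removal step might look like, not the commenter's actual code. It assumes Ghostscript is installed as `gs` and uses its `-dFILTERTEXT` switch with the `pdfwrite` device; the function names are hypothetical.

```python
import subprocess


def build_strip_text_command(src, dst):
    """Return a Ghostscript command that rewrites src to dst without text."""
    # -dFILTERTEXT asks Ghostscript to drop text objects while rewriting.
    # Assumption: this is one way to do it, not necessarily the script's way.
    return ["gs", "-o", dst, "-sDEVICE=pdfwrite", "-dFILTERTEXT", src]


def strip_text(src, dst):
    """Run Ghostscript to write a copy of src with its text removed."""
    subprocess.run(build_strip_text_command(src, dst), check=True)
```

Note that `pdfwrite` rewrites the whole file and may re-encode images depending on its settings, so it is worth comparing file sizes before and after.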
Files
Here is the part of one book.
Watson1.pdf
Watson2.pdf
Watson3.pdf
How did you download and install the software?
No response
OCRmyPDF version
16.1.1
Relevant log output