[Bug]: The file size increases significantly by OCR even without image recompression #1278

ybeltukov · 2024-03-17T01:40:58Z

Describe the bug

I'm trying to OCR scanned books while preserving the small size of the input file. To do so, I use the --output-type pdf option. However, the file size increases by about 40% even without image recompression.

Moreover, the size increases further after a second pass, despite the --redo-ocr flag.

My current version is 16.1.1, installed on Arch Linux from the AUR.
In a previous version (16.0.4 or so) I did not notice such an increase in file size.

I observe this problem for various files with sufficiently high compression. A part of one such book is attached below as an example.

Steps to reproduce

ocrmypdf --output-type pdf --redo-ocr -v1 Watson1.pdf Watson2.pdf
ocrmypdf --output-type pdf --redo-ocr -v1 Watson2.pdf Watson3.pdf

For the given small part of the book the file sizes are:
251 KB → 349 KB → 447 KB
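
The two passes above can also be scripted so that the sizes are measured the same way each time. This is a minimal sketch, not part of the original report: the helper names (build_command, size_kb, run_chain) are hypothetical, only the ocrmypdf flags shown in the steps above are used, and reporting 1 KB as 1000 bytes is an assumption.

```python
import subprocess
from pathlib import Path


def build_command(src, dst):
    """Build the ocrmypdf command line used in the steps to reproduce."""
    return ["ocrmypdf", "--output-type", "pdf", "--redo-ocr", "-v1",
            str(src), str(dst)]


def size_kb(path):
    """File size in KB (assumption: 1 KB = 1000 bytes)."""
    return Path(path).stat().st_size / 1000


def run_chain(files):
    """Run ocrmypdf from each file to the next and return all sizes in KB."""
    sizes = [size_kb(files[0])]
    for src, dst in zip(files, files[1:]):
        # check=True raises CalledProcessError if ocrmypdf exits non-zero
        subprocess.run(build_command(src, dst), check=True)
        sizes.append(size_kb(dst))
    return sizes


# Example usage (requires ocrmypdf on PATH and Watson1.pdf in the
# current directory):
# sizes = run_chain(["Watson1.pdf", "Watson2.pdf", "Watson3.pdf"])
# print(" -> ".join(f"{s:.0f} KB" for s in sizes))
```

Keeping the command in one helper makes it easy to rerun the same chain with a different set of options and compare the resulting sizes.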

Files

Here is the part of one book.
Watson1.pdf
Watson2.pdf
Watson3.pdf

How did you download and install the software?

No response

OCRmyPDF version

16.1.1

Relevant log output

First pass: https://pastebin.com/UjuiU3E7
Second pass: https://pastebin.com/bWgs185r


ybeltukov · 2024-03-17T10:31:29Z

It seems to me that this is related to the hocr PDF renderer, which is now enabled by default. It produces better visual quality (see e.g. #1131), but it nearly doubles the size of the OCR layer.

With the option --pdf-renderer sandwich, I obtain the following sizes for the same file:
251 KB → 310 KB → 369 KB

So each pass adds an OCR layer of about 59 KB with sandwich and about 98 KB with hocr.
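
The per-pass figures follow directly from the reported sizes. As a quick check, here is a small sketch of the arithmetic; the function name per_pass_growth is hypothetical, and the inputs are the rounded KB values quoted in this thread, so the results are approximate.

```python
def per_pass_growth(sizes_kb):
    """Difference between consecutive file sizes, i.e. KB added per pass."""
    return [after - before for before, after in zip(sizes_kb, sizes_kb[1:])]


# Sizes reported in this thread (KB), original file first
hocr_sizes = [251, 349, 447]
sandwich_sizes = [251, 310, 369]

hocr_growth = per_pass_growth(hocr_sizes)          # KB added by each hocr pass
sandwich_growth = per_pass_growth(sandwich_sizes)  # KB added by each sandwich pass

# Size of one hocr layer relative to one sandwich layer
layer_ratio = hocr_growth[0] / sandwich_growth[0]

# First-pass growth as a percentage of the original file size
hocr_percent = 100 * hocr_growth[0] / hocr_sizes[0]
sandwich_percent = 100 * sandwich_growth[0] / sandwich_sizes[0]

print(hocr_growth, sandwich_growth)
print(f"hocr/sandwich layer ratio: {layer_ratio:.2f}")
print(f"first pass: hocr +{hocr_percent:.0f}%, sandwich +{sandwich_percent:.0f}%")
```

The growth is the same on the second pass as on the first for both renderers, which is consistent with a new OCR layer being added each time rather than the previous one being replaced.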

So the questions are:

Is it possible to optimize the hocr renderer?
Is it possible to remove a previously added OCR layer without image recompression?
(--force-ocr is not suitable for this task.)

Jmuccigr · 2024-04-01T10:30:28Z

I wrote a script that redoes the OCR on PDFs by deleting any original text from the file and then using ocrmypdf to generate new OCR, which I then add to the original file. I use it mainly to replace the often bad OCR in JSTOR files. It uses Ghostscript to remove the text and relies on some other tools that you can see in the code.

ybeltukov added the bug label Mar 17, 2024

ybeltukov assigned jbarlow83 Mar 17, 2024
