--tesseract-timeout=0 and skipping ocr #1266

nikitar · 2024-02-29T16:35:19Z

I'm looking at the "Don’t actually OCR my PDF" section of the docs and trying that, but I'm still getting OCR'd results. (I'm checking by opening the PDF in Chrome: no text selection before, text selection after. I can also tell it's doing OCR because it takes a while on multipage documents.)

Do I misunderstand the API?

The command: ocrmypdf --tesseract_timeout=0 --output-type=pdf in.pdf out.pdf
Version: 16.0.3
Platform: macos 14.3.1

nikitar · 2024-02-29T16:40:44Z

Additionally, the docs appear to conflict with regard to skipping OCR. There is a section called "Optimize images without performing OCR", which recommends --skip-text in addition to --tesseract-timeout=0 (because otherwise the invocation will fail on a document with text).

In the Advanced section, however, it says no image processing takes place when --skip-text is used. That seems to conflict with the "optimize images" part above.

"If --skip-text is issued, then no image processing or OCR will be performed on pages that already have text. The page will be copied to the output."

Is one of those incorrect, or is there extra context I am missing?

3 replies

jbarlow83 Feb 29, 2024
Maintainer

Image preprocessing is not optimization, at least not in the terminology I've adopted.

Image preprocessing is deskewing and cleaning. I used to call that image preprocessing everywhere, but it looks a bit odd, if you have OCR turned off, to say that the file was preprocessed and then it's done, so the label shown is "image processing".

Optimization is reducing image size after performing OCR (because we don't want to degrade the images in any way before OCR).

Using --skip-text together with --tesseract-timeout=0 ensures no image processing takes place (just optimization).
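
For anyone scripting this, the two-flag combination above can be sketched as a small Python wrapper. This is only an illustration, not part of ocrmypdf: the helper names optimize_only_command and run_optimize_only are hypothetical, and it assumes the ocrmypdf executable is on PATH. Only the two flags come from this thread.

```python
import subprocess


def optimize_only_command(input_pdf, output_pdf):
    """Build an ocrmypdf argument list with the two flags discussed above.

    Hypothetical helper for illustration; only the flags come from the thread.
    """
    return [
        "ocrmypdf",
        "--skip-text",            # per the docs: pages that already have text are copied as-is
        "--tesseract-timeout=0",  # give Tesseract no time, so OCR is effectively skipped
        input_pdf,
        output_pdf,
    ]


def run_optimize_only(input_pdf, output_pdf):
    """Run the command; assumes ocrmypdf is installed and on PATH."""
    cmd = optimize_only_command(input_pdf, output_pdf)
    # check=True raises CalledProcessError if ocrmypdf exits with an error
    return subprocess.run(cmd, check=True)
```

Calling run_optimize_only("in.pdf", "out.pdf") would then be equivalent to typing the command by hand. Building the argument list as a Python list also makes it harder to mistype a flag name.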

I suppose I should probably add new CLI drivers sometime, such as optimizemypdf (?) or processmypdf (too generic?) or something. But I cringe at the obvious name. To 2024 eyes, even ocrmypdf reads as a bit of a silly name, so I'm left with the question of what to call these things.

The optimization is also not a general PDF optimizer. It's designed around optimizing scanned images after you've done OCR, with some fairly specific assumptions, namely that aggressive JPEG and lossy PNG compression are better than downsampling. (They usually are: lossy compression can be thought of as variable-resolution downsampling that ideally concentrates on salient features at the expense of less relevant ones.) There are a lot of possible optimization opportunities not taken.

nikitar Mar 1, 2024
Author

Thank you, it makes sense! As for the name, I think if the tool is useful, people don't mind the name one way or another. And this seems like a very useful tool, though I'm still new to it.

Regarding the original question, is that a bug, or do I misunderstand the API? I also tried version 14.0.4, and that one does seem to skip OCR when using the same two flags.

nikitar Mar 6, 2024
Author

Resolved: it was a local misconfiguration.
