New hOCR renderer - testing and feedback requested #1194

jbarlow83 (Maintainer) · 2023-11-21T08:31:11Z

I rewrote the hocr renderer, essentially by copying what Tesseract does in the sandwich renderer (Tesseract's PDF generator) and fixing its issues.

I intend to make this renderer the default, with sandwich as a fallback. I'd appreciate testing and feedback, since this affects everything OCRmyPDF outputs.

It is in the feature/modernhocr branch:

git clone --branch feature/modernhocr https://github.com/ocrmypdf/OCRmyPDF.git

I believe I have solved/improved:

Arabic, Hebrew and other right to left languages rendering incorrectly in sandwich
The notorious wordssmushedtogetherwithoutspaces issue in Latin languages
Some character output problems with German Fraktur (see "[Bug]: sandwich renders differently than hocr", #1191)
Better positioning of text on a skewed baseline, compared to sandwich (see "OCR picks up all the text, but alignment is off", #1009)
hocr renderer now supports all languages, not just Latin
Non-Tesseract renderers can be adapted more easily to OCRmyPDF, since it's just a matter of converting their JSON output to hOCR.
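
To illustrate the last point, here is a minimal sketch of what converting an engine's JSON output to hOCR might look like. The JSON layout (`width`, `height`, `lines`, `words`, `bbox`, `conf`) and the function name `json_to_hocr` are assumptions made up for this example; a real engine will use its own schema. Only the hOCR class names (`ocr_page`, `ocr_line`, `ocrx_word`) and the `bbox` / `x_wconf` title properties come from the hOCR format itself.

```python
import html


def json_to_hocr(page: dict) -> str:
    """Convert one page of a hypothetical OCR JSON result into minimal hOCR."""

    def title(box, extra=""):
        # hOCR stores geometry in the title attribute: "bbox x0 y0 x1 y1"
        coords = " ".join(str(int(v)) for v in box)
        return f"bbox {coords}{extra}"

    page_box = [0, 0, page["width"], page["height"]]
    out = [
        '<html xmlns="http://www.w3.org/1999/xhtml">',
        "<body>",
        f'<div class="ocr_page" title="{title(page_box)}">',
    ]
    for line in page["lines"]:
        out.append(f'<span class="ocr_line" title="{title(line["bbox"])}">')
        for word in line["words"]:
            # x_wconf is the hOCR word confidence on a 0-100 scale;
            # the hypothetical JSON is assumed to report 0.0-1.0
            conf = f'; x_wconf {round(word.get("conf", 1.0) * 100)}'
            text = html.escape(word["text"])
            out.append(
                f'<span class="ocrx_word" title="{title(word["bbox"], conf)}">{text}</span>'
            )
        out.append("</span>")
    out += ["</div>", "</body>", "</html>"]
    return "\n".join(out)


# Hypothetical engine output for a one-word page
sample = {
    "width": 100,
    "height": 50,
    "lines": [
        {
            "bbox": [10, 10, 90, 30],
            "words": [{"text": "A&B", "bbox": [10, 10, 40, 30], "conf": 0.98}],
        }
    ],
}
hocr = json_to_hocr(sample)
```

The resulting string could be written to a file and handed to an hOCR-based renderer. Details such as baselines, text direction, and language attributes are omitted here and would likely matter for skewed or right-to-left text.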

The cost is:

hocr renderer no longer renders a visible text layer underneath the page image - it uses the "GlyphlessFont" from Tesseract

Asian language characters still have problems with extra word breaks. The hOCR output from Tesseract separates words differently.
For example, in one test document in #715, the characters in the first bullet
进入新发展阶段
have no spaces, but Tesseract reports them as
进入 / 新 / 发 / 展 / 阶段 where each / is an explicit word break. Is there any consistent way to determine where spaces between glyphs are expected in Asian languages?
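
One possible heuristic, sketched here only to make the question concrete: when joining words, skip the space if the characters on both sides of the break fall in CJK code point ranges. This is an assumption-laden illustration, not what the new renderer does; the function names are hypothetical, the ranges are deliberately incomplete, and Hangul is excluded because Korean is written with spaces.

```python
# Code point ranges treated as "written without inter-word spaces".
# This list is an illustrative subset, not an authoritative one.
NO_SPACE_RANGES = (
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xFF00, 0xFFEF),  # Halfwidth and Fullwidth Forms
)


def is_no_space_char(ch: str) -> bool:
    """Return True if ch falls in one of the ranges listed above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NO_SPACE_RANGES)


def join_words(words: list[str]) -> str:
    """Join OCR words, omitting the space between two adjacent CJK characters."""
    result = ""
    for word in words:
        if not word:
            continue
        if result and not (is_no_space_char(result[-1]) and is_no_space_char(word[0])):
            result += " "
        result += word
    return result


joined = join_words(["进入", "新", "发", "展", "阶段"])
```

With the example above, `joined` contains no spaces, while Latin words such as `["hello", "world"]` are still joined with a space. A script-based rule like this cannot tell a genuine intended gap from an OCR word break, so it would probably need to be combined with the word bounding boxes.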

Some of the new code ought to be in pikepdf and will be migrated there eventually.

femifrak · 2023-11-21T19:31:32Z

I like the idea that the hocr transformer could become the standard, because that promises a permanent future for it. I'm relying on it because I want to replace "archaic separators" #907 with normal ones.
The last comment in #1158 sounds great!
