New hOCR renderer - testing and feedback requested #1194
jbarlow83
announced in
Announcements
Replies: 1 comment
-
I like the idea that the hocr transformer could become the standard, because that promises it a lasting future. I'm relying on it because I want to replace the "archaic separators" (#907) with normal ones.
-
I rewrote the hocr renderer, essentially by copying what Tesseract does in the sandwich renderer (Tesseract's PDF generator) and fixing its issues.
I intend to make this renderer the default, with sandwich as a fallback. I'd appreciate testing and feedback, since this affects everything OCRmyPDF outputs.
It is in the feature/modernhocr branch.
git clone --branch feature/modernhocr https://github.com/ocrmypdf/OCRmyPDF.git
I believe I have solved/improved:
The cost is:
Asian language characters still have problems with extra word breaks. The hOCR output from Tesseract separates words differently.
For example, in one test document in #715, the characters in the first bullet
进入新发展阶段
have no spaces, but Tesseract reports them as
进入 / 新 / 发 / 展 / 阶段, where each / is an explicit word break. Is there any consistent way to determine where spaces between glyphs are expected in Asian languages?
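One possible heuristic, offered only as a sketch and not as what the new renderer does: when joining the words Tesseract reports, drop the word break whenever both neighbouring characters fall in a CJK ideograph, kana, or fullwidth block. The helper names `is_cjk` and `join_words` are hypothetical, and the code point ranges are a deliberately incomplete assumption. Hangul is left out on purpose because Korean is written with spaces.

```python
def is_cjk(ch: str) -> bool:
    """Return True if ch is in one of a few common CJK blocks (not exhaustive)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0x3000 <= cp <= 0x303F   # CJK Symbols and Punctuation
        or 0xFF00 <= cp <= 0xFFEF   # Halfwidth and Fullwidth Forms
    )


def join_words(words: list[str]) -> str:
    """Join OCR words, omitting the space between two adjacent CJK characters."""
    parts: list[str] = []
    prev = ""
    for word in words:
        if not word:
            continue  # skip empty tokens
        # Insert a space unless both sides of the boundary are CJK.
        if prev and not (is_cjk(prev[-1]) and is_cjk(word[0])):
            parts.append(" ")
        parts.append(word)
        prev = word
    return "".join(parts)


if __name__ == "__main__":
    print(join_words(["进入", "新", "发", "展", "阶段"]))
    print(join_words(["hello", "world"]))
```

This only looks at the characters on each side of a break, so it cannot tell a genuine gap in the scanned page from a segmentation artifact. Comparing word bounding boxes from the hOCR would be needed for that.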
Some of the new code ought to be in pikepdf and will be migrated there eventually.