Tesseract OCR - Ubuntu and Alpine Linux images.
Tesseract and Leptonica are both built from source for each platform and distro. Supported platforms are amd64 (x86_64) and arm64 (aarch64).
Versions indicate the OS version (or the name, in the case of Alpine). Images with the 4- prefix use Tesseract version 4, while images without the prefix use version 5.
All versions use the same training data.
Images can be found at:
- Docker Hub: jitesoft/tesseract-ocr
- GitLab: registry.gitlab.com/jitesoft/dockerfiles/tesseract
- GitHub: ghcr.io/jitesoft/tesseract
- Quay: quay.io/jitesoft/tesseract
The Dockerfile can be found on GitLab or GitHub.
The default image has the English training data installed from the start. The training data used is the "fast" data. It parses quicker, but not at the best quality.
It's possible to install another language by invoking the train-lang script, followed by the language code (ISO 639-2: eng, swe, etc.). If you wish to use the fast or best data, add that as an optional parameter after the language code (train-lang eng --fast); otherwise the standard data is used when no extra argument is given.
The above could easily be done in a derived image:
FROM jitesoft/tesseract-ocr
RUN train-lang bul --fast
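When several languages are needed, the derived image above can be generated rather than written by hand. The following is a minimal sketch, not part of the image: the make_dockerfile helper is a hypothetical name, and it assumes the "fast" data is wanted for every language.

```shell
# Hypothetical helper (not shipped with the image): print a Dockerfile that
# installs the "fast" data for every language code passed as an argument.
make_dockerfile() {
  printf 'FROM jitesoft/tesseract-ocr\n'
  for lang in "$@"; do
    # One RUN line per language, mirroring the derived-image example.
    printf 'RUN train-lang %s --fast\n' "$lang"
  done
}
```

The output could then be piped to a build, for example make_dockerfile swe bul | docker build -t my-tesseract - (the my-tesseract tag is only an illustration).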
The languages are downloaded from the official tesseract tessdata repositories.
For a full list of supported languages check the following links:
https://github.com/tesseract-ocr/tessdata
https://github.com/tesseract-ocr/tessdata_best
https://github.com/tesseract-ocr/tessdata_fast
It is also possible to just copy a traineddata file to the /usr/local/share/tessdata (/usr/share/tessdata on Alpine) directory of the container.
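Instead of copying, a traineddata file can also be bind-mounted into the same directory at run time. The sketch below only builds the mount argument. The helper names (tessdata_dir, traineddata_mount) and the "alpine"/"ubuntu" distro argument are assumptions for illustration, not something the image provides.

```shell
# Hypothetical helper: print the tessdata directory for a given distro.
# Any value other than "alpine" is treated as the Ubuntu layout.
tessdata_dir() {
  case "$1" in
    alpine) echo /usr/share/tessdata ;;
    *)      echo /usr/local/share/tessdata ;;
  esac
}

# Hypothetical helper: print a read-only "-v" value that mounts a
# traineddata file (absolute host path) into the tessdata directory.
traineddata_mount() {
  printf '%s:%s/%s:ro\n' "$1" "$(tessdata_dir "$2")" "$(basename "$1")"
}

# Possible usage (paths are placeholders):
#   docker run --rm \
#     -v "$(traineddata_mount /path/to/deu.traineddata ubuntu)" \
#     -v /path/to/image/img.jpg:/tmp/img.jpg \
#     jitesoft/tesseract-ocr /tmp/img.jpg stdout -l deu
```

The -l option is Tesseract's own language selector. The usage in the comment assumes the image passes extra arguments straight to tesseract, as the run example in this README suggests.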
docker pull jitesoft/tesseract-ocr
docker run -v /path/to/image/img.jpg:/tmp/img.jpg jitesoft/tesseract-ocr /tmp/img.jpg stdout
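For repeated use, the run command above can be wrapped in a small function. This is a sketch under assumptions: ocr_image is a hypothetical name, and the --rm and :ro options are additions that remove the container afterwards and mount the image read-only.

```shell
# Hypothetical wrapper: OCR a single image file and print the text to stdout.
ocr_image() {
  # Docker bind mounts need an absolute host path, so resolve the directory.
  dir=$(cd "$(dirname "$1")" && pwd) || return 1
  name=$(basename "$1")
  docker run --rm \
    -v "$dir/$name:/tmp/$name:ro" \
    jitesoft/tesseract-ocr "/tmp/$name" stdout
}
```

A loop such as for f in scans/*.jpg; do ocr_image "$f" > "$f.txt"; done would then process a whole directory (scans/ is a placeholder).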
Use a high-DPI image for best results. A higher DPI does increase the time to run, though.
This image follows the Jitesoft image label specification 1.0.0.
The images and scripts in the repository are released under the MIT license.
Tesseract is released under the Apache License v2.
Notice: The Tesseract source has been modified with a patch (alpine/tess.patch) to allow for compilation on Alpine Linux.
Jitesoft images are built via GitLab CI on runners hosted by the following wonderful organisations:
The companies above are not affiliated with Jitesoft or any Jitesoft Projects directly.
Sponsoring is vital for the further development and maintenance of open source.
Questions and sponsoring queries can be made by email.
If you wish to sponsor our projects, reach out to the email above or visit any of the following sites: