Is it possible? #46
Comments
I think what I am asking can be achieved by getting tesseract to use the "hocr" option, which makes it output HTML that includes box coordinates for each text item. Now the question is: can the module pass this option through? |
OK, I've modified tesseract.js, inserting:
at line 70. This results in the output being HTML with box coordinates for every text item. Is there another way of doing this without modifying tesseract.js? |
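For anyone who ends up with hOCR output and only needs the coordinates, here is a minimal sketch of pulling word boxes out of the markup. It assumes each word is a `span` with class `ocrx_word` whose `title` attribute starts with `bbox x0 y0 x1 y1`, and that the span contains plain text with no nested tags. The function name `parseHocrWords` and the sample markup are made up for illustration; for real documents a proper HTML parser would be more robust than a regular expression.

```javascript
// Hypothetical helper: extract word text and bounding boxes from hOCR markup.
// Assumption: words look like
//   <span class='ocrx_word' ... title='bbox x0 y0 x1 y1; ...'>text</span>
// with no nested tags inside the word span.
function parseHocrWords(hocr) {
  const words = [];
  // Matches only spans whose content is plain text (innermost spans).
  const spanRe = /<span\b([^>]*)>([^<]*)<\/span>/g;
  let match;
  while ((match = spanRe.exec(hocr)) !== null) {
    const attrs = match[1];
    if (!/class=['"]ocrx_word['"]/.test(attrs)) continue;
    const bbox = /bbox (\d+) (\d+) (\d+) (\d+)/.exec(attrs);
    if (!bbox) continue;
    const [x0, y0, x1, y1] = bbox.slice(1).map(Number);
    words.push({ text: match[2].trim(), x0, y0, x1, y1 });
  }
  return words;
}

// Illustrative sample, not real tesseract output.
const sample =
  "<span class='ocr_line' title='bbox 36 92 580 122'>" +
  "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hello</span> " +
  "<span class='ocrx_word' id='word_1_2' title='bbox 109 92 200 116; x_wconf 93'>world</span>" +
  "</span>";

const words = parseHocrWords(sample);
console.log(words);
```

The coordinates are pixel positions in the source image, so they still need to be scaled if they are to line up with PDF coordinates from another tool.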
After searching around, it seems the built-in, supported way to do this is to add a 'format' option to the options, specifying 'hocr' as the value. [edit] Unfortunately it didn't help, so I'm back to using the solution in the previous post. |
Does anyone maintain this module anymore? |
You are honestly better off using a library that has native bindings to tesseract. Or just replicate what this module does; it doesn't do anything special. In fact, you could rewrite it much more cleanly with ES6 syntax. |
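As a rough sketch of the "replicate what this does" route, the snippet below shells out to the `tesseract` command-line tool with the `hocr` config name appended, which asks it to write hOCR instead of plain text. The helper names `buildHocrArgs` and `runHocr` are made up for illustration and are not part of node-tesseract. It assumes `tesseract` is on the PATH; the output file extension (`.hocr` or `.html`) depends on the installed tesseract version.

```javascript
const { execFile } = require('child_process');

// Hypothetical helper: argument list for
//   tesseract <image> <outputBase> -l <lang> hocr
function buildHocrArgs(imagePath, outputBase, lang = 'eng') {
  return [imagePath, outputBase, '-l', lang, 'hocr'];
}

// Hypothetical helper: run tesseract and report the output base name.
// The caller still has to read <outputBase>.hocr (or .html on older versions).
function runHocr(imagePath, outputBase, callback) {
  execFile('tesseract', buildHocrArgs(imagePath, outputBase), (err) => {
    if (err) return callback(err);
    callback(null, outputBase);
  });
}

const exampleArgs = buildHocrArgs('page.png', 'page');
console.log(exampleArgs.join(' '));
```

Using `execFile` with an argument array avoids building a shell command string, so file names with spaces or quotes do not need escaping.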
@reecefenwick, thank you. I searched around today, and from what I was able to find, node-tesseract seems to be the best module for Node.js. I will modify the code tonight and implement "hocr" via the options. I've also ordered a book on ES6, as so far I haven't been familiar with it or what it can do. |
I think you can first modify the default options variable at line 22 of tesseract.js:
then, at line 70, add:
In your code, if you want to get hOCR output, do something like this:
|
For PDFs that contain text I am using pdf2json, which gives me all the text nodes and their PDF coordinates. For PDFs that do not contain text I am using node-tesseract; however, this extracts just the text. Is it possible to get the coordinates of the text to go along with the output?