Is it possible? #46

Open
SPlatten opened this issue May 19, 2017 · 7 comments

@SPlatten

For PDFs that contain text I am using pdf2json, which gives me all the text nodes and their PDF coordinates. For PDFs that do not contain text I am using node-tesseract; however, this extracts just the text. Is it possible to get the coordinates of the text to go along with the output?

@SPlatten
Author

I think what I am asking can be achieved by getting tesseract to use the "hocr" option, which causes it to output HTML that includes box coordinates for each text item. Now the question is, can the module pass this option through?
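For readers wondering what to do with the hOCR output once they have it, here is a minimal sketch of pulling the words and their boxes out of the HTML. It assumes each word is emitted as a `<span class='ocrx_word' ... title='bbox x0 y0 x1 y1; ...'>` element with `class` appearing before `title`, which is how the output usually looks but may differ between tesseract versions. `parseHocrWords` is a hypothetical helper name, not part of node-tesseract, and a real HTML parser would be more robust than a regular expression.

```javascript
// Sketch only: extract word text and bounding boxes from hOCR markup.
// Assumption: words are <span class="ocrx_word" ... title="bbox x0 y0 x1 y1; ..."> elements,
// with the class attribute appearing before the title attribute.
function parseHocrWords(hocr) {
  const wordPattern =
    /<span[^>]*class=['"]ocrx_word['"][^>]*title=['"]([^'"]*)['"][^>]*>([\s\S]*?)<\/span>/g;
  const words = [];
  let match;
  while ((match = wordPattern.exec(hocr)) !== null) {
    // The title attribute carries the box as "bbox left top right bottom".
    const bbox = /bbox (\d+) (\d+) (\d+) (\d+)/.exec(match[1]);
    if (!bbox) {
      continue;
    }
    // Drop nested markup such as <strong> or <em> around the word text.
    const text = match[2].replace(/<[^>]+>/g, '').trim();
    words.push({
      text: text,
      x0: Number(bbox[1]),
      y0: Number(bbox[2]),
      x1: Number(bbox[3]),
      y1: Number(bbox[4])
    });
  }
  return words;
}
```

Note that the coordinates are in image pixels with the origin at the top left, so they would still need converting if they have to line up with PDF coordinates.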

@SPlatten
Author

Ok, I've modified tesseract.js inserting:

    command.push("hocr");

at line 70. This results in the output being HTML with box coordinates for every text item. Is there another way of doing this without modifying tesseract.js?

@SPlatten
Author

SPlatten commented May 22, 2017

After searching around, it seems the built-in, supported way to do this is to add a 'format' option to the options array, specifying 'hocr' as the value.

[edit] Unfortunately it didn't help, so I'm back to using the solution in the previous post.

@SPlatten
Author

Does anyone maintain this module anymore?

@reecefenwick

You are honestly better off using a library that has native bindings to tesseract.

Or just replicate what this does. This library doesn't do anything special; in fact, you could rewrite it much more cleanly with ES6 syntax.
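As a rough idea of what "replicating it" in ES6 could look like, here is a minimal sketch that shells out to the tesseract binary directly. The function names (`buildTesseractArgs`, `runTesseract`) and the option names are hypothetical, not node-tesseract's API. It assumes a tesseract build that accepts `stdout` as the output base and the `hocr` config name as a trailing argument. The page segmentation flag is `-psm` on the 3.x releases and `--psm` on newer ones, so it is left configurable.

```javascript
const { execFile } = require('child_process');

// Sketch only: build the argument list for the tesseract CLI.
// Assumptions: "stdout" is accepted as the output base, and "hocr" is a
// config name that switches the output to hOCR.
function buildTesseractArgs(imagePath, options = {}) {
  const { lang = 'eng', psm = 3, hocr = false, psmFlag = '-psm' } = options;
  const args = [imagePath, 'stdout', '-l', lang, psmFlag, String(psm)];
  if (hocr) {
    args.push('hocr');
  }
  return args;
}

// Run tesseract and resolve with whatever it prints (plain text or hOCR).
function runTesseract(imagePath, options = {}) {
  const binary = options.binary || 'tesseract';
  return new Promise((resolve, reject) => {
    execFile(
      binary,
      buildTesseractArgs(imagePath, options),
      { maxBuffer: 16 * 1024 * 1024 }, // hOCR output can be large
      (err, stdout) => {
        if (err) {
          reject(err);
        } else {
          resolve(stdout);
        }
      }
    );
  });
}
```

Usage would be along the lines of `runTesseract('/test.png', { lang: 'eng', hocr: true }).then(console.log, console.error)`. Using `execFile` with an argument array avoids building a shell command string, so file names with spaces need no quoting.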

@SPlatten
Author

@reecefenwick, thank you. I did a search around today, and from what I was able to find, node-tesseract seems to be the best module for node.js.

I will modify the code tonight and implement "hocr" via the options. I've also ordered a book on ES6, as so far I haven't been familiar with it or what it can do.

@gforcelong

I think you can first modify the default options at line 22 of tesseract.js:

    options: {
        'l': 'eng',
        'psm': 3,
        'config': null,
        'binary': 'tesseract',
        'hocr': null
    },

Then, at line 70, add:

    if (options.hocr !== null) {
        command.push('hocr');
    }

In your own code, if you want to get hOCR output, do something like this:

    var options = {
        l: 'chi_sim+eng',
        psm: 4,
        hocr: 'hocr'
    };

    tesseract.process('/test.png', options, function(err, text) {
        if (err) {
            console.error(err);
        } else {
            console.log('----------------------------');
            console.log(text);
        }
    });
