Unable to process pdfs - Windows #43

Closed
fraserpage opened this issue Apr 20, 2016 · 8 comments

@fraserpage
fraserpage commented Apr 20, 2016

I'm seeing the following on Windows 10. Your assistance would be greatly appreciated.

Syntax Warning: Bad annotation destination
Syntax Warning: Bad annotation destination

I see about 30 lines of the above when trying to process a PDF with pypdfocr filename.pdf.
I see the output below with any usage.

WARNING: Could not execute identify to calculate DPI (try installing imagemagick?), so defaulting to 300dpi
Traceback (most recent call last):
File "", line 495, in
File "", line 492, in main
File "", line 474, in go
File "", line 480, in _convert_and_file_email
File "", line 359, in run_conversion
File "C:\Users\Virantha Ekanayake\dev\pypdfocr\build\pypdfocr_windows\out00-PYZ.pyz\pypdfocr_tesseract", line 130, in make_hocr_from_pnms
File "C:\Users\Virantha Ekanayake\dev\pypdfocr\build\pypdfocr_windows\out00-PYZ.pyz\pypdfocr_tesseract", line 96, in _is_version_uptodate
ValueError: invalid literal for int() with base 10: '00dev'

All dependencies are installed.

@flothesof

Hi @fraserpage

I've had the same issue. In my case, the code that parses the version string reported by tesseract does not work as the author intended. In particular, my version string was 3.05.00dev, which caused the same error you saw when the script tried to parse it and check whether the version was up to date.

As a workaround, you can add the two lines under the "added" comment below to the file pypdfocr_tesseract.py, found in python27\Lib\site-packages\pypdfocr:

for line in ret_output.splitlines():
    if 'tesseract' in line:
        ver_str = line.split(' ')[1]
        # added: strip a trailing 'dev' so that '3.05.00dev' becomes '3.05.00'
        if ver_str.endswith('dev'):
            ver_str = ver_str[:-3]

Hope this helps,

Florian
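
For readers who want to see the failure in isolation, here is a minimal sketch. It assumes the version check converts each dot-separated component with int(), which is consistent with the '00dev' in the traceback but is not taken from the pypdfocr source. The helper names parse_version and strip_dev_suffix are hypothetical.

```python
def parse_version(ver_str):
    """Hypothetical helper: turn '3.05.00' into (3, 5, 0)."""
    return tuple(int(part) for part in ver_str.split('.'))


def strip_dev_suffix(ver_str):
    """Drop a trailing 'dev', as in the workaround above."""
    if ver_str.endswith('dev'):
        return ver_str[:-3]
    return ver_str


# Without the workaround, the last component '00dev' cannot be converted.
try:
    parse_version('3.05.00dev')
except ValueError as err:
    print('parse failed:', err)

# With the suffix stripped, every component is a plain integer.
print(parse_version(strip_dev_suffix('3.05.00dev')))
```

The second call should print (3, 5, 0), since int() accepts the leading zeros in '05' and '00'.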

@fraserpage
Author

Thanks very much @flothesof! That got it working for me.

I'm still seeing the warning about imagemagick. Any clues on that one?
WARNING: Could not execute identify to calculate DPI (try installing imagemagick?), so defaulting to 300dpi

@flothesof

Hey there!

The warning is normal; the program is just telling us it would like to do
some additional checks before adding OCR. I've found no problems with the
default resolution while using it.

Best regards
Florian


@fraserpage
Author

Got it. Thanks for your help!

@virantha
Owner

Going to reopen and fix this in source for next release. Thanks for pointing this out, folks!

@virantha virantha reopened this Jun 23, 2016
@virantha virantha added this to the 0.9.1 milestone Jun 23, 2016
@virantha virantha self-assigned this Jun 23, 2016
@dwmcqueen

Hi - can the exe be fixed with this same patch?

@rasa

rasa commented Jan 8, 2017

@flothesof: The warning message is actually not normal, but is reporting an error on Windows. This has been fixed in #54
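
As a quick diagnostic, the sketch below checks whether a command named identify can be found on the PATH from Python. It is an assumption that this mirrors how pypdfocr looks for ImageMagick; it does not reproduce the fix referenced above. The helper name find_identify is hypothetical, and shutil.which requires Python 3.3 or later.

```python
import shutil


def find_identify(command="identify"):
    """Return the full path of `command` if it is on PATH, otherwise None."""
    return shutil.which(command)


path = find_identify()
if path is None:
    print("identify was not found on PATH; ImageMagick may be missing "
          "or its install directory may not be on PATH")
else:
    print("identify found at", path)
```

If this reports that identify is missing even though ImageMagick is installed, the install directory is probably not on the PATH of the process running the script.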

@qi55wyqu

qi55wyqu commented Nov 8, 2017

I'm still running into this problem in version 0.9.1 when the version string ends with the word alpha.
My added fix for this (based on flothesof's answer):

checkFileEndings = ['dev', 'alpha']
for line in ret_output.splitlines():
    if 'tesseract' in line:
        ver_str = line.split(' ')[1]
        for fileEnding in checkFileEndings:
            if ver_str.endswith(fileEnding):
                ver_str = ver_str[:-len(fileEnding)]
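
Rather than listing every possible suffix, one alternative is to keep only the leading numeric part of the version string. The sketch below uses a regular expression for that; the helper name numeric_version is hypothetical, and apart from 3.05.00dev the example version strings are illustrative, not ones reported in this thread.

```python
import re


def numeric_version(ver_str):
    """Return the first run of dot-separated integers as a tuple of ints.

    Anything after the numeric part ('dev', 'alpha', ...) is ignored.
    """
    match = re.search(r'\d+(?:\.\d+)*', ver_str)
    if match is None:
        raise ValueError('no numeric version found in %r' % ver_str)
    return tuple(int(part) for part in match.group(0).split('.'))


# '3.05.00dev' is the string from this thread; the others are illustrative.
for example in ('3.05.00dev', '4.00.00alpha', '3.04.01'):
    print(example, '->', numeric_version(example))
```

Tuples compare element by element, so a check such as numeric_version(ver_str) >= (3, 2, 1) works without further parsing; the threshold shown here is only an example.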
