
google.protobuf.message.DecodeError: Error parsing message #154

Closed
legolego opened this issue Oct 31, 2019 · 41 comments

Comments

@legolego

legolego commented Oct 31, 2019

Description
I think this is similar to a bug in the old python library:
python-stanford-corenlp.
I'm trying to copy the demo for the client here or here,
but with my own texts... text2 works and text3 doesn't; the only difference between them is the very last word.

The error I get is:

Traceback (most recent call last):
  File "C:/gitProjects/patentmoto2/scratch4.py", line 23, in <module>
    ann = client.annotate(text)
  File "C:\gitProjects\patentmoto2\venv\lib\site-packages\stanfordnlp\server\client.py", line 403, in annotate
    parseFromDelimitedString(doc, r.content)
  File "C:\gitProjects\patentmoto2\venv\lib\site-packages\stanfordnlp\protobuf\__init__.py", line 18, in parseFromDelimitedString
    obj.ParseFromString(buf[offset+pos:offset+pos+size])
google.protobuf.message.DecodeError: Error parsing message
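For context on where this error surfaces: `parseFromDelimitedString` reads a varint length prefix from the response buffer and then hands that many bytes to `ParseFromString`, which raises `DecodeError` if those bytes are not a valid message. A minimal pure-Python sketch of the varint framing involved (the `read_varint` helper below is my own illustration, not part of the stanfordnlp library):

```python
def read_varint(buf, pos=0):
    """Decode a base-128 varint starting at buf[pos].

    Protobuf delimited streams prefix each message with its size
    encoded this way. Returns (value, position_after_varint).
    """
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry payload
        if not (byte & 0x80):             # high bit clear = last byte
            return result, pos
        shift += 7

# 300 is encoded in two bytes, 0xAC 0x02:
size, offset = read_varint(b"\xac\x02", 0)
print(size, offset)  # 300 2
```

If the bytes after the prefix are shorter than the advertised size, or are not a well-formed message, `ParseFromString` fails exactly as in the traceback above.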

To Reproduce

Steps to reproduce the behavior:


from stanfordnlp.server import CoreNLPClient

print('---')
print('input text')
print('')

text = "Chris Manning is a nice person. Chris wrote a simple sentence. He also gives oranges to people."
text2 = "We claim:1. A photographic camera for three dimension photography comprising:a housing having an opening to the interior for light rays;means for immovably locating photosensitive material in communication with the interior of the housing at a location during a time for exposure;optical means in said housing for projecting light rays, which are received through said opening from a scene to be photographed, along an optical path to said location, said path having a first position therealong extending transversely to the direction of the path from a first side to a second side of the path, the optical means comprisinga lenticular screen extending across said path at a second position farther along said path from the first position and having, on one side, a plurality of elongated lenticular elements of width P which face in the direction from which the light rays are being projected and having an opposite side facing and positioned for contact with the surface of such located photosensitive material,the optical means being characterized in that it changes, by a predetermined distance Y, on such surface of the photosensitive material, the position of light rays which come from a substantially common point on such scene and which extend along said first and second sides of said path;means for blocking the received light rays at said first position;an aperture movable transversely across said path at said first position, from said first side to said second said, for exposing said light rays sequentially to the photosensitive material moving across said screen in a direction normal to the elongation of said lenticular elements; andmeans for so moving said aperture for a predetermined time for exposure while simultaneously and synchronously moving said screen, substantially throughout said predetermined time for exposure, in substantially the same direction as the light rays sequentially expose said photosensitive material and over a distance substantially equal 
to the sum of P + Y to thereby expose a substantially continuous unreversed image of the scene on the photosensitive material, said means for and doing this all day long and."
text3 = "We claim:1. A photographic camera for three dimension photography comprising:a housing having an opening to the interior for light rays;means for immovably locating photosensitive material in communication with the interior of the housing at a location during a time for exposure;optical means in said housing for projecting light rays, which are received through said opening from a scene to be photographed, along an optical path to said location, said path having a first position therealong extending transversely to the direction of the path from a first side to a second side of the path, the optical means comprisinga lenticular screen extending across said path at a second position farther along said path from the first position and having, on one side, a plurality of elongated lenticular elements of width P which face in the direction from which the light rays are being projected and having an opposite side facing and positioned for contact with the surface of such located photosensitive material,the optical means being characterized in that it changes, by a predetermined distance Y, on such surface of the photosensitive material, the position of light rays which come from a substantially common point on such scene and which extend along said first and second sides of said path;means for blocking the received light rays at said first position;an aperture movable transversely across said path at said first position, from said first side to said second said, for exposing said light rays sequentially to the photosensitive material moving across said screen in a direction normal to the elongation of said lenticular elements; andmeans for so moving said aperture for a predetermined time for exposure while simultaneously and synchronously moving said screen, substantially throughout said predetermined time for exposure, in substantially the same direction as the light rays sequentially expose said photosensitive material and over a distance substantially equal 
to the sum of P + Y to thereby expose a substantially continuous unreversed image of the scene on the photosensitive material, said means for and doing this all day long and his."

text = text3
print(text)


print('---')
print('starting up Java Stanford CoreNLP Server...')


with CoreNLPClient(endpoint="http://localhost:9000", annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse', 'depparse', 'coref'],
                   timeout=70000, memory='16G', threads=10, be_quiet=False) as client:

    ann = client.annotate(text)


    sentence = ann.sentence[0]


    print('---')
    print('constituency parse of first sentence')
    constituency_parse = sentence.parseTree
    print(constituency_parse)

Expected behavior
I expect it to finish. text=text2 succeeds, but text=text3 fails with the above error. The only difference between the texts is the last word 'his' (could really be anything I think).

Environment:

  • OS: Windows 10
  • Python version: 3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)]
  • CoreNLP 3.9.2
  • corenlp-protobuf==3.8.0
  • protobuf==3.10.0
  • stanfordnlp==0.2.0
  • torch==1.1.0

Additional context
I've also gotten a timeout error for some sentences, but it's intermittent. I'm not sure if they're related, but this is easier to reproduce.

@legolego legolego added the bug label Oct 31, 2019
@HsiehTommy

Got the same problem here. I would like to know what this error actually means.

@yuhaozhang
Member

Hi @legolego,

I ran your provided example, and both text2 and text3 gave me a timeout issue. I think the main problem is that both sentences are too long for CoreNLP to process within the specified timeout period. This is especially true for the dependency parser and the coref annotator, which slow down significantly on very long sentences. After removing depparse and coref from the annotator list, I was able to annotate both sentences and did not see any protobuf issue.

So in general I wasn't able to reproduce the protobuf error that you mentioned. Are you still seeing the same issue (even after removing depparse and coref)?

@legolego
Author

Thanks for the reply :)
I tried the code after removing depparse and coref, but there was no change. I haven't had a timeout issue recently.

Running with text2 and depparse and coref removed completes in 18 seconds.
Running with text3 and depparse and coref removed fails at 18 seconds.

Running with text2 and depparse and coref NOT removed completes in 36 seconds.
Running with text3 and depparse and coref NOT removed fails at 36 seconds.

I don't think it is a timeout issue. Is there some other way to debug?

@yuhaozhang
Member

I was not able to run the test on a Windows system. I wonder if this is an issue only on Windows. Do you have access to a Linux or macOS system where you can test out the above example?

@legolego
Author

It would take me a while, I don't have good access to a different machine.

@yuhaozhang
Member

Actually, can you try removing parse while keeping depparse and coref? I think on my side, it is parse that's causing the timeout issue, so I wonder if it is also parse causing your protobuf issue.

@legolego
Author

Ok, I tried a few different ways, including starting the server from a command prompt like this:

java -Xms2048m -Xmx5632m -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 120000 -inputFormat 'text' -outputFormat 'json' -be_quiet false -serializer 'edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer' -tokenize.options 'ptb3Escaping=false,invertible=true' -tokenize.language 'en' -annotators tokenize,ssplit,pos,lemma,ner,depparse,coref

The first time I run it after freshly starting the server (but not in subsequent runs), with either text2 or text3, I see this in my DOS window:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.google.protobuf.UnsafeUtil (file:/C:/stanford-corenlp-full-2018-10-05/protobuf.jar) to field java.nio.Buffer.address
WARNING: Please consider reporting this to the maintainers of com.google.protobuf.UnsafeUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

My Java version is:

openjdk 10.0.2 2018-07-17
OpenJDK Runtime Environment 18.3 (build 10.0.2+13)
OpenJDK 64-Bit Server VM 18.3 (build 10.0.2+13, mixed mode)

I changed my code a little to this:

with CoreNLPClient(start_server=False, endpoint="http://localhost:9000") as client:
    # submit the request to the server
    ann = client.annotate(text)

    # get the first sentence
    sentence = ann.sentence[0]

    print("sentence: ", sentence)

    # get the constituency parse of the first sentence
    print('---')
    print('constituency parse of first sentence')
    constituency_parse = sentence.parseTree
    print("const_parse: ", constituency_parse)
    print("finish")

And it finishes without the protobuf error, but it prints nothing after "const_parse: ".
I do have all the tokens in sentence, but upon inspection of sentence.parseTree I see this:

'Traceback (most recent call last):
  File "C:\\Program Files\\JetBrains\\PyCharm 2019.2.2\\helpers\\pydev\\_pydevd_bundle\\pydevd_resolver.py", line 178, in _getPyDictionary
    attr = getattr(var, n)
AttributeError: Extensions
'

@yuhaozhang
Member

For your first trial with the command-line start of the CoreNLP server above: I am no expert on protobuf, but the warning message does suggest that it is an issue with protobuf rather than with the CoreNLP server.

For your second trial with the Python start of the CoreNLP server above: it is expected that you won't see any constituency parse output, because you are not specifying annotators, and the default annotator list does not include parse (see https://github.com/stanfordnlp/stanfordnlp/blob/master/stanfordnlp/server/client.py#L40). And when you then try to inspect sentence.parseTree, you will have trouble accessing it because it was never created in the first place.

With all that said, I think neither of these two is related to the original protobuf decoding error, which I cannot reproduce on macOS or Ubuntu.

@HsiehTommy, when you said you encountered the same error, are you also using a Windows OS?

@legolego
Author

Ok, I tried reinstalling some versions of things. I found a new protobuf.jar here:
https://repo1.maven.org/maven2/com/google/protobuf/protobuf-java/3.9.2/
took protobuf-java-3.9.2.jar, renamed it to protobuf.jar, and placed it in C:\stanford-corenlp-full-2018-10-05; that seems to have gotten rid of the WARNINGs.

I ran the server again after adding parse to the list of annotators, and for text2 I was able to see the parseTree.

I tried reinstalling stanfordnlp with pip, but keep seeing this error:

ERROR: Could not find a version that satisfies the requirement torch>=1.0.0 (from stanfordnlp) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch>=1.0.0 (from stanfordnlp)

Which version of torch does it really want? Right now I have the latest nightly of torch installed, but I still get that error when pip installing stanfordnlp. The only way I could reinstall it was with --no-deps:

python -m pip install --upgrade --force-reinstall --no-deps stanfordnlp

I also manually took the latest versions of all the files in (\venv\Lib\site-packages\stanfordnlp\server and \stanfordnlp\protobuf) from your source code here, but still have the protobuf error; only the line number changed a little (403 -> 432):

Traceback (most recent call last):
  File "C:/gitProjects/patentmoto2/scratch4.py", line 26, in <module>
    ann = client.annotate(text)
  File "C:\gitProjects\patentmoto2\venv\lib\site-packages\stanfordnlp\server\client.py", line 432, in annotate
    parseFromDelimitedString(doc, r.content)
  File "C:\gitProjects\patentmoto2\venv\lib\site-packages\stanfordnlp\protobuf\__init__.py", line 18, in parseFromDelimitedString
    obj.ParseFromString(buf[offset+pos:offset+pos+size])
google.protobuf.message.DecodeError: Error parsing message

Is there something else I can try?

@yuhaozhang
Member

Regarding the stanfordnlp installation: can you open a new conda environment and try installing the package again? Without an isolated environment, it is nearly impossible to tell what went wrong...

Also, why is it saying from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2? The latest version on pip should be 0.2.0. Did you specify any version when you tried to install the package?

@legolego
Author

Ok, I'll try conda soon, but for stanfordnlp, I am installing 0.2.0...

(venv) C:\gitProjects\patentmoto2>python -m pip install --upgrade --force-reinstall --no-deps stanfordnlp
Collecting stanfordnlp
  Using cached https://files.pythonhosted.org/packages/41/bf/5d2898febb6e993fcccd90484cba3c46353658511a41430012e901824e94/stanfordnlp-0.2.0-py3-none-any.whl
Installing collected packages: stanfordnlp
  Found existing installation: stanfordnlp 0.2.0
    Uninstalling stanfordnlp-0.2.0:
      Successfully uninstalled stanfordnlp-0.2.0
Successfully installed stanfordnlp-0.2.0

If I try installing it without --no-deps, I get the errors:

(venv) C:\gitProjects\patentmoto2>python -m pip install --upgrade --force-reinstall stanfordnlp
Collecting stanfordnlp
  Using cached https://files.pythonhosted.org/packages/41/bf/5d2898febb6e993fcccd90484cba3c46353658511a41430012e901824e94/stanfordnlp-0.2.0-py3-none-any.whl
Collecting protobuf
  Using cached https://files.pythonhosted.org/packages/a8/ae/a11b9b0c8e2410b11887881990b71f54ec39b17c4de2b5d850ef66aade8c/protobuf-3.10.0-cp37-cp37m-win_amd64.whl
Collecting tqdm
  Using cached https://files.pythonhosted.org/packages/b9/08/8505f192efc72bfafec79655e1d8351d219e2b80b0dec4ae71f50934c17a/tqdm-4.38.0-py2.py3-none-any.whl
ERROR: Could not find a version that satisfies the requirement torch>=1.0.0 (from stanfordnlp) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch>=1.0.0 (from stanfordnlp)

and I didn't understand which library is complaining with this:
(from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2). Is that a version of stanfordnlp, or of pytorch?

@yuhaozhang
Member

Yes, those versions do look like stanfordnlp versions, not pytorch versions.

This is really strange. With some quick searching, I found this StackOverflow issue, which suggests that you are the only one with this issue when installing stanfordnlp. However, the good news is that as long as you install torch separately beforehand, things will work fine.

And then I found this issue, which suggests that certain versions of torch are not available for Windows on PyPI, which could be causing this. And then if you go to the pytorch website, their suggested pip installation for Windows is pip3 install torch===1.3.1 torchvision===0.4.2 -f https://download.pytorch.org/whl/torch_stable.html, which further confirms this hypothesis.

So the current solution for Windows systems is: 1) install pytorch first following the official instructions; 2) install stanfordnlp.

@yuhaozhang
Member

I think going forward one way to solve this for Windows users is to have official conda distributions, given that pytorch has better Windows support on conda.

@legolego
Author

I'll try the conda install here shortly with a new Python project. My current non-conda project worked previously on a different Windows computer, but I've had problems migrating to a new computer, and also to the new stanfordnlp library from the corenlp Python library I was using before... pytorch is a new requirement that wasn't there before. I'll let you know how it goes, thank you!

@legolego
Author

I tried conda and have the same protobuf error with text3, but not text2...

starting up Java Stanford CoreNLP Server...
Traceback (most recent call last):
  File "C:/gitProjects/patentmoto3/scratch01.py", line 26, in <module>
    ann = client.annotate(text)
  File "C:\Users\Oleg\Anaconda3\envs\patentmoto3\lib\site-packages\stanfordnlp\server\client.py", line 403, in annotate
    parseFromDelimitedString(doc, r.content)
  File "C:\Users\Oleg\Anaconda3\envs\patentmoto3\lib\site-packages\stanfordnlp\protobuf\__init__.py", line 18, in parseFromDelimitedString
    obj.ParseFromString(buf[offset+pos:offset+pos+size])
google.protobuf.message.DecodeError: Error parsing message

Process finished with exit code 1

@legolego
Author

I just tried the other python-corenlp library (https://github.com/stanfordnlp/python-stanford-corenlp) and I'm getting a similar message. This is in the new project with conda.

With these changes from before (https://github.com/stanfordnlp/stanfordnlp/issues/154#issuecomment-553713985):

#from stanfordnlp.server import CoreNLPClient
import corenlp

with corenlp.CoreNLPClient(start_server=False, endpoint="http://localhost:9000", annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'depparse', 'coref', 'parse']) as client:

and everything else the same, my error is now:

Traceback (most recent call last):
  File "C:/gitProjects/patentmoto3/scratch01.py", line 26, in <module>
    ann = client.annotate(text)
  File "C:\Users\Oleg\Anaconda3\envs\patentmoto3\lib\site-packages\corenlp\client.py", line 229, in annotate
    parseFromDelimitedString(doc, r.content)
  File "C:\Users\Oleg\Anaconda3\envs\patentmoto3\lib\site-packages\corenlp_protobuf\__init__.py", line 18, in parseFromDelimitedString
    obj.ParseFromString(buf[offset+pos:offset+pos+size])
google.protobuf.message.DecodeError: Error parsing message

Is there something in the formatting of my text maybe? It's strange both libraries give me basically the same error... Where else can I look?

@J38
Collaborator

J38 commented Nov 17, 2019

If you start the client with be_quiet=False, what is the output from the server?
Do I understand correctly that it works fine for text2 but not text3?

@J38
Collaborator

J38 commented Nov 17, 2019

Also, it's possible you are running out of RAM if you are running a constituency parse on a ridiculously long sentence like that.

@legolego
Author

Thank you for replying! I don't get anything obvious with be_quiet=False. Just this:
[pool-1-thread-11] INFO CoreNLP - [/0:0:0:0:0:0:0:1:54431] API call w/annotators tokenize,ssplit,pos,lemma,ner,parse,depparse,coref
followed by the text I sent in.

And yes, text2 is good, but it fails for text3. I'm analyzing patent text and those sentences can be very, very long unfortunately. I have 16GB of memory on my laptop. I started the server with:

c:\stanford-corenlp-full-2018-10-05>java -Xms2048m -Xmx5632m -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 120000 -inputFormat 'text' -outputFormat 'json' -be_quiet false -serializer 'edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer' -tokenize.options 'ptb3Escaping=false,invertible=true' -tokenize.language 'en' -annotators tokenize,ssplit,pos,lemma,ner,parse,depparse,coref

The client line in python is:
with CoreNLPClient(start_server=False, be_quiet=False, endpoint="http://localhost:9000", annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'depparse', 'coref', 'parse']) as client:
I also made a smaller text string that also fails.
text4 = "in said housing for light rays, which are received opening scene to be path to said path at a position a plurality of lenticular elements of width P which face in the having a facing and for the being in that it, sides of an aperture movable transversely said path at said position, from said first side to said for said light rays sequentially to the material moving across said screen time for and synchronously moving said screen, substantially throughout said predetermined time for exposure, in the direction as the rays sequentially expose said photosensitive material and over a distance substantially equal to the long and his."

@J38
Collaborator

J38 commented Nov 17, 2019

Does it fail if you remove parse? It would be helpful to find the smallest annotator set that causes the failure. That is, will that sentence work if you just use tokenize,ssplit,pos? Also, I don't know that this really matters, but conventionally we would use the order tokenize,ssplit,pos,lemma,ner,parse,depparse,coref... I'm not sure if that is a factor at all. If you could run some experiments on which annotators actually cause this error, that would be helpful.

@legolego
Author

Sure, it looks like the parse annotator is the one then. If I remove just that one from both the command to start the server and the client line in Python, the code runs with no error, but nothing prints following const_parse: (though finish does print). Examining the value of sentence.parseTree shows this:

Traceback (most recent call last):
  File "C:\\Program Files\\JetBrains\\PyCharm 2019.2.2\\helpers\\pydev\\_pydevd_bundle\\pydevd_resolver.py", line 178, in _getPyDictionary
    attr = getattr(var, n)
AttributeError: Extensions

If I start the server with:
c:\stanford-corenlp-full-2018-10-05>java -Xms2048m -Xmx8024m -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 120000 -inputFormat 'text' -outputFormat 'json' -be_quiet false -serializer 'edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer' -tokenize.options 'ptb3Escaping=false,invertible=true' -tokenize.language 'en' -annotators parse

I get this output (with tokenize, ssplit, pos annotators starting too, not sure why):
[main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
[main] INFO CoreNLP - setting default constituency parser
[main] INFO CoreNLP - using SR parser: edu/stanford/nlp/models/srparser/englishSR.ser.gz
[main] INFO CoreNLP - Threads: 12
[main] INFO CoreNLP - Starting server...
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000
[pool-1-thread-3] INFO CoreNLP - [/0:0:0:0:0:0:0:1:64097] API call w/annotators tokenize,ssplit,pos,parse
in said housing for light rays, which are received opening scene to be path to said path at a position a plurality of lenticular elements of width P which face in the having a facing and for the being in that it, sides of an aperture movable transversely said path at said position, from said first side to said for said light rays sequentially to the material moving across said screen time for and synchronously moving said screen, substantially throughout said predetermined time for exposure, in the direction as the rays sequentially expose said photosensitive material and over a distance substantially equal to the long and his.
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[pool-1-thread-3] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.5 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-1-thread-3] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/srparser/englishSR.ser.gz ... done [4.1 sec].

My code is as follows. text4 fails and text5 completes; the only difference is the first word 'in'. With text4 and this code I get the protobuf error.

from stanfordnlp.server import CoreNLPClient
#import corenlp
import sys
# https://stanfordnlp.github.io/stanfordnlp/corenlp_client.html
# example text
print(sys.version)
print('---')
print('input text')
print('')

text = "Chris Manning is a nice person. Chris wrote a simple sentence. He also gives oranges to people."

text4 = "in said housing for light rays, which are received opening scene to be path to said path at a position a plurality of lenticular elements of width P which face in the having a facing and for the being in that it, sides of an aperture movable transversely said path at said position, from said first side to said for said light rays sequentially to the material moving across said screen time for and synchronously moving said screen, substantially throughout said predetermined time for exposure, in the direction as the rays sequentially expose said photosensitive material and over a distance substantially equal to the long and his."
text5 = "said housing for light rays, which are received opening scene to be path to said path at a position a plurality of lenticular elements of width P which face in the having a facing and for the being in that it, sides of an aperture movable transversely said path at said position, from said first side to said for said light rays sequentially to the material moving across said screen time for and synchronously moving said screen, substantially throughout said predetermined time for exposure, in the direction as the rays sequentially expose said photosensitive material and over a distance substantially equal to the long and his."


text = text4
print(text)

# set up the client
print('---')
print('starting up Java Stanford CoreNLP Server...')
# annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse', 'depparse', 'coref'],
#                    timeout=70000, memory='16G', threads=10, be_quiet=False
# set up the client
with CoreNLPClient(start_server=False, be_quiet=False, endpoint="http://localhost:9000", annotators=['parse']) as client:
    # submit the request to the server
    ann = client.annotate(text)

    # get the first sentence
    sentence = ann.sentence[0]

    print("sentence: ", sentence)

    # get the constituency parse of the first sentence
    print('---')
    print('constituency parse of first sentence')
    constituency_parse = sentence.parseTree
    print("const_parse: ", constituency_parse)
    print("finish")

What is the correct way to specify the annotators: from the DOS window when starting the server, or in the client line in the code?

@yuhaozhang
Member

Since parse requires tokenize, ssplit and pos as dependencies, the CoreNLP server will automatically turn them on for you. So it is normal to see that tokenize, ssplit and pos are added when parse is supplied.
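As a toy illustration of that dependency expansion (the dependency map below is a simplification I'm assuming for the example, not CoreNLP's actual server-side resolution logic or its full annotator set):

```python
# Hypothetical, simplified annotator dependency map for illustration only;
# CoreNLP resolves the real dependencies on the server side.
DEPS = {
    "parse": ["tokenize", "ssplit", "pos"],
    "depparse": ["tokenize", "ssplit", "pos"],
    "ner": ["tokenize", "ssplit", "pos", "lemma"],
}

def expand(annotators):
    """Return the requested annotators plus their prerequisites, in order,
    without duplicates."""
    out = []
    for name in annotators:
        for dep in DEPS.get(name, []) + [name]:
            if dep not in out:
                out.append(dep)
    return out

print(expand(["parse"]))  # ['tokenize', 'ssplit', 'pos', 'parse']
```

This is why supplying annotators=['parse'] alone still produces the log line "API call w/annotators tokenize,ssplit,pos,parse" seen earlier in the thread.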

Starting the CoreNLP server on the command line and in the Python CoreNLPClient initialization are equivalent: the underlying Python init function essentially makes a system call to start the CoreNLP server, so I wouldn't worry about that.

For your given example above, again weirdly I was not able to reproduce the issue on either macOS or Linux. I am asking a colleague to test it on Windows for me and will get back if he can reproduce the issue.

My current guess is that there was nothing wrong with the Python-end protobuf call, but that the CoreNLP server somehow messed up the output serialized string on a long input sentence, and when the Python-end protobuf tried to decode the serialized string it was unable to do so. Do you still see the same protobuf error if you change the first word "in" in text4 into some other word?
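One way to probe that hypothesis from the Python side is to check whether the raw response is shorter than its length prefix claims, i.e. truncated in transit. A debugging sketch, assuming the same varint framing that parseFromDelimitedString uses (both helpers here are hand-rolled for illustration, not part of the library):

```python
def read_varint(buf, pos=0):
    """Decode a base-128 varint; returns (value, position_after_varint)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7

def payload_is_truncated(raw):
    """True if the buffer holds fewer bytes than its length prefix claims."""
    size, pos = read_varint(raw, 0)
    return len(raw) - pos < size

# A frame claiming 5 bytes but carrying only 3 would be truncated:
print(payload_is_truncated(b"\x05abc"))  # True
print(payload_is_truncated(b"\x03abc"))  # False
```

Running a check like this on r.content just before the failing parseFromDelimitedString call would distinguish a truncated server response from a payload that is the right length but corrupt.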

@AngledLuffa
Collaborator

I'm not seeing the same error on Windows. In a clean download of corenlp, I ran

java -Xms2048m -Xmx8024m -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 120000 -inputFormat 'text' -outputFormat 'json' -be_quiet false -serializer 'edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer' -tokenize.options 'ptb3Escaping=false,invertible=true' -tokenize.language 'en' -annotators "tokenize, ssplit, pos, parse"

I then copy & pasted your example code and got constituency parses back for both text4 and text5.

My python protobuf is 3.6.1, stanfordnlp 0.2.0, not sure what else would be relevant.

Do you have any further insight on how to trigger a problem?

@AngledLuffa
Collaborator

I upgraded my Python protobuf to 3.10.0 and still got results for the const_parse for both sentences.

@AngledLuffa
Collaborator

Not being there, it's hard to know for certain, but it might be something in your environment. You could try various things, like making a new virtual environment, installing stanfordnlp into that virtualenv, and checking if the problem persists.

@legolego
Author

Thank you again for replying! :) Ok, I tried replacing "in" in text4 with "int", "it", "i", and "bob"; they all worked, and "in" again failed. My results from pip freeze are:

(patentmoto3) C:\gitProjects\patentmoto3>pip freeze
certifi==2019.9.11
cffi==1.13.2
chardet==3.0.4
corenlp-protobuf==3.8.0
idna==2.8
mkl-fft==1.0.15
mkl-random==1.1.0
mkl-service==2.3.0
numpy==1.17.2
olefile==0.46
Pillow==6.2.1
protobuf==3.10.0
pycparser==2.19
requests==2.22.0
scipy==1.3.1
six==1.13.0
stanford-corenlp==3.9.2
stanfordnlp==0.2.0
torch==1.3.1
torchvision==0.4.2
tqdm==4.38.0
urllib3==1.25.7
wincertstore==0.2

It's also weird that both this stanfordnlp library and the other python-stanford-corenlp library fail similarly... what would I look for in my environment that might cause that?
I'll try making a new environment, though I have two now: one with venv and one with conda, and they both fail.

@yuhaozhang
Member

As I said earlier, I think it is also possible that there is nothing wrong with your Python environment, but that something is wrong on the CoreNLP side. I would also suggest that you reinstall CoreNLP and then rerun your example.

@J38
Collaborator

J38 commented Nov 20, 2019

You should make sure the protobuf.jar in your CLASSPATH is the version distributed with Stanford CoreNLP as well.

@AngledLuffa
Collaborator

AngledLuffa commented Nov 20, 2019 via email

@legolego
Author

Thank you for your suggestion to re-download CoreNLP! That wasn't exactly it, but it led me to what I think is the answer :) I downloaded 3.9.2, put it in its own new folder, called it like @AngledLuffa did just above, and it worked! Then I started downloading both supplemental 3.9.2 English model jars from here: https://stanfordnlp.github.io/CoreNLP/ . First the small one (kbp) downloaded and I ran with it, and it worked; then the large one finished, and it failed! Removing the large model jar (not the kbp one; stanford-english-corenlp-2018-10-05-models.jar) fixed it! It works now in both the conda and venv environments. Thank you very much!
Is there something I'm doing wrong with the models, apart from putting them into the same directory as CoreNLP? Could you let me know if the Mac and Linux versions fail here too? Any idea when a new compiled version will be available to download? Thank you!

@AngledLuffa
Collaborator

AngledLuffa commented Nov 20, 2019 via email

@legolego
Author

Sure... this is the output without either model jar:

c:\sstanford-corenlp-full-2018-10-05>java -Xms2048m -Xmx8024m -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 120000 -inputFormat 'text' -outputFormat 'json' -be_quiet false -serializer 'edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer' -tokenize.options 'ptb3Escaping=false,invertible=true' -tokenize.language 'en' -annotators "tokenize, ssplit, pos, parse"
[main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
[main] INFO CoreNLP - setting default constituency parser
[main] INFO CoreNLP - warning: cannot find edu/stanford/nlp/models/srparser/englishSR.ser.gz
[main] INFO CoreNLP - using: edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz instead
[main] INFO CoreNLP - to use shift reduce parser download English models jar from:
[main] INFO CoreNLP - http://stanfordnlp.github.io/CoreNLP/download.html
[main] INFO CoreNLP -     Threads: 12
[main] INFO CoreNLP - Starting server...
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000
[pool-1-thread-3] INFO CoreNLP - [/0:0:0:0:0:0:0:1:62698] API call w/annotators tokenize,ssplit,pos,parse
in said housing for light rays, which are received opening scene to be path to said path at a position a plurality of lenticular elements of width P which face in the having a facing and for the being in that it, sides of an aperture movable transversely said path at said position, from said first side to said for said light rays sequentially to the material moving across said screen time for and synchronously moving said screen, substantially throughout said predetermined time for exposure, in the direction as the rays sequentially expose said photosensitive material and over a distance substantially equal to the long and his.
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[pool-1-thread-3] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.5 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-1-thread-3] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.4 sec].

and the output with the non-kbp model added to the directory:

[main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
[main] INFO CoreNLP - setting default constituency parser
[main] INFO CoreNLP - using SR parser: edu/stanford/nlp/models/srparser/englishSR.ser.gz
[main] INFO CoreNLP -     Threads: 12
[main] INFO CoreNLP - Starting server...
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000
[pool-1-thread-3] INFO CoreNLP - [/0:0:0:0:0:0:0:1:62743] API call w/annotators tokenize,ssplit,pos,parse
in said housing for light rays, which are received opening scene to be path to said path at a position a plurality of lenticular elements of width P which face in the having a facing and for the being in that it, sides of an aperture movable transversely said path at said position, from said first side to said for said light rays sequentially to the material moving across said screen time for and synchronously moving said screen, substantially throughout said predetermined time for exposure, in the direction as the rays sequentially expose said photosensitive material and over a distance substantially equal to the long and his.
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[pool-1-thread-3] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.5 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-1-thread-3] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/srparser/englishSR.ser.gz ... done [4.2 sec].

There's nothing obvious to me in there...

@J38
Collaborator

J38 commented Nov 20, 2019

Different parser models are used in the two cases, so the error occurs with the englishSR parser but not with englishPCFG.

@AngledLuffa
Collaborator

AngledLuffa commented Nov 20, 2019 via email

@AngledLuffa
Collaborator

Progress! I can recreate the error on my end. Now I guess I should try to fix it

@legolego
Author

Yay! I'm glad I wasn't making it up! :D

@AngledLuffa
Collaborator

AngledLuffa commented Nov 22, 2019 via email

@AngledLuffa
Collaborator

AngledLuffa commented Nov 23, 2019 via email

@legolego
Author

Ok, that worked for me too, thank you for all your help! :)

@yuhaozhang
Member

FYI, I've now made a simple fix at stanfordnlp/stanfordnlp@a55953f. The client code will now catch this DecodeError when it happens, throw a user warning for it and output an empty document object. The bottom line is that a decoding error on a single sentence should not crash the entire program.

Since this is in the dev branch, it will become available in the next release.
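The pattern in that fix (illustrative sketch only — see the linked commit stanfordnlp/stanfordnlp@a55953f for the actual client code; the function and argument names here are hypothetical) is to wrap the protobuf parse so a `DecodeError` degrades to a warning plus an empty document instead of an uncaught exception:

```python
import warnings

try:
    from google.protobuf.message import DecodeError
except ImportError:  # keep the sketch runnable even without protobuf installed
    class DecodeError(Exception):
        pass

def annotate_safely(parse_bytes, raw, empty_doc_factory):
    """Parse a serialized document, falling back to an empty document
    (with a UserWarning) if the bytes cannot be decoded."""
    try:
        return parse_bytes(raw)
    except DecodeError:
        warnings.warn("could not decode CoreNLP response; returning an empty document")
        return empty_doc_factory()
```

The caller still gets a document object either way, so one undecodable sentence no longer crashes a long batch run.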

@AngledLuffa
Collaborator

On the java side, we're going to flatten any trees which have more than 80 layers (trees like this are basically going to be useless anyway). We could in theory change the wire format, but from what I understand, other people have written modules which process the current wire format, so we're more or less locked into this format for now. This will be available in the next release of corenlp.
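To illustrate what "flattening" means here (a Python sketch of the idea, not CoreNLP's actual Java implementation — the tree representation and function names are invented for this example): once a subtree would push the parse past the depth cap, its internal structure is discarded and all of its leaves are hung directly off one node, so the leaf text survives even though the bracketing is lost.

```python
MAX_DEPTH = 80  # the cutoff mentioned above

def leaves(tree):
    """Collect the leaf strings of a (label, children) tuple tree."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    out = []
    for child in children:
        out.extend(leaves(child))
    return out

def depth(tree):
    """Depth of the tree; a bare leaf counts as depth 1."""
    if isinstance(tree, str):
        return 1
    _, children = tree
    return 1 + max(depth(c) for c in children)

def flatten_deep(tree, budget=MAX_DEPTH):
    """Copy the tree, flattening any subtree that would exceed the depth budget."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    if budget <= 2 and depth(tree) > 2:
        # Out of budget: hang all the leaves directly off this node.
        return (label, leaves(tree))
    return (label, [flatten_deep(c, budget - 1) for c in children])
```

Shallow trees pass through unchanged; only pathologically deep ones (like the parse the SR parser produced for the sentence above) get truncated, which keeps the serialized message within what the protobuf decoder will accept.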
